Index: linux-2.6-tip/Documentation/ABI/testing/debugfs-kmemtrace =================================================================== --- /dev/null +++ linux-2.6-tip/Documentation/ABI/testing/debugfs-kmemtrace @@ -0,0 +1,71 @@ +What: /sys/kernel/debug/kmemtrace/ +Date: July 2008 +Contact: Eduard - Gabriel Munteanu +Description: + +In kmemtrace-enabled kernels, the following files are created: + +/sys/kernel/debug/kmemtrace/ + cpu (0400) Per-CPU tracing data, see below. (binary) + total_overruns (0400) Total number of bytes which were dropped from + cpu files because of full buffer condition, + non-binary. (text) + abi_version (0400) Kernel's kmemtrace ABI version. (text) + +Each per-CPU file should be read according to the relay interface. That is, +the reader should set affinity to that specific CPU and, as currently done by +the userspace application (though there are other methods), use poll() with +an infinite timeout before every read(). Otherwise, erroneous data may be +read. The binary data has the following _core_ format: + + Event ID (1 byte) Unsigned integer, one of: + 0 - represents an allocation (KMEMTRACE_EVENT_ALLOC) + 1 - represents a freeing of previously allocated memory + (KMEMTRACE_EVENT_FREE) + Type ID (1 byte) Unsigned integer, one of: + 0 - this is a kmalloc() / kfree() + 1 - this is a kmem_cache_alloc() / kmem_cache_free() + 2 - this is a __get_free_pages() et al. + Event size (2 bytes) Unsigned integer representing the + size of this event. Used to extend + kmemtrace. Discard the bytes you + don't know about. + Sequence number (4 bytes) Signed integer used to reorder data + logged on SMP machines. Wraparound + must be taken into account, although + it is unlikely. + Caller address (8 bytes) Return address to the caller. + Pointer to mem (8 bytes) Pointer to target memory area. Can be + NULL, but not all such calls might be + recorded. + +In case of KMEMTRACE_EVENT_ALLOC events, the next fields follow: + + Requested bytes (8 bytes) Total number of requested bytes, + unsigned, must not be zero. + Allocated bytes (8 bytes) Total number of actually allocated + bytes, unsigned, must not be lower + than requested bytes. + Requested flags (4 bytes) GFP flags supplied by the caller. + Target CPU (4 bytes) Signed integer, valid for event id 1. + If equal to -1, target CPU is the same + as origin CPU, but the reverse might + not be true. + +The data is made available in the same endianness the machine has. + +Other event ids and type ids may be defined and added. Other fields may be +added by increasing event size, but see below for details. +Every modification to the ABI, including new id definitions, are followed +by bumping the ABI version by one. + +Adding new data to the packet (features) is done at the end of the mandatory +data: + Feature size (2 byte) + Feature ID (1 byte) + Feature data (Feature size - 3 bytes) + + +Users: + kmemtrace-user - git://repo.or.cz/kmemtrace-user.git + Index: linux-2.6-tip/Documentation/cputopology.txt =================================================================== --- linux-2.6-tip.orig/Documentation/cputopology.txt +++ linux-2.6-tip/Documentation/cputopology.txt @@ -18,11 +18,11 @@ For an architecture to support this feat these macros in include/asm-XXX/topology.h: #define topology_physical_package_id(cpu) #define topology_core_id(cpu) -#define topology_thread_siblings(cpu) -#define topology_core_siblings(cpu) +#define topology_thread_cpumask(cpu) +#define topology_core_cpumask(cpu) The type of **_id is int. -The type of siblings is cpumask_t. +The type of siblings is (const) struct cpumask *. To be consistent on all architectures, include/linux/topology.h provides default definitions for any of the above macros that are Index: linux-2.6-tip/Documentation/ftrace.txt =================================================================== --- linux-2.6-tip.orig/Documentation/ftrace.txt +++ linux-2.6-tip/Documentation/ftrace.txt @@ -15,31 +15,31 @@ Introduction Ftrace is an internal tracer designed to help out developers and designers of systems to find what is going on inside the kernel. -It can be used for debugging or analyzing latencies and performance -issues that take place outside of user-space. +It can be used for debugging or analyzing latencies and +performance issues that take place outside of user-space. Although ftrace is the function tracer, it also includes an -infrastructure that allows for other types of tracing. Some of the -tracers that are currently in ftrace include a tracer to trace -context switches, the time it takes for a high priority task to -run after it was woken up, the time interrupts are disabled, and -more (ftrace allows for tracer plugins, which means that the list of -tracers can always grow). +infrastructure that allows for other types of tracing. Some of +the tracers that are currently in ftrace include a tracer to +trace context switches, the time it takes for a high priority +task to run after it was woken up, the time interrupts are +disabled, and more (ftrace allows for tracer plugins, which +means that the list of tracers can always grow). The File System --------------- -Ftrace uses the debugfs file system to hold the control files as well -as the files to display output. +Ftrace uses the debugfs file system to hold the control files as +well as the files to display output. To mount the debugfs system: # mkdir /debug # mount -t debugfs nodev /debug -(Note: it is more common to mount at /sys/kernel/debug, but for simplicity - this document will use /debug) +( Note: it is more common to mount at /sys/kernel/debug, but for + simplicity this document will use /debug) That's it! (assuming that you have ftrace configured into your kernel) @@ -50,90 +50,124 @@ of ftrace. Here is a list of some of the Note: all time values are in microseconds. - current_tracer: This is used to set or display the current tracer - that is configured. + current_tracer: - available_tracers: This holds the different types of tracers that - have been compiled into the kernel. The tracers - listed here can be configured by echoing their name - into current_tracer. - - tracing_enabled: This sets or displays whether the current_tracer - is activated and tracing or not. Echo 0 into this - file to disable the tracer or 1 to enable it. - - trace: This file holds the output of the trace in a human readable - format (described below). - - latency_trace: This file shows the same trace but the information - is organized more to display possible latencies - in the system (described below). - - trace_pipe: The output is the same as the "trace" file but this - file is meant to be streamed with live tracing. - Reads from this file will block until new data - is retrieved. Unlike the "trace" and "latency_trace" - files, this file is a consumer. This means reading - from this file causes sequential reads to display - more current data. Once data is read from this - file, it is consumed, and will not be read - again with a sequential read. The "trace" and - "latency_trace" files are static, and if the - tracer is not adding more data, they will display - the same information every time they are read. - - trace_options: This file lets the user control the amount of data - that is displayed in one of the above output - files. - - trace_max_latency: Some of the tracers record the max latency. - For example, the time interrupts are disabled. - This time is saved in this file. The max trace - will also be stored, and displayed by either - "trace" or "latency_trace". A new max trace will - only be recorded if the latency is greater than - the value in this file. (in microseconds) - - buffer_size_kb: This sets or displays the number of kilobytes each CPU - buffer can hold. The tracer buffers are the same size - for each CPU. The displayed number is the size of the - CPU buffer and not total size of all buffers. The - trace buffers are allocated in pages (blocks of memory - that the kernel uses for allocation, usually 4 KB in size). - If the last page allocated has room for more bytes - than requested, the rest of the page will be used, - making the actual allocation bigger than requested. - (Note, the size may not be a multiple of the page size due - to buffer managment overhead.) - - This can only be updated when the current_tracer - is set to "nop". - - tracing_cpumask: This is a mask that lets the user only trace - on specified CPUS. The format is a hex string - representing the CPUS. - - set_ftrace_filter: When dynamic ftrace is configured in (see the - section below "dynamic ftrace"), the code is dynamically - modified (code text rewrite) to disable calling of the - function profiler (mcount). This lets tracing be configured - in with practically no overhead in performance. This also - has a side effect of enabling or disabling specific functions - to be traced. Echoing names of functions into this file - will limit the trace to only those functions. - - set_ftrace_notrace: This has an effect opposite to that of - set_ftrace_filter. Any function that is added here will not - be traced. If a function exists in both set_ftrace_filter - and set_ftrace_notrace, the function will _not_ be traced. - - set_ftrace_pid: Have the function tracer only trace a single thread. - - available_filter_functions: This lists the functions that ftrace - has processed and can trace. These are the function - names that you can pass to "set_ftrace_filter" or - "set_ftrace_notrace". (See the section "dynamic ftrace" - below for more details.) + This is used to set or display the current tracer + that is configured. + + available_tracers: + + This holds the different types of tracers that + have been compiled into the kernel. The + tracers listed here can be configured by + echoing their name into current_tracer. + + tracing_enabled: + + This sets or displays whether the current_tracer + is activated and tracing or not. Echo 0 into this + file to disable the tracer or 1 to enable it. + + trace: + + This file holds the output of the trace in a human + readable format (described below). + + latency_trace: + + This file shows the same trace but the information + is organized more to display possible latencies + in the system (described below). + + trace_pipe: + + The output is the same as the "trace" file but this + file is meant to be streamed with live tracing. + Reads from this file will block until new data + is retrieved. Unlike the "trace" and "latency_trace" + files, this file is a consumer. This means reading + from this file causes sequential reads to display + more current data. Once data is read from this + file, it is consumed, and will not be read + again with a sequential read. The "trace" and + "latency_trace" files are static, and if the + tracer is not adding more data, they will display + the same information every time they are read. + + trace_options: + + This file lets the user control the amount of data + that is displayed in one of the above output + files. + + trace_max_latency: + + Some of the tracers record the max latency. + For example, the time interrupts are disabled. + This time is saved in this file. The max trace + will also be stored, and displayed by either + "trace" or "latency_trace". A new max trace will + only be recorded if the latency is greater than + the value in this file. (in microseconds) + + buffer_size_kb: + + This sets or displays the number of kilobytes each CPU + buffer can hold. The tracer buffers are the same size + for each CPU. The displayed number is the size of the + CPU buffer and not total size of all buffers. The + trace buffers are allocated in pages (blocks of memory + that the kernel uses for allocation, usually 4 KB in size). + If the last page allocated has room for more bytes + than requested, the rest of the page will be used, + making the actual allocation bigger than requested. + ( Note, the size may not be a multiple of the page size + due to buffer managment overhead. ) + + This can only be updated when the current_tracer + is set to "nop". + + tracing_cpumask: + + This is a mask that lets the user only trace + on specified CPUS. The format is a hex string + representing the CPUS. + + set_ftrace_filter: + + When dynamic ftrace is configured in (see the + section below "dynamic ftrace"), the code is dynamically + modified (code text rewrite) to disable calling of the + function profiler (mcount). This lets tracing be configured + in with practically no overhead in performance. This also + has a side effect of enabling or disabling specific functions + to be traced. Echoing names of functions into this file + will limit the trace to only those functions. + + set_ftrace_notrace: + + This has an effect opposite to that of + set_ftrace_filter. Any function that is added here will not + be traced. If a function exists in both set_ftrace_filter + and set_ftrace_notrace, the function will _not_ be traced. + + set_ftrace_pid: + + Have the function tracer only trace a single thread. + + set_graph_function: + + Set a "trigger" function where tracing should start + with the function graph tracer (See the section + "dynamic ftrace" for more details). + + available_filter_functions: + + This lists the functions that ftrace + has processed and can trace. These are the function + names that you can pass to "set_ftrace_filter" or + "set_ftrace_notrace". (See the section "dynamic ftrace" + below for more details.) The Tracers @@ -141,36 +175,66 @@ The Tracers Here is the list of current tracers that may be configured. - function - function tracer that uses mcount to trace all functions. + "function" + + Function call tracer to trace all kernel functions. + + "function_graph_tracer" + + Similar to the function tracer except that the + function tracer probes the functions on their entry + whereas the function graph tracer traces on both entry + and exit of the functions. It then provides the ability + to draw a graph of function calls similar to C code + source. + + "sched_switch" + + Traces the context switches and wakeups between tasks. + + "irqsoff" + + Traces the areas that disable interrupts and saves + the trace with the longest max latency. + See tracing_max_latency. When a new max is recorded, + it replaces the old trace. It is best to view this + trace via the latency_trace file. + + "preemptoff" + + Similar to irqsoff but traces and records the amount of + time for which preemption is disabled. - sched_switch - traces the context switches between tasks. + "preemptirqsoff" - irqsoff - traces the areas that disable interrupts and saves - the trace with the longest max latency. - See tracing_max_latency. When a new max is recorded, - it replaces the old trace. It is best to view this - trace via the latency_trace file. + Similar to irqsoff and preemptoff, but traces and + records the largest time for which irqs and/or preemption + is disabled. - preemptoff - Similar to irqsoff but traces and records the amount of - time for which preemption is disabled. + "wakeup" - preemptirqsoff - Similar to irqsoff and preemptoff, but traces and - records the largest time for which irqs and/or preemption - is disabled. + Traces and records the max latency that it takes for + the highest priority task to get scheduled after + it has been woken up. - wakeup - Traces and records the max latency that it takes for - the highest priority task to get scheduled after - it has been woken up. + "hw-branch-tracer" - nop - This is not a tracer. To remove all tracers from tracing - simply echo "nop" into current_tracer. + Uses the BTS CPU feature on x86 CPUs to traces all + branches executed. + + "nop" + + This is the "trace nothing" tracer. To remove all + tracers from tracing simply echo "nop" into + current_tracer. Examples of using the tracer ---------------------------- -Here are typical examples of using the tracers when controlling them only -with the debugfs interface (without using any user-land utilities). +Here are typical examples of using the tracers when controlling +them only with the debugfs interface (without using any +user-land utilities). Output format: -------------- @@ -187,16 +251,16 @@ Here is an example of the output format bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput -------- -A header is printed with the tracer name that is represented by the trace. -In this case the tracer is "function". Then a header showing the format. Task -name "bash", the task PID "4251", the CPU that it was running on -"01", the timestamp in . format, the function name that was -traced "path_put" and the parent function that called this function -"path_walk". The timestamp is the time at which the function was -entered. +A header is printed with the tracer name that is represented by +the trace. In this case the tracer is "function". Then a header +showing the format. Task name "bash", the task PID "4251", the +CPU that it was running on "01", the timestamp in . +format, the function name that was traced "path_put" and the +parent function that called this function "path_walk". The +timestamp is the time at which the function was entered. -The sched_switch tracer also includes tracing of task wakeups and -context switches. +The sched_switch tracer also includes tracing of task wakeups +and context switches. ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 2916:115:S ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 10:115:S @@ -205,8 +269,8 @@ context switches. kondemand/1-2916 [01] 1453.070013: 2916:115:S ==> 7:115:R ksoftirqd/1-7 [01] 1453.070013: 7:115:S ==> 0:140:R -Wake ups are represented by a "+" and the context switches are shown as -"==>". The format is: +Wake ups are represented by a "+" and the context switches are +shown as "==>". The format is: Context switches: @@ -220,19 +284,20 @@ Wake ups are represented by a "+" and th :: + :: -The prio is the internal kernel priority, which is the inverse of the -priority that is usually displayed by user-space tools. Zero represents -the highest priority (99). Prio 100 starts the "nice" priorities with -100 being equal to nice -20 and 139 being nice 19. The prio "140" is -reserved for the idle task which is the lowest priority thread (pid 0). +The prio is the internal kernel priority, which is the inverse +of the priority that is usually displayed by user-space tools. +Zero represents the highest priority (99). Prio 100 starts the +"nice" priorities with 100 being equal to nice -20 and 139 being +nice 19. The prio "140" is reserved for the idle task which is +the lowest priority thread (pid 0). Latency trace format -------------------- -For traces that display latency times, the latency_trace file gives -somewhat more information to see why a latency happened. Here is a typical -trace. +For traces that display latency times, the latency_trace file +gives somewhat more information to see why a latency happened. +Here is a typical trace. # tracer: irqsoff # @@ -259,20 +324,20 @@ irqsoff latency trace v1.1.5 on 2.6.26-r -0 0d.s1 98us : trace_hardirqs_on (do_softirq) +This shows that the current tracer is "irqsoff" tracing the time +for which interrupts were disabled. It gives the trace version +and the version of the kernel upon which this was executed on +(2.6.26-rc8). Then it displays the max latency in microsecs (97 +us). The number of trace entries displayed and the total number +recorded (both are three: #3/3). The type of preemption that was +used (PREEMPT). VP, KP, SP, and HP are always zero and are +reserved for later use. #P is the number of online CPUS (#P:2). -This shows that the current tracer is "irqsoff" tracing the time for which -interrupts were disabled. It gives the trace version and the version -of the kernel upon which this was executed on (2.6.26-rc8). Then it displays -the max latency in microsecs (97 us). The number of trace entries displayed -and the total number recorded (both are three: #3/3). The type of -preemption that was used (PREEMPT). VP, KP, SP, and HP are always zero -and are reserved for later use. #P is the number of online CPUS (#P:2). - -The task is the process that was running when the latency occurred. -(swapper pid: 0). +The task is the process that was running when the latency +occurred. (swapper pid: 0). -The start and stop (the functions in which the interrupts were disabled and -enabled respectively) that caused the latencies: +The start and stop (the functions in which the interrupts were +disabled and enabled respectively) that caused the latencies: apic_timer_interrupt is where the interrupts were disabled. do_softirq is where they were enabled again. @@ -308,12 +373,12 @@ The above is mostly meaningful for kerne latency_trace file is relative to the start of the trace. delay: This is just to help catch your eye a bit better. And - needs to be fixed to be only relative to the same CPU. - The marks are determined by the difference between this - current trace and the next trace. - '!' - greater than preempt_mark_thresh (default 100) - '+' - greater than 1 microsecond - ' ' - less than or equal to 1 microsecond. + needs to be fixed to be only relative to the same CPU. + The marks are determined by the difference between this + current trace and the next trace. + '!' - greater than preempt_mark_thresh (default 100) + '+' - greater than 1 microsecond + ' ' - less than or equal to 1 microsecond. The rest is the same as the 'trace' file. @@ -321,14 +386,15 @@ The above is mostly meaningful for kerne trace_options ------------- -The trace_options file is used to control what gets printed in the trace -output. To see what is available, simply cat the file: +The trace_options file is used to control what gets printed in +the trace output. To see what is available, simply cat the file: cat /debug/tracing/trace_options print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ - noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj + noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj -To disable one of the options, echo in the option prepended with "no". +To disable one of the options, echo in the option prepended with +"no". echo noprint-parent > /debug/tracing/trace_options @@ -338,8 +404,8 @@ To enable an option, leave off the "no". Here are the available options: - print-parent - On function traces, display the calling function - as well as the function being traced. + print-parent - On function traces, display the calling (parent) + function as well as the function being traced. print-parent: bash-4000 [01] 1477.606694: simple_strtoul <-strict_strtoul @@ -348,15 +414,16 @@ Here are the available options: bash-4000 [01] 1477.606694: simple_strtoul - sym-offset - Display not only the function name, but also the offset - in the function. For example, instead of seeing just - "ktime_get", you will see "ktime_get+0xb/0x20". + sym-offset - Display not only the function name, but also the + offset in the function. For example, instead of + seeing just "ktime_get", you will see + "ktime_get+0xb/0x20". sym-offset: bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0 - sym-addr - this will also display the function address as well as - the function name. + sym-addr - this will also display the function address as well + as the function name. sym-addr: bash-4000 [01] 1477.606694: simple_strtoul @@ -366,35 +433,41 @@ Here are the available options: bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \ (+0.000ms): simple_strtoul (strict_strtoul) - raw - This will display raw numbers. This option is best for use with - user applications that can translate the raw numbers better than - having it done in the kernel. + raw - This will display raw numbers. This option is best for + use with user applications that can translate the raw + numbers better than having it done in the kernel. - hex - Similar to raw, but the numbers will be in a hexadecimal format. + hex - Similar to raw, but the numbers will be in a hexadecimal + format. bin - This will print out the formats in raw binary. block - TBD (needs update) - stacktrace - This is one of the options that changes the trace itself. - When a trace is recorded, so is the stack of functions. - This allows for back traces of trace sites. - - userstacktrace - This option changes the trace. - It records a stacktrace of the current userspace thread. - - sym-userobj - when user stacktrace are enabled, look up which object the - address belongs to, and print a relative address - This is especially useful when ASLR is on, otherwise you don't - get a chance to resolve the address to object/file/line after the app is no - longer running + stacktrace - This is one of the options that changes the trace + itself. When a trace is recorded, so is the stack + of functions. This allows for back traces of + trace sites. + + userstacktrace - This option changes the trace. It records a + stacktrace of the current userspace thread. + + sym-userobj - when user stacktrace are enabled, look up which + object the address belongs to, and print a + relative address. This is especially useful when + ASLR is on, otherwise you don't get a chance to + resolve the address to object/file/line after + the app is no longer running - The lookup is performed when you read trace,trace_pipe,latency_trace. Example: + The lookup is performed when you read + trace,trace_pipe,latency_trace. Example: a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] - sched-tree - TBD (any users??) + sched-tree - trace all tasks that are on the runqueue, at + every scheduling event. Will add overhead if + there's a lot of tasks running at once. sched_switch @@ -431,18 +504,19 @@ of how to use it. [...] -As we have discussed previously about this format, the header shows -the name of the trace and points to the options. The "FUNCTION" -is a misnomer since here it represents the wake ups and context -switches. - -The sched_switch file only lists the wake ups (represented with '+') -and context switches ('==>') with the previous task or current task -first followed by the next task or task waking up. The format for both -of these is PID:KERNEL-PRIO:TASK-STATE. Remember that the KERNEL-PRIO -is the inverse of the actual priority with zero (0) being the highest -priority and the nice values starting at 100 (nice -20). Below is -a quick chart to map the kernel priority to user land priorities. +As we have discussed previously about this format, the header +shows the name of the trace and points to the options. The +"FUNCTION" is a misnomer since here it represents the wake ups +and context switches. + +The sched_switch file only lists the wake ups (represented with +'+') and context switches ('==>') with the previous task or +current task first followed by the next task or task waking up. +The format for both of these is PID:KERNEL-PRIO:TASK-STATE. +Remember that the KERNEL-PRIO is the inverse of the actual +priority with zero (0) being the highest priority and the nice +values starting at 100 (nice -20). Below is a quick chart to map +the kernel priority to user land priorities. Kernel priority: 0 to 99 ==> user RT priority 99 to 0 Kernel priority: 100 to 139 ==> user nice -20 to 19 @@ -463,10 +537,10 @@ The task states are: ftrace_enabled -------------- -The following tracers (listed below) give different output depending -on whether or not the sysctl ftrace_enabled is set. To set ftrace_enabled, -one can either use the sysctl function or set it via the proc -file system interface. +The following tracers (listed below) give different output +depending on whether or not the sysctl ftrace_enabled is set. To +set ftrace_enabled, one can either use the sysctl function or +set it via the proc file system interface. sysctl kernel.ftrace_enabled=1 @@ -474,12 +548,12 @@ file system interface. echo 1 > /proc/sys/kernel/ftrace_enabled -To disable ftrace_enabled simply replace the '1' with '0' in -the above commands. +To disable ftrace_enabled simply replace the '1' with '0' in the +above commands. -When ftrace_enabled is set the tracers will also record the functions -that are within the trace. The descriptions of the tracers -will also show an example with ftrace enabled. +When ftrace_enabled is set the tracers will also record the +functions that are within the trace. The descriptions of the +tracers will also show an example with ftrace enabled. irqsoff @@ -487,17 +561,18 @@ irqsoff When interrupts are disabled, the CPU can not react to any other external event (besides NMIs and SMIs). This prevents the timer -interrupt from triggering or the mouse interrupt from letting the -kernel know of a new mouse event. The result is a latency with the -reaction time. - -The irqsoff tracer tracks the time for which interrupts are disabled. -When a new maximum latency is hit, the tracer saves the trace leading up -to that latency point so that every time a new maximum is reached, the old -saved trace is discarded and the new trace is saved. +interrupt from triggering or the mouse interrupt from letting +the kernel know of a new mouse event. The result is a latency +with the reaction time. + +The irqsoff tracer tracks the time for which interrupts are +disabled. When a new maximum latency is hit, the tracer saves +the trace leading up to that latency point so that every time a +new maximum is reached, the old saved trace is discarded and the +new trace is saved. -To reset the maximum, echo 0 into tracing_max_latency. Here is an -example: +To reset the maximum, echo 0 into tracing_max_latency. Here is +an example: # echo irqsoff > /debug/tracing/current_tracer # echo 0 > /debug/tracing/tracing_max_latency @@ -532,10 +607,11 @@ irqsoff latency trace v1.1.5 on 2.6.26 Here we see that that we had a latency of 12 microsecs (which is -very good). The _write_lock_irq in sys_setpgid disabled interrupts. -The difference between the 12 and the displayed timestamp 14us occurred -because the clock was incremented between the time of recording the max -latency and the time of recording the function that had that latency. +very good). The _write_lock_irq in sys_setpgid disabled +interrupts. The difference between the 12 and the displayed +timestamp 14us occurred because the clock was incremented +between the time of recording the max latency and the time of +recording the function that had that latency. Note the above example had ftrace_enabled not set. If we set the ftrace_enabled, we get a much larger output: @@ -586,24 +662,24 @@ irqsoff latency trace v1.1.5 on 2.6.26-r Here we traced a 50 microsecond latency. But we also see all the -functions that were called during that time. Note that by enabling -function tracing, we incur an added overhead. This overhead may -extend the latency times. But nevertheless, this trace has provided -some very helpful debugging information. +functions that were called during that time. Note that by +enabling function tracing, we incur an added overhead. This +overhead may extend the latency times. But nevertheless, this +trace has provided some very helpful debugging information. preemptoff ---------- -When preemption is disabled, we may be able to receive interrupts but -the task cannot be preempted and a higher priority task must wait -for preemption to be enabled again before it can preempt a lower -priority task. +When preemption is disabled, we may be able to receive +interrupts but the task cannot be preempted and a higher +priority task must wait for preemption to be enabled again +before it can preempt a lower priority task. The preemptoff tracer traces the places that disable preemption. -Like the irqsoff tracer, it records the maximum latency for which preemption -was disabled. The control of preemptoff tracer is much like the irqsoff -tracer. +Like the irqsoff tracer, it records the maximum latency for +which preemption was disabled. The control of preemptoff tracer +is much like the irqsoff tracer. # echo preemptoff > /debug/tracing/current_tracer # echo 0 > /debug/tracing/tracing_max_latency @@ -637,11 +713,12 @@ preemptoff latency trace v1.1.5 on 2.6.2 sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq) -This has some more changes. Preemption was disabled when an interrupt -came in (notice the 'h'), and was enabled while doing a softirq. -(notice the 's'). But we also see that interrupts have been disabled -when entering the preempt off section and leaving it (the 'd'). -We do not know if interrupts were enabled in the mean time. +This has some more changes. Preemption was disabled when an +interrupt came in (notice the 'h'), and was enabled while doing +a softirq. (notice the 's'). But we also see that interrupts +have been disabled when entering the preempt off section and +leaving it (the 'd'). We do not know if interrupts were enabled +in the mean time. # tracer: preemptoff # @@ -700,28 +777,30 @@ preemptoff latency trace v1.1.5 on 2.6.2 sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq) -The above is an example of the preemptoff trace with ftrace_enabled -set. Here we see that interrupts were disabled the entire time. -The irq_enter code lets us know that we entered an interrupt 'h'. -Before that, the functions being traced still show that it is not -in an interrupt, but we can see from the functions themselves that -this is not the case. - -Notice that __do_softirq when called does not have a preempt_count. -It may seem that we missed a preempt enabling. What really happened -is that the preempt count is held on the thread's stack and we -switched to the softirq stack (4K stacks in effect). The code -does not copy the preempt count, but because interrupts are disabled, -we do not need to worry about it. Having a tracer like this is good -for letting people know what really happens inside the kernel. +The above is an example of the preemptoff trace with +ftrace_enabled set. Here we see that interrupts were disabled +the entire time. The irq_enter code lets us know that we entered +an interrupt 'h'. Before that, the functions being traced still +show that it is not in an interrupt, but we can see from the +functions themselves that this is not the case. + +Notice that __do_softirq when called does not have a +preempt_count. It may seem that we missed a preempt enabling. +What really happened is that the preempt count is held on the +thread's stack and we switched to the softirq stack (4K stacks +in effect). The code does not copy the preempt count, but +because interrupts are disabled, we do not need to worry about +it. Having a tracer like this is good for letting people know +what really happens inside the kernel. preemptirqsoff -------------- -Knowing the locations that have interrupts disabled or preemption -disabled for the longest times is helpful. But sometimes we would -like to know when either preemption and/or interrupts are disabled. +Knowing the locations that have interrupts disabled or +preemption disabled for the longest times is helpful. But +sometimes we would like to know when either preemption and/or +interrupts are disabled. Consider the following code: @@ -741,11 +820,13 @@ The preemptoff tracer will record the to call_function_with_irqs_and_preemption_off() and call_function_with_preemption_off(). -But neither will trace the time that interrupts and/or preemption -is disabled. This total time is the time that we can not schedule. -To record this time, use the preemptirqsoff tracer. +But neither will trace the time that interrupts and/or +preemption is disabled. This total time is the time that we can +not schedule. To record this time, use the preemptirqsoff +tracer. -Again, using this trace is much like the irqsoff and preemptoff tracers. +Again, using this trace is much like the irqsoff and preemptoff +tracers. # echo preemptirqsoff > /debug/tracing/current_tracer # echo 0 > /debug/tracing/tracing_max_latency @@ -781,9 +862,10 @@ preemptirqsoff latency trace v1.1.5 on 2 The trace_hardirqs_off_thunk is called from assembly on x86 when -interrupts are disabled in the assembly code. Without the function -tracing, we do not know if interrupts were enabled within the preemption -points. We do see that it started with preemption enabled. +interrupts are disabled in the assembly code. Without the +function tracing, we do not know if interrupts were enabled +within the preemption points. We do see that it started with +preemption enabled. Here is a trace with ftrace_enabled set: @@ -871,40 +953,42 @@ preemptirqsoff latency trace v1.1.5 on 2 sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq) -This is a very interesting trace. It started with the preemption of -the ls task. We see that the task had the "need_resched" bit set -via the 'N' in the trace. Interrupts were disabled before the spin_lock -at the beginning of the trace. We see that a schedule took place to run -sshd. When the interrupts were enabled, we took an interrupt. -On return from the interrupt handler, the softirq ran. We took another -interrupt while running the softirq as we see from the capital 'H'. +This is a very interesting trace. It started with the preemption +of the ls task. We see that the task had the "need_resched" bit +set via the 'N' in the trace. Interrupts were disabled before +the spin_lock at the beginning of the trace. We see that a +schedule took place to run sshd. When the interrupts were +enabled, we took an interrupt. On return from the interrupt +handler, the softirq ran. We took another interrupt while +running the softirq as we see from the capital 'H'. wakeup ------ -In a Real-Time environment it is very important to know the wakeup -time it takes for the highest priority task that is woken up to the -time that it executes. This is also known as "schedule latency". -I stress the point that this is about RT tasks. It is also important -to know the scheduling latency of non-RT tasks, but the average -schedule latency is better for non-RT tasks. Tools like -LatencyTop are more appropriate for such measurements. +In a Real-Time environment it is very important to know the +wakeup time it takes for the highest priority task that is woken +up to the time that it executes. This is also known as "schedule +latency". I stress the point that this is about RT tasks. It is +also important to know the scheduling latency of non-RT tasks, +but the average schedule latency is better for non-RT tasks. +Tools like LatencyTop are more appropriate for such +measurements. Real-Time environments are interested in the worst case latency. -That is the longest latency it takes for something to happen, and -not the average. We can have a very fast scheduler that may only -have a large latency once in a while, but that would not work well -with Real-Time tasks. The wakeup tracer was designed to record -the worst case wakeups of RT tasks. Non-RT tasks are not recorded -because the tracer only records one worst case and tracing non-RT -tasks that are unpredictable will overwrite the worst case latency -of RT tasks. - -Since this tracer only deals with RT tasks, we will run this slightly -differently than we did with the previous tracers. Instead of performing -an 'ls', we will run 'sleep 1' under 'chrt' which changes the -priority of the task. +That is the longest latency it takes for something to happen, +and not the average. We can have a very fast scheduler that may +only have a large latency once in a while, but that would not +work well with Real-Time tasks. The wakeup tracer was designed +to record the worst case wakeups of RT tasks. Non-RT tasks are +not recorded because the tracer only records one worst case and +tracing non-RT tasks that are unpredictable will overwrite the +worst case latency of RT tasks. + +Since this tracer only deals with RT tasks, we will run this +slightly differently than we did with the previous tracers. +Instead of performing an 'ls', we will run 'sleep 1' under +'chrt' which changes the priority of the task. # echo wakeup > /debug/tracing/current_tracer # echo 0 > /debug/tracing/tracing_max_latency @@ -934,17 +1018,16 @@ wakeup latency trace v1.1.5 on 2.6.26-rc -0 1d..4 4us : schedule (cpu_idle) - -Running this on an idle system, we see that it only took 4 microseconds -to perform the task switch. Note, since the trace marker in the -schedule is before the actual "switch", we stop the tracing when -the recorded task is about to schedule in. This may change if -we add a new marker at the end of the scheduler. - -Notice that the recorded task is 'sleep' with the PID of 4901 and it -has an rt_prio of 5. This priority is user-space priority and not -the internal kernel priority. The policy is 1 for SCHED_FIFO and 2 -for SCHED_RR. +Running this on an idle system, we see that it only took 4 +microseconds to perform the task switch. Note, since the trace +marker in the schedule is before the actual "switch", we stop +the tracing when the recorded task is about to schedule in. This +may change if we add a new marker at the end of the scheduler. + +Notice that the recorded task is 'sleep' with the PID of 4901 +and it has an rt_prio of 5. This priority is user-space priority +and not the internal kernel priority. The policy is 1 for +SCHED_FIFO and 2 for SCHED_RR. Doing the same with chrt -r 5 and ftrace_enabled set. @@ -1001,24 +1084,25 @@ ksoftirq-7 1d..6 49us : _spin_unlo ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock) ksoftirq-7 1d..4 50us : schedule (__cond_resched) -The interrupt went off while running ksoftirqd. This task runs at -SCHED_OTHER. Why did not we see the 'N' set early? This may be -a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K stacks -configured, the interrupt and softirq run with their own stack. -Some information is held on the top of the task's stack (need_resched -and preempt_count are both stored there). The setting of the NEED_RESCHED -bit is done directly to the task's stack, but the reading of the -NEED_RESCHED is done by looking at the current stack, which in this case -is the stack for the hard interrupt. This hides the fact that NEED_RESCHED -has been set. We do not see the 'N' until we switch back to the task's +The interrupt went off while running ksoftirqd. This task runs +at SCHED_OTHER. Why did not we see the 'N' set early? This may +be a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K +stacks configured, the interrupt and softirq run with their own +stack. Some information is held on the top of the task's stack +(need_resched and preempt_count are both stored there). The +setting of the NEED_RESCHED bit is done directly to the task's +stack, but the reading of the NEED_RESCHED is done by looking at +the current stack, which in this case is the stack for the hard +interrupt. This hides the fact that NEED_RESCHED has been set. +We do not see the 'N' until we switch back to the task's assigned stack. function -------- This tracer is the function tracer. Enabling the function tracer -can be done from the debug file system. Make sure the ftrace_enabled is -set; otherwise this tracer is a nop. +can be done from the debug file system. Make sure the +ftrace_enabled is set; otherwise this tracer is a nop. # sysctl kernel.ftrace_enabled=1 # echo function > /debug/tracing/current_tracer @@ -1048,14 +1132,15 @@ set; otherwise this tracer is a nop. [...] -Note: function tracer uses ring buffers to store the above entries. -The newest data may overwrite the oldest data. Sometimes using echo to -stop the trace is not sufficient because the tracing could have overwritten -the data that you wanted to record. For this reason, it is sometimes better to -disable tracing directly from a program. This allows you to stop the -tracing at the point that you hit the part that you are interested in. -To disable the tracing directly from a C program, something like following -code snippet can be used: +Note: function tracer uses ring buffers to store the above +entries. The newest data may overwrite the oldest data. +Sometimes using echo to stop the trace is not sufficient because +the tracing could have overwritten the data that you wanted to +record. For this reason, it is sometimes better to disable +tracing directly from a program. This allows you to stop the +tracing at the point that you hit the part that you are +interested in. To disable the tracing directly from a C program, +something like following code snippet can be used: int trace_fd; [...] @@ -1070,10 +1155,10 @@ int main(int argc, char *argv[]) { } Note: Here we hard coded the path name. The debugfs mount is not -guaranteed to be at /debug (and is more commonly at /sys/kernel/debug). -For simple one time traces, the above is sufficent. For anything else, -a search through /proc/mounts may be needed to find where the debugfs -file-system is mounted. +guaranteed to be at /debug (and is more commonly at +/sys/kernel/debug). For simple one time traces, the above is +sufficent. For anything else, a search through /proc/mounts may +be needed to find where the debugfs file-system is mounted. Single thread tracing @@ -1152,49 +1237,297 @@ int main (int argc, char **argv) return 0; } + +hw-branch-tracer (x86 only) +--------------------------- + +This tracer uses the x86 last branch tracing hardware feature to +collect a branch trace on all cpus with relatively low overhead. + +The tracer uses a fixed-size circular buffer per cpu and only +traces ring 0 branches. The trace file dumps that buffer in the +following format: + +# tracer: hw-branch-tracer +# +# CPU# TO <- FROM + 0 scheduler_tick+0xb5/0x1bf <- task_tick_idle+0x5/0x6 + 2 run_posix_cpu_timers+0x2b/0x72a <- run_posix_cpu_timers+0x25/0x72a + 0 scheduler_tick+0x139/0x1bf <- scheduler_tick+0xed/0x1bf + 0 scheduler_tick+0x17c/0x1bf <- scheduler_tick+0x148/0x1bf + 2 run_posix_cpu_timers+0x9e/0x72a <- run_posix_cpu_timers+0x5e/0x72a + 0 scheduler_tick+0x1b6/0x1bf <- scheduler_tick+0x1aa/0x1bf + + +The tracer may be used to dump the trace for the oops'ing cpu on +a kernel oops into the system log. To enable this, +ftrace_dump_on_oops must be set. To set ftrace_dump_on_oops, one +can either use the sysctl function or set it via the proc system +interface. + + sysctl kernel.ftrace_dump_on_oops=1 + +or + + echo 1 > /proc/sys/kernel/ftrace_dump_on_oops + + +Here's an example of such a dump after a null pointer +dereference in a kernel module: + +[57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 +[57848.106019] IP: [] open+0x6/0x14 [oops] +[57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0 +[57848.106019] Oops: 0002 [#1] SMP +[57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus +[57848.106019] Dumping ftrace buffer: +[57848.106019] --------------------------------- +[...] +[57848.106019] 0 chrdev_open+0xe6/0x165 <- cdev_put+0x23/0x24 +[57848.106019] 0 chrdev_open+0x117/0x165 <- chrdev_open+0xfa/0x165 +[57848.106019] 0 chrdev_open+0x120/0x165 <- chrdev_open+0x11c/0x165 +[57848.106019] 0 chrdev_open+0x134/0x165 <- chrdev_open+0x12b/0x165 +[57848.106019] 0 open+0x0/0x14 [oops] <- chrdev_open+0x144/0x165 +[57848.106019] 0 page_fault+0x0/0x30 <- open+0x6/0x14 [oops] +[57848.106019] 0 error_entry+0x0/0x5b <- page_fault+0x4/0x30 +[57848.106019] 0 error_kernelspace+0x0/0x31 <- error_entry+0x59/0x5b +[57848.106019] 0 error_sti+0x0/0x1 <- error_kernelspace+0x2d/0x31 +[57848.106019] 0 page_fault+0x9/0x30 <- error_sti+0x0/0x1 +[57848.106019] 0 do_page_fault+0x0/0x881 <- page_fault+0x1a/0x30 +[...] +[57848.106019] 0 do_page_fault+0x66b/0x881 <- is_prefetch+0x1ee/0x1f2 +[57848.106019] 0 do_page_fault+0x6e0/0x881 <- do_page_fault+0x67a/0x881 +[57848.106019] 0 oops_begin+0x0/0x96 <- do_page_fault+0x6e0/0x881 +[57848.106019] 0 trace_hw_branch_oops+0x0/0x2d <- oops_begin+0x9/0x96 +[...] +[57848.106019] 0 ds_suspend_bts+0x2a/0xe3 <- ds_suspend_bts+0x1a/0xe3 +[57848.106019] --------------------------------- +[57848.106019] CPU 0 +[57848.106019] Modules linked in: oops +[57848.106019] Pid: 5542, comm: cat Tainted: G W 2.6.28 #23 +[57848.106019] RIP: 0010:[] [] open+0x6/0x14 [oops] +[57848.106019] RSP: 0018:ffff880235457d48 EFLAGS: 00010246 +[...] + + +function graph tracer +--------------------------- + +This tracer is similar to the function tracer except that it +probes a function on its entry and its exit. This is done by +using a dynamically allocated stack of return addresses in each +task_struct. On function entry the tracer overwrites the return +address of each function traced to set a custom probe. Thus the +original return address is stored on the stack of return address +in the task_struct. + +Probing on both ends of a function leads to special features +such as: + +- measure of a function's time execution +- having a reliable call stack to draw function calls graph + +This tracer is useful in several situations: + +- you want to find the reason of a strange kernel behavior and + need to see what happens in detail on any areas (or specific + ones). + +- you are experiencing weird latencies but it's difficult to + find its origin. + +- you want to find quickly which path is taken by a specific + function + +- you just want to peek inside a working kernel and want to see + what happens there. + +# tracer: function_graph +# +# CPU DURATION FUNCTION CALLS +# | | | | | | | + + 0) | sys_open() { + 0) | do_sys_open() { + 0) | getname() { + 0) | kmem_cache_alloc() { + 0) 1.382 us | __might_sleep(); + 0) 2.478 us | } + 0) | strncpy_from_user() { + 0) | might_fault() { + 0) 1.389 us | __might_sleep(); + 0) 2.553 us | } + 0) 3.807 us | } + 0) 7.876 us | } + 0) | alloc_fd() { + 0) 0.668 us | _spin_lock(); + 0) 0.570 us | expand_files(); + 0) 0.586 us | _spin_unlock(); + + +There are several columns that can be dynamically +enabled/disabled. You can use every combination of options you +want, depending on your needs. + +- The cpu number on which the function executed is default + enabled. It is sometimes better to only trace one cpu (see + tracing_cpu_mask file) or you might sometimes see unordered + function calls while cpu tracing switch. + + hide: echo nofuncgraph-cpu > /debug/tracing/trace_options + show: echo funcgraph-cpu > /debug/tracing/trace_options + +- The duration (function's time of execution) is displayed on + the closing bracket line of a function or on the same line + than the current function in case of a leaf one. It is default + enabled. + + hide: echo nofuncgraph-duration > /debug/tracing/trace_options + show: echo funcgraph-duration > /debug/tracing/trace_options + +- The overhead field precedes the duration field in case of + reached duration thresholds. + + hide: echo nofuncgraph-overhead > /debug/tracing/trace_options + show: echo funcgraph-overhead > /debug/tracing/trace_options + depends on: funcgraph-duration + + ie: + + 0) | up_write() { + 0) 0.646 us | _spin_lock_irqsave(); + 0) 0.684 us | _spin_unlock_irqrestore(); + 0) 3.123 us | } + 0) 0.548 us | fput(); + 0) + 58.628 us | } + + [...] + + 0) | putname() { + 0) | kmem_cache_free() { + 0) 0.518 us | __phys_addr(); + 0) 1.757 us | } + 0) 2.861 us | } + 0) ! 115.305 us | } + 0) ! 116.402 us | } + + + means that the function exceeded 10 usecs. + ! means that the function exceeded 100 usecs. + + +- The task/pid field displays the thread cmdline and pid which + executed the function. It is default disabled. + + hide: echo nofuncgraph-proc > /debug/tracing/trace_options + show: echo funcgraph-proc > /debug/tracing/trace_options + + ie: + + # tracer: function_graph + # + # CPU TASK/PID DURATION FUNCTION CALLS + # | | | | | | | | | + 0) sh-4802 | | d_free() { + 0) sh-4802 | | call_rcu() { + 0) sh-4802 | | __call_rcu() { + 0) sh-4802 | 0.616 us | rcu_process_gp_end(); + 0) sh-4802 | 0.586 us | check_for_new_grace_period(); + 0) sh-4802 | 2.899 us | } + 0) sh-4802 | 4.040 us | } + 0) sh-4802 | 5.151 us | } + 0) sh-4802 | + 49.370 us | } + + +- The absolute time field is an absolute timestamp given by the + system clock since it started. A snapshot of this time is + given on each entry/exit of functions + + hide: echo nofuncgraph-abstime > /debug/tracing/trace_options + show: echo funcgraph-abstime > /debug/tracing/trace_options + + ie: + + # + # TIME CPU DURATION FUNCTION CALLS + # | | | | | | | | + 360.774522 | 1) 0.541 us | } + 360.774522 | 1) 4.663 us | } + 360.774523 | 1) 0.541 us | __wake_up_bit(); + 360.774524 | 1) 6.796 us | } + 360.774524 | 1) 7.952 us | } + 360.774525 | 1) 9.063 us | } + 360.774525 | 1) 0.615 us | journal_mark_dirty(); + 360.774527 | 1) 0.578 us | __brelse(); + 360.774528 | 1) | reiserfs_prepare_for_journal() { + 360.774528 | 1) | unlock_buffer() { + 360.774529 | 1) | wake_up_bit() { + 360.774529 | 1) | bit_waitqueue() { + 360.774530 | 1) 0.594 us | __phys_addr(); + + +You can put some comments on specific functions by using +ftrace_printk() For example, if you want to put a comment inside +the __might_sleep() function, you just have to include + and call ftrace_printk() inside __might_sleep() + +ftrace_printk("I'm a comment!\n") + +will produce: + + 1) | __might_sleep() { + 1) | /* I'm a comment! */ + 1) 1.449 us | } + + +You might find other useful features for this tracer in the +following "dynamic ftrace" section such as tracing only specific +functions or tasks. + dynamic ftrace -------------- If CONFIG_DYNAMIC_FTRACE is set, the system will run with virtually no overhead when function tracing is disabled. The way this works is the mcount function call (placed at the start of -every kernel function, produced by the -pg switch in gcc), starts -of pointing to a simple return. (Enabling FTRACE will include the --pg switch in the compiling of the kernel.) +every kernel function, produced by the -pg switch in gcc), +starts of pointing to a simple return. (Enabling FTRACE will +include the -pg switch in the compiling of the kernel.) At compile time every C file object is run through the recordmcount.pl script (located in the scripts directory). This script will process the C object using objdump to find all the -locations in the .text section that call mcount. (Note, only -the .text section is processed, since processing other sections -like .init.text may cause races due to those sections being freed). - -A new section called "__mcount_loc" is created that holds references -to all the mcount call sites in the .text section. This section is -compiled back into the original object. The final linker will add -all these references into a single table. +locations in the .text section that call mcount. (Note, only the +.text section is processed, since processing other sections like +.init.text may cause races due to those sections being freed). + +A new section called "__mcount_loc" is created that holds +references to all the mcount call sites in the .text section. +This section is compiled back into the original object. The +final linker will add all these references into a single table. On boot up, before SMP is initialized, the dynamic ftrace code -scans this table and updates all the locations into nops. It also -records the locations, which are added to the available_filter_functions -list. Modules are processed as they are loaded and before they are -executed. When a module is unloaded, it also removes its functions from -the ftrace function list. This is automatic in the module unload -code, and the module author does not need to worry about it. - -When tracing is enabled, kstop_machine is called to prevent races -with the CPUS executing code being modified (which can cause the -CPU to do undesireable things), and the nops are patched back -to calls. But this time, they do not call mcount (which is just -a function stub). They now call into the ftrace infrastructure. +scans this table and updates all the locations into nops. It +also records the locations, which are added to the +available_filter_functions list. Modules are processed as they +are loaded and before they are executed. When a module is +unloaded, it also removes its functions from the ftrace function +list. This is automatic in the module unload code, and the +module author does not need to worry about it. + +When tracing is enabled, kstop_machine is called to prevent +races with the CPUS executing code being modified (which can +cause the CPU to do undesireable things), and the nops are +patched back to calls. But this time, they do not call mcount +(which is just a function stub). They now call into the ftrace +infrastructure. One special side-effect to the recording of the functions being traced is that we can now selectively choose which functions we -wish to trace and which ones we want the mcount calls to remain as -nops. +wish to trace and which ones we want the mcount calls to remain +as nops. -Two files are used, one for enabling and one for disabling the tracing -of specified functions. They are: +Two files are used, one for enabling and one for disabling the +tracing of specified functions. They are: set_ftrace_filter @@ -1202,8 +1535,8 @@ and set_ftrace_notrace -A list of available functions that you can add to these files is listed -in: +A list of available functions that you can add to these files is +listed in: available_filter_functions @@ -1240,8 +1573,8 @@ hrtimer_interrupt sys_nanosleep -Perhaps this is not enough. The filters also allow simple wild cards. -Only the following are currently available +Perhaps this is not enough. The filters also allow simple wild +cards. Only the following are currently available * - will match functions that begin with * - will match functions that end with @@ -1251,9 +1584,9 @@ These are the only wild cards which are * will not work. -Note: It is better to use quotes to enclose the wild cards, otherwise - the shell may expand the parameters into names of files in the local - directory. +Note: It is better to use quotes to enclose the wild cards, + otherwise the shell may expand the parameters into names + of files in the local directory. # echo 'hrtimer_*' > /debug/tracing/set_ftrace_filter @@ -1299,7 +1632,8 @@ This is because the '>' and '>>' act jus To rewrite the filters, use '>' To append to the filters, use '>>' -To clear out a filter so that all functions will be recorded again: +To clear out a filter so that all functions will be recorded +again: # echo > /debug/tracing/set_ftrace_filter # cat /debug/tracing/set_ftrace_filter @@ -1331,7 +1665,8 @@ hrtimer_get_res hrtimer_init_sleeper -The set_ftrace_notrace prevents those functions from being traced. +The set_ftrace_notrace prevents those functions from being +traced. # echo '*preempt*' '*lock*' > /debug/tracing/set_ftrace_notrace @@ -1353,13 +1688,75 @@ Produces: We can see that there's no more lock or preempt tracing. + +Dynamic ftrace with the function graph tracer +--------------------------------------------- + +Although what has been explained above concerns both the +function tracer and the function-graph-tracer, there are some +special features only available in the function-graph tracer. + +If you want to trace only one function and all of its children, +you just have to echo its name into set_graph_function: + + echo __do_fault > set_graph_function + +will produce the following "expanded" trace of the __do_fault() +function: + + 0) | __do_fault() { + 0) | filemap_fault() { + 0) | find_lock_page() { + 0) 0.804 us | find_get_page(); + 0) | __might_sleep() { + 0) 1.329 us | } + 0) 3.904 us | } + 0) 4.979 us | } + 0) 0.653 us | _spin_lock(); + 0) 0.578 us | page_add_file_rmap(); + 0) 0.525 us | native_set_pte_at(); + 0) 0.585 us | _spin_unlock(); + 0) | unlock_page() { + 0) 0.541 us | page_waitqueue(); + 0) 0.639 us | __wake_up_bit(); + 0) 2.786 us | } + 0) + 14.237 us | } + 0) | __do_fault() { + 0) | filemap_fault() { + 0) | find_lock_page() { + 0) 0.698 us | find_get_page(); + 0) | __might_sleep() { + 0) 1.412 us | } + 0) 3.950 us | } + 0) 5.098 us | } + 0) 0.631 us | _spin_lock(); + 0) 0.571 us | page_add_file_rmap(); + 0) 0.526 us | native_set_pte_at(); + 0) 0.586 us | _spin_unlock(); + 0) | unlock_page() { + 0) 0.533 us | page_waitqueue(); + 0) 0.638 us | __wake_up_bit(); + 0) 2.793 us | } + 0) + 14.012 us | } + +You can also expand several functions at once: + + echo sys_open > set_graph_function + echo sys_close >> set_graph_function + +Now if you want to go back to trace all functions you can clear +this special filter via: + + echo > set_graph_function + + trace_pipe ---------- -The trace_pipe outputs the same content as the trace file, but the effect -on the tracing is different. Every read from trace_pipe is consumed. -This means that subsequent reads will be different. The trace -is live. +The trace_pipe outputs the same content as the trace file, but +the effect on the tracing is different. Every read from +trace_pipe is consumed. This means that subsequent reads will be +different. The trace is live. # echo function > /debug/tracing/current_tracer # cat /debug/tracing/trace_pipe > /tmp/trace.out & @@ -1387,38 +1784,45 @@ is live. bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up -Note, reading the trace_pipe file will block until more input is added. -By changing the tracer, trace_pipe will issue an EOF. We needed -to set the function tracer _before_ we "cat" the trace_pipe file. +Note, reading the trace_pipe file will block until more input is +added. By changing the tracer, trace_pipe will issue an EOF. We +needed to set the function tracer _before_ we "cat" the +trace_pipe file. trace entries ------------- -Having too much or not enough data can be troublesome in diagnosing -an issue in the kernel. The file buffer_size_kb is used to modify -the size of the internal trace buffers. The number listed -is the number of entries that can be recorded per CPU. To know -the full size, multiply the number of possible CPUS with the -number of entries. +Having too much or not enough data can be troublesome in +diagnosing an issue in the kernel. The file buffer_size_kb is +used to modify the size of the internal trace buffers. The +number listed is the number of entries that can be recorded per +CPU. To know the full size, multiply the number of possible CPUS +with the number of entries. # cat /debug/tracing/buffer_size_kb 1408 (units kilobytes) -Note, to modify this, you must have tracing completely disabled. To do that, -echo "nop" into the current_tracer. If the current_tracer is not set -to "nop", an EINVAL error will be returned. +Note, to modify this, you must have tracing completely disabled. +To do that, echo "nop" into the current_tracer. If the +current_tracer is not set to "nop", an EINVAL error will be +returned. # echo nop > /debug/tracing/current_tracer # echo 10000 > /debug/tracing/buffer_size_kb # cat /debug/tracing/buffer_size_kb 10000 (units kilobytes) -The number of pages which will be allocated is limited to a percentage -of available memory. Allocating too much will produce an error. +The number of pages which will be allocated is limited to a +percentage of available memory. Allocating too much will produce +an error. # echo 1000000000000 > /debug/tracing/buffer_size_kb -bash: echo: write error: Cannot allocate memory # cat /debug/tracing/buffer_size_kb 85 +----------- + +More details can be found in the source code, in the +kernel/tracing/*.c files. Index: linux-2.6-tip/Documentation/kernel-parameters.txt =================================================================== --- linux-2.6-tip.orig/Documentation/kernel-parameters.txt +++ linux-2.6-tip/Documentation/kernel-parameters.txt @@ -49,6 +49,7 @@ parameter is applicable: ISAPNP ISA PnP code is enabled. ISDN Appropriate ISDN support is enabled. JOY Appropriate joystick support is enabled. + KMEMTRACE kmemtrace is enabled. LIBATA Libata driver is enabled LP Printer support is enabled. LOOP Loopback device support is enabled. @@ -492,10 +493,12 @@ and is between 256 and 4096 characters. Default: 64 hpet= [X86-32,HPET] option to control HPET usage - Format: { enable (default) | disable | force } + Format: { enable (default) | disable | force | + verbose } disable: disable HPET and use PIT instead force: allow force enabled of undocumented chips (ICH4, VIA, nVidia) + verbose: show contents of HPET registers during setup com20020= [HW,NET] ARCnet - COM20020 chipset Format: @@ -1045,6 +1048,15 @@ and is between 256 and 4096 characters. use the HighMem zone if it exists, and the Normal zone if it does not. + kmemtrace.enable= [KNL,KMEMTRACE] Format: { yes | no } + Controls whether kmemtrace is enabled + at boot-time. + + kmemtrace.subbufs=n [KNL,KMEMTRACE] Overrides the number of + subbufs kmemtrace's relay channel has. Set this + higher than default (KMEMTRACE_N_SUBBUFS in code) if + you experience buffer overruns. + movablecore=nn[KMG] [KNL,X86-32,IA-64,PPC,X86-64] This parameter is similar to kernelcore except it specifies the amount of memory used for migratable allocations. Index: linux-2.6-tip/Documentation/kmemcheck.txt =================================================================== --- /dev/null +++ linux-2.6-tip/Documentation/kmemcheck.txt @@ -0,0 +1,129 @@ +Contents +======== + + 1. How to use + 2. Technical description + 3. Changes to the slab allocators + 4. Problems + 5. Parameters + 6. Future enhancements + + +How to use (IMPORTANT) +====================== + +Always remember this: kmemcheck _will_ give false positives. So don't enable +it and spam the mailing list with its reports; you are not going to be heard, +and it will make people's skins thicker for when the real errors are found. + +Instead, I encourage maintainers and developers to find errors in _their_ +_own_ code. And if you find false positives, you can try to work around them, +try to figure out if it's a real bug or not, or simply ignore them. Most +developers know their own code and will quickly and efficiently determine the +root cause of a kmemcheck report. This is therefore also the most efficient +way to work with kmemcheck. + +If you still want to run kmemcheck to inspect others' code, the rule of thumb +should be: If it's not obvious (to you), don't tell us about it either. Most +likely the code is correct and you'll only waste our time. If you can work +out the error, please do send the maintainer a heads up and/or a patch, but +don't expect him/her to fix something that wasn't wrong in the first place. + + +Technical description +===================== + +kmemcheck works by marking memory pages non-present. This means that whenever +somebody attempts to access the page, a page fault is generated. The page +fault handler notices that the page was in fact only hidden, and so it calls +on the kmemcheck code to make further investigations. + +When the investigations are completed, kmemcheck "shows" the page by marking +it present (as it would be under normal circumstances). This way, the +interrupted code can continue as usual. + +But after the instruction has been executed, we should hide the page again, so +that we can catch the next access too! Now kmemcheck makes use of a debugging +feature of the processor, namely single-stepping. When the processor has +finished the one instruction that generated the memory access, a debug +exception is raised. From here, we simply hide the page again and continue +execution, this time with the single-stepping feature turned off. + + +Changes to the slab allocators +============================== + +kmemcheck requires some assistance from the memory allocator in order to work. +The memory allocator needs to + +1. Tell kmemcheck about newly allocated pages and pages that are about to + be freed. This allows kmemcheck to set up and tear down the shadow memory + for the pages in question. The shadow memory stores the status of each byte + in the allocation proper, e.g. whether it is initialized or uninitialized. +2. Tell kmemcheck which parts of memory should be marked uninitialized. There + are actually a few more states, such as "not yet allocated" and "recently + freed". + +If a slab cache is set up using the SLAB_NOTRACK flag, it will never return +memory that can take page faults because of kmemcheck. + +If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still +request memory with the __GFP_NOTRACK flag. This does not prevent the page +faults from occurring, however, but marks the object in question as being +initialized so that no warnings will ever be produced for this object. + +Currently, the SLAB and SLUB allocators are supported by kmemcheck. + + +Problems +======== + +The most prominent problem seems to be that of bit-fields. kmemcheck can only +track memory with byte granularity. Therefore, when gcc generates code to +access only one bit in a bit-field, there is really no way for kmemcheck to +know which of the other bits will be used or thrown away. Consequently, there +may be bogus warnings for bit-field accesses. We have added a "bitfields" API +to get around this problem. See include/linux/kmemcheck.h for detailed +instructions! + + +Parameters +========== + +In addition to enabling CONFIG_KMEMCHECK before the kernel is compiled, the +parameter kmemcheck=1 must be passed to the kernel when it is started in order +to actually do the tracking. So by default, there is only a very small +(probably negligible) overhead for enabling the config option. + +Similarly, kmemcheck may be turned on or off at run-time using, respectively: + +echo 1 > /proc/sys/kernel/kmemcheck + and +echo 0 > /proc/sys/kernel/kmemcheck + +Note that this is a lazy setting; once turned off, the old allocations will +still have to take a single page fault exception before tracking is turned off +for that particular page. Enabling kmemcheck on will only enable tracking for +allocations made from that point onwards. + +The default mode is the one-shot mode, where only the first error is reported +before kmemcheck is disabled. This mode can be enabled by passing kmemcheck=2 +to the kernel at boot, or running + +echo 2 > /proc/sys/kernel/kmemcheck + +when the kernel is already running. + + +Future enhancements +=================== + +There is already some preliminary support for catching use-after-free errors. +What still needs to be done is delaying kfree() so that memory is not +reallocated immediately after freeing it. [Suggested by Pekka Enberg.] + +It should be possible to allow SMP systems by duplicating the page tables for +each processor in the system. This is probably extremely difficult, however. +[Suggested by Ingo Molnar.] + +Support for instruction set extensions like XMM, SSE2, etc. Index: linux-2.6-tip/Documentation/lockdep-design.txt =================================================================== --- linux-2.6-tip.orig/Documentation/lockdep-design.txt +++ linux-2.6-tip/Documentation/lockdep-design.txt @@ -27,33 +27,37 @@ lock-class. State ----- -The validator tracks lock-class usage history into 5 separate state bits: +The validator tracks lock-class usage history into 4n + 1 separate state bits: -- 'ever held in hardirq context' [ == hardirq-safe ] -- 'ever held in softirq context' [ == softirq-safe ] -- 'ever held with hardirqs enabled' [ == hardirq-unsafe ] -- 'ever held with softirqs and hardirqs enabled' [ == softirq-unsafe ] +- 'ever held in STATE context' +- 'ever head as readlock in STATE context' +- 'ever head with STATE enabled' +- 'ever head as readlock with STATE enabled' + +Where STATE can be either one of (kernel/lockdep_states.h) + - hardirq + - softirq + - reclaim_fs - 'ever used' [ == !unused ] -When locking rules are violated, these 4 state bits are presented in the -locking error messages, inside curlies. A contrived example: +When locking rules are violated, these state bits are presented in the +locking error messages, inside curlies. A contrived example: modprobe/2287 is trying to acquire lock: - (&sio_locks[i].lock){--..}, at: [] mutex_lock+0x21/0x24 + (&sio_locks[i].lock){-.-...}, at: [] mutex_lock+0x21/0x24 but task is already holding lock: - (&sio_locks[i].lock){--..}, at: [] mutex_lock+0x21/0x24 + (&sio_locks[i].lock){-.-...}, at: [] mutex_lock+0x21/0x24 -The bit position indicates hardirq, softirq, hardirq-read, -softirq-read respectively, and the character displayed in each -indicates: +The bit position indicates STATE, STATE-read, for each of the states listed +above, and the character displayed in each indicates: '.' acquired while irqs disabled '+' acquired in irq context '-' acquired with irqs enabled - '?' read acquired in irq context with irqs enabled. + '?' acquired in irq context with irqs enabled. Unused mutexes cannot be part of the cause of an error. Index: linux-2.6-tip/Documentation/perf-counters.txt =================================================================== --- /dev/null +++ linux-2.6-tip/Documentation/perf-counters.txt @@ -0,0 +1,147 @@ + +Performance Counters for Linux +------------------------------ + +Performance counters are special hardware registers available on most modern +CPUs. These registers count the number of certain types of hw events: such +as instructions executed, cachemisses suffered, or branches mis-predicted - +without slowing down the kernel or applications. These registers can also +trigger interrupts when a threshold number of events have passed - and can +thus be used to profile the code that runs on that CPU. + +The Linux Performance Counter subsystem provides an abstraction of these +hardware capabilities. It provides per task and per CPU counters, counter +groups, and it provides event capabilities on top of those. + +Performance counters are accessed via special file descriptors. +There's one file descriptor per virtual counter used. + +The special file descriptor is opened via the perf_counter_open() +system call: + + int sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr, + pid_t pid, int cpu, int group_fd); + +The syscall returns the new fd. The fd can be used via the normal +VFS system calls: read() can be used to read the counter, fcntl() +can be used to set the blocking mode, etc. + +Multiple counters can be kept open at a time, and the counters +can be poll()ed. + +When creating a new counter fd, 'perf_counter_hw_event' is: + +/* + * Hardware event to monitor via a performance monitoring counter: + */ +struct perf_counter_hw_event { + s64 type; + + u64 irq_period; + u32 record_type; + + u32 disabled : 1, /* off by default */ + nmi : 1, /* NMI sampling */ + raw : 1, /* raw event type */ + __reserved_1 : 29; + + u64 __reserved_2; +}; + +/* + * Generalized performance counter event types, used by the hw_event.type + * parameter of the sys_perf_counter_open() syscall: + */ +enum hw_event_types { + /* + * Common hardware events, generalized by the kernel: + */ + PERF_COUNT_CYCLES = 0, + PERF_COUNT_INSTRUCTIONS = 1, + PERF_COUNT_CACHE_REFERENCES = 2, + PERF_COUNT_CACHE_MISSES = 3, + PERF_COUNT_BRANCH_INSTRUCTIONS = 4, + PERF_COUNT_BRANCH_MISSES = 5, + + /* + * Special "software" counters provided by the kernel, even if + * the hardware does not support performance counters. These + * counters measure various physical and sw events of the + * kernel (and allow the profiling of them as well): + */ + PERF_COUNT_CPU_CLOCK = -1, + PERF_COUNT_TASK_CLOCK = -2, + /* + * Future software events: + */ + /* PERF_COUNT_PAGE_FAULTS = -3, + PERF_COUNT_CONTEXT_SWITCHES = -4, */ +}; + +These are standardized types of events that work uniformly on all CPUs +that implements Performance Counters support under Linux. If a CPU is +not able to count branch-misses, then the system call will return +-EINVAL. + +More hw_event_types are supported as well, but they are CPU +specific and are enumerated via /sys on a per CPU basis. Raw hw event +types can be passed in under hw_event.type if hw_event.raw is 1. +For example, to count "External bus cycles while bus lock signal asserted" +events on Intel Core CPUs, pass in a 0x4064 event type value and set +hw_event.raw to 1. + +'record_type' is the type of data that a read() will provide for the +counter, and it can be one of: + +/* + * IRQ-notification data record type: + */ +enum perf_counter_record_type { + PERF_RECORD_SIMPLE = 0, + PERF_RECORD_IRQ = 1, + PERF_RECORD_GROUP = 2, +}; + +a "simple" counter is one that counts hardware events and allows +them to be read out into a u64 count value. (read() returns 8 on +a successful read of a simple counter.) + +An "irq" counter is one that will also provide an IRQ context information: +the IP of the interrupted context. In this case read() will return +the 8-byte counter value, plus the Instruction Pointer address of the +interrupted context. + +The parameter 'hw_event_period' is the number of events before waking up +a read() that is blocked on a counter fd. Zero value means a non-blocking +counter. + +The 'pid' parameter allows the counter to be specific to a task: + + pid == 0: if the pid parameter is zero, the counter is attached to the + current task. + + pid > 0: the counter is attached to a specific task (if the current task + has sufficient privilege to do so) + + pid < 0: all tasks are counted (per cpu counters) + +The 'cpu' parameter allows a counter to be made specific to a full +CPU: + + cpu >= 0: the counter is restricted to a specific CPU + cpu == -1: the counter counts on all CPUs + +(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.) + +A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts +events of that task and 'follows' that task to whatever CPU the task +gets schedule to. Per task counters can be created by any user, for +their own tasks. + +A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts +all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege. + +Group counters are created by passing in a group_fd of another counter. +Groups are scheduled at once and can be used with PERF_RECORD_GROUP +to record multi-dimensional timestamps. + Index: linux-2.6-tip/Documentation/sysrq.txt =================================================================== --- linux-2.6-tip.orig/Documentation/sysrq.txt +++ linux-2.6-tip/Documentation/sysrq.txt @@ -113,6 +113,8 @@ On all - write a character to /proc/sys 'x' - Used by xmon interface on ppc/powerpc platforms. +'z' - Dump the ftrace buffer + '0'-'9' - Sets the console log level, controlling which kernel messages will be printed to your console. ('0', for example would make it so that only emergency messages like PANICs or OOPSes would Index: linux-2.6-tip/Documentation/vm/kmemtrace.txt =================================================================== --- /dev/null +++ linux-2.6-tip/Documentation/vm/kmemtrace.txt @@ -0,0 +1,126 @@ + kmemtrace - Kernel Memory Tracer + + by Eduard - Gabriel Munteanu + + +I. Introduction +=============== + +kmemtrace helps kernel developers figure out two things: +1) how different allocators (SLAB, SLUB etc.) perform +2) how kernel code allocates memory and how much + +To do this, we trace every allocation and export information to the userspace +through the relay interface. We export things such as the number of requested +bytes, the number of bytes actually allocated (i.e. including internal +fragmentation), whether this is a slab allocation or a plain kmalloc() and so +on. + +The actual analysis is performed by a userspace tool (see section III for +details on where to get it from). It logs the data exported by the kernel, +processes it and (as of writing this) can provide the following information: +- the total amount of memory allocated and fragmentation per call-site +- the amount of memory allocated and fragmentation per allocation +- total memory allocated and fragmentation in the collected dataset +- number of cross-CPU allocation and frees (makes sense in NUMA environments) + +Moreover, it can potentially find inconsistent and erroneous behavior in +kernel code, such as using slab free functions on kmalloc'ed memory or +allocating less memory than requested (but not truly failed allocations). + +kmemtrace also makes provisions for tracing on some arch and analysing the +data on another. + +II. Design and goals +==================== + +kmemtrace was designed to handle rather large amounts of data. Thus, it uses +the relay interface to export whatever is logged to userspace, which then +stores it. Analysis and reporting is done asynchronously, that is, after the +data is collected and stored. By design, it allows one to log and analyse +on different machines and different arches. + +As of writing this, the ABI is not considered stable, though it might not +change much. However, no guarantees are made about compatibility yet. When +deemed stable, the ABI should still allow easy extension while maintaining +backward compatibility. This is described further in Documentation/ABI. + +Summary of design goals: + - allow logging and analysis to be done across different machines + - be fast and anticipate usage in high-load environments (*) + - be reasonably extensible + - make it possible for GNU/Linux distributions to have kmemtrace + included in their repositories + +(*) - one of the reasons Pekka Enberg's original userspace data analysis + tool's code was rewritten from Perl to C (although this is more than a + simple conversion) + + +III. Quick usage guide +====================== + +1) Get a kernel that supports kmemtrace and build it accordingly (i.e. enable +CONFIG_KMEMTRACE). + +2) Get the userspace tool and build it: +$ git-clone git://repo.or.cz/kmemtrace-user.git # current repository +$ cd kmemtrace-user/ +$ ./autogen.sh +$ ./configure +$ make + +3) Boot the kmemtrace-enabled kernel if you haven't, preferably in the +'single' runlevel (so that relay buffers don't fill up easily), and run +kmemtrace: +# '$' does not mean user, but root here. +$ mount -t debugfs none /sys/kernel/debug +$ mount -t proc none /proc +$ cd path/to/kmemtrace-user/ +$ ./kmemtraced +Wait a bit, then stop it with CTRL+C. +$ cat /sys/kernel/debug/kmemtrace/total_overruns # Check if we didn't + # overrun, should + # be zero. +$ (Optionally) [Run kmemtrace_check separately on each cpu[0-9]*.out file to + check its correctness] +$ ./kmemtrace-report + +Now you should have a nice and short summary of how the allocator performs. + +IV. FAQ and known issues +======================== + +Q: 'cat /sys/kernel/debug/kmemtrace/total_overruns' is non-zero, how do I fix +this? Should I worry? +A: If it's non-zero, this affects kmemtrace's accuracy, depending on how +large the number is. You can fix it by supplying a higher +'kmemtrace.subbufs=N' kernel parameter. +--- + +Q: kmemtrace_check reports errors, how do I fix this? Should I worry? +A: This is a bug and should be reported. It can occur for a variety of +reasons: + - possible bugs in relay code + - possible misuse of relay by kmemtrace + - timestamps being collected unorderly +Or you may fix it yourself and send us a patch. +--- + +Q: kmemtrace_report shows many errors, how do I fix this? Should I worry? +A: This is a known issue and I'm working on it. These might be true errors +in kernel code, which may have inconsistent behavior (e.g. allocating memory +with kmem_cache_alloc() and freeing it with kfree()). Pekka Enberg pointed +out this behavior may work with SLAB, but may fail with other allocators. + +It may also be due to lack of tracing in some unusual allocator functions. + +We don't want bug reports regarding this issue yet. +--- + +V. See also +=========== + +Documentation/kernel-parameters.txt +Documentation/ABI/testing/debugfs-kmemtrace + Index: linux-2.6-tip/Documentation/x86/boot.txt =================================================================== --- linux-2.6-tip.orig/Documentation/x86/boot.txt +++ linux-2.6-tip/Documentation/x86/boot.txt @@ -158,7 +158,7 @@ Offset Proto Name Meaning 0202/4 2.00+ header Magic signature "HdrS" 0206/2 2.00+ version Boot protocol version supported 0208/4 2.00+ realmode_swtch Boot loader hook (see below) -020C/2 2.00+ start_sys The load-low segment (0x1000) (obsolete) +020C/2 2.00+ start_sys_seg The load-low segment (0x1000) (obsolete) 020E/2 2.00+ kernel_version Pointer to kernel version string 0210/1 2.00+ type_of_loader Boot loader identifier 0211/1 2.00+ loadflags Boot protocol option flags @@ -170,10 +170,11 @@ Offset Proto Name Meaning 0224/2 2.01+ heap_end_ptr Free memory after setup end 0226/2 N/A pad1 Unused 0228/4 2.02+ cmd_line_ptr 32-bit pointer to the kernel command line -022C/4 2.03+ initrd_addr_max Highest legal initrd address +022C/4 2.03+ ramdisk_max Highest legal initrd address 0230/4 2.05+ kernel_alignment Physical addr alignment required for kernel 0234/1 2.05+ relocatable_kernel Whether kernel is relocatable or not -0235/3 N/A pad2 Unused +0235/1 N/A pad2 Unused +0236/2 N/A pad3 Unused 0238/4 2.06+ cmdline_size Maximum size of the kernel command line 023C/4 2.07+ hardware_subarch Hardware subarchitecture 0240/8 2.07+ hardware_subarch_data Subarchitecture-specific data @@ -299,14 +300,14 @@ Protocol: 2.00+ e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version 10.17. -Field name: readmode_swtch +Field name: realmode_swtch Type: modify (optional) Offset/size: 0x208/4 Protocol: 2.00+ Boot loader hook (see ADVANCED BOOT LOADER HOOKS below.) -Field name: start_sys +Field name: start_sys_seg Type: read Offset/size: 0x20c/2 Protocol: 2.00+ @@ -468,7 +469,7 @@ Protocol: 2.02+ zero, the kernel will assume that your boot loader does not support the 2.02+ protocol. -Field name: initrd_addr_max +Field name: ramdisk_max Type: read Offset/size: 0x22c/4 Protocol: 2.03+ Index: linux-2.6-tip/MAINTAINERS =================================================================== --- linux-2.6-tip.orig/MAINTAINERS +++ linux-2.6-tip/MAINTAINERS @@ -2623,6 +2623,20 @@ M: jason.wessel@windriver.com L: kgdb-bugreport@lists.sourceforge.net S: Maintained +KMEMCHECK +P: Vegard Nossum +M: vegardno@ifi.uio.no +P Pekka Enberg +M: penberg@cs.helsinki.fi +L: linux-kernel@vger.kernel.org +S: Maintained + +KMEMTRACE +P: Eduard - Gabriel Munteanu +M: eduard.munteanu@linux360.ro +L: linux-kernel@vger.kernel.org +S: Maintained + KPROBES P: Ananth N Mavinakayanahalli M: ananth@in.ibm.com Index: linux-2.6-tip/Makefile =================================================================== --- linux-2.6-tip.orig/Makefile +++ linux-2.6-tip/Makefile @@ -1,7 +1,7 @@ VERSION = 2 PATCHLEVEL = 6 SUBLEVEL = 29 -EXTRAVERSION = -rc6 +EXTRAVERSION = -rc6-rt2 NAME = Erotic Pickled Herring # *DOCUMENTATION* @@ -533,8 +533,9 @@ KBUILD_CFLAGS += $(call cc-option,-Wfram endif # Force gcc to behave correct even for buggy distributions -# Arch Makefiles may override this setting +ifndef CONFIG_CC_STACKPROTECTOR KBUILD_CFLAGS += $(call cc-option, -fno-stack-protector) +endif ifdef CONFIG_FRAME_POINTER KBUILD_CFLAGS += -fno-omit-frame-pointer -fno-optimize-sibling-calls @@ -551,6 +552,10 @@ ifdef CONFIG_FUNCTION_TRACER KBUILD_CFLAGS += -pg endif +ifndef CONFIG_ALLOW_WARNINGS +KBUILD_CFLAGS += -Werror ${WERROR} +endif + # We trigger additional mismatches with less inlining ifdef CONFIG_DEBUG_SECTION_MISMATCH KBUILD_CFLAGS += $(call cc-option, -fno-inline-functions-called-once) Index: linux-2.6-tip/arch/alpha/include/asm/hardirq.h =================================================================== --- linux-2.6-tip.orig/arch/alpha/include/asm/hardirq.h +++ linux-2.6-tip/arch/alpha/include/asm/hardirq.h @@ -14,17 +14,4 @@ typedef struct { void ack_bad_irq(unsigned int irq); -#define HARDIRQ_BITS 12 - -/* - * The hardirq mask has to be large enough to have - * space for potentially nestable IRQ sources in the system - * to nest on a single CPU. On Alpha, interrupts are masked at the CPU - * by IPL as well as at the system level. We only have 8 IPLs (UNIX PALcode) - * so we really only have 8 nestable IRQs, but allow some overhead - */ -#if (1 << HARDIRQ_BITS) < 16 -#error HARDIRQ_BITS is too low! -#endif - #endif /* _ALPHA_HARDIRQ_H */ Index: linux-2.6-tip/arch/alpha/include/asm/statfs.h =================================================================== --- linux-2.6-tip.orig/arch/alpha/include/asm/statfs.h +++ linux-2.6-tip/arch/alpha/include/asm/statfs.h @@ -1,6 +1,8 @@ #ifndef _ALPHA_STATFS_H #define _ALPHA_STATFS_H +#include + /* Alpha is the only 64-bit platform with 32-bit statfs. And doesn't even seem to implement statfs64 */ #define __statfs_word __u32 Index: linux-2.6-tip/arch/alpha/include/asm/swab.h =================================================================== --- linux-2.6-tip.orig/arch/alpha/include/asm/swab.h +++ linux-2.6-tip/arch/alpha/include/asm/swab.h @@ -1,7 +1,7 @@ #ifndef _ALPHA_SWAB_H #define _ALPHA_SWAB_H -#include +#include #include #include Index: linux-2.6-tip/arch/alpha/kernel/irq.c =================================================================== --- linux-2.6-tip.orig/arch/alpha/kernel/irq.c +++ linux-2.6-tip/arch/alpha/kernel/irq.c @@ -55,7 +55,7 @@ int irq_select_affinity(unsigned int irq cpu = (cpu < (NR_CPUS-1) ? cpu + 1 : 0); last_cpu = cpu; - irq_desc[irq].affinity = cpumask_of_cpu(cpu); + cpumask_copy(irq_desc[irq].affinity, cpumask_of(cpu)); irq_desc[irq].chip->set_affinity(irq, cpumask_of(cpu)); return 0; } @@ -90,7 +90,7 @@ show_interrupts(struct seq_file *p, void seq_printf(p, "%10u ", kstat_irqs(irq)); #else for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[irq]); + seq_printf(p, "%10u ", kstat_irqs_cpu(irq, j)); #endif seq_printf(p, " %14s", irq_desc[irq].chip->typename); seq_printf(p, " %c%s", Index: linux-2.6-tip/arch/alpha/kernel/irq_alpha.c =================================================================== --- linux-2.6-tip.orig/arch/alpha/kernel/irq_alpha.c +++ linux-2.6-tip/arch/alpha/kernel/irq_alpha.c @@ -64,7 +64,7 @@ do_entInt(unsigned long type, unsigned l smp_percpu_timer_interrupt(regs); cpu = smp_processor_id(); if (cpu != boot_cpuid) { - kstat_cpu(cpu).irqs[RTC_IRQ]++; + kstat_incr_irqs_this_cpu(RTC_IRQ, irq_to_desc(RTC_IRQ)); } else { handle_irq(RTC_IRQ); } Index: linux-2.6-tip/arch/arm/include/asm/a.out.h =================================================================== --- linux-2.6-tip.orig/arch/arm/include/asm/a.out.h +++ linux-2.6-tip/arch/arm/include/asm/a.out.h @@ -2,7 +2,7 @@ #define __ARM_A_OUT_H__ #include -#include +#include struct exec { Index: linux-2.6-tip/arch/arm/include/asm/setup.h =================================================================== --- linux-2.6-tip.orig/arch/arm/include/asm/setup.h +++ linux-2.6-tip/arch/arm/include/asm/setup.h @@ -14,7 +14,7 @@ #ifndef __ASMARM_SETUP_H #define __ASMARM_SETUP_H -#include +#include #define COMMAND_LINE_SIZE 1024 Index: linux-2.6-tip/arch/arm/include/asm/swab.h =================================================================== --- linux-2.6-tip.orig/arch/arm/include/asm/swab.h +++ linux-2.6-tip/arch/arm/include/asm/swab.h @@ -16,7 +16,7 @@ #define __ASM_ARM_SWAB_H #include -#include +#include #if !defined(__STRICT_ANSI__) || defined(__KERNEL__) # define __SWAB_64_THRU_32__ Index: linux-2.6-tip/arch/arm/kernel/irq.c =================================================================== --- linux-2.6-tip.orig/arch/arm/kernel/irq.c +++ linux-2.6-tip/arch/arm/kernel/irq.c @@ -76,7 +76,7 @@ int show_interrupts(struct seq_file *p, seq_printf(p, "%3d: ", i); for_each_present_cpu(cpu) - seq_printf(p, "%10u ", kstat_cpu(cpu).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, cpu)); seq_printf(p, " %10s", irq_desc[i].chip->name ? : "-"); seq_printf(p, " %s", action->name); for (action = action->next; action; action = action->next) @@ -104,6 +104,11 @@ static struct irq_desc bad_irq_desc = { .lock = __SPIN_LOCK_UNLOCKED(bad_irq_desc.lock), }; +#ifdef CONFIG_CPUMASK_OFFSTACK +/* We are not allocating bad_irq_desc.affinity or .pending_mask */ +#error "ARM architecture does not support CONFIG_CPUMASK_OFFSTACK." +#endif + /* * do_IRQ handles all hardware IRQ's. Decoded IRQs should not * come via this function. Instead, they should provide their @@ -161,7 +166,7 @@ void __init init_IRQ(void) irq_desc[irq].status |= IRQ_NOREQUEST | IRQ_NOPROBE; #ifdef CONFIG_SMP - bad_irq_desc.affinity = CPU_MASK_ALL; + cpumask_setall(bad_irq_desc.affinity); bad_irq_desc.cpu = smp_processor_id(); #endif init_arch_irq(); @@ -191,15 +196,16 @@ void migrate_irqs(void) struct irq_desc *desc = irq_desc + i; if (desc->cpu == cpu) { - unsigned int newcpu = any_online_cpu(desc->affinity); - - if (newcpu == NR_CPUS) { + unsigned int newcpu = cpumask_any_and(desc->affinity, + cpu_online_mask); + if (newcpu >= nr_cpu_ids) { if (printk_ratelimit()) printk(KERN_INFO "IRQ%u no longer affine to CPU%u\n", i, cpu); - cpus_setall(desc->affinity); - newcpu = any_online_cpu(desc->affinity); + cpumask_setall(desc->affinity); + newcpu = cpumask_any_and(desc->affinity, + cpu_online_mask); } route_irq(desc, i, newcpu); Index: linux-2.6-tip/arch/arm/kernel/vmlinux.lds.S =================================================================== --- linux-2.6-tip.orig/arch/arm/kernel/vmlinux.lds.S +++ linux-2.6-tip/arch/arm/kernel/vmlinux.lds.S @@ -65,6 +65,7 @@ SECTIONS #endif . = ALIGN(4096); __per_cpu_start = .; + *(.data.percpu.page_aligned) *(.data.percpu) *(.data.percpu.shared_aligned) __per_cpu_end = .; Index: linux-2.6-tip/arch/arm/mach-ns9xxx/irq.c =================================================================== --- linux-2.6-tip.orig/arch/arm/mach-ns9xxx/irq.c +++ linux-2.6-tip/arch/arm/mach-ns9xxx/irq.c @@ -63,7 +63,6 @@ static struct irq_chip ns9xxx_chip = { #else static void handle_prio_irq(unsigned int irq, struct irq_desc *desc) { - unsigned int cpu = smp_processor_id(); struct irqaction *action; irqreturn_t action_ret; @@ -72,7 +71,7 @@ static void handle_prio_irq(unsigned int BUG_ON(desc->status & IRQ_INPROGRESS); desc->status &= ~(IRQ_REPLAY | IRQ_WAITING); - kstat_cpu(cpu).irqs[irq]++; + kstat_incr_irqs_this_cpu(irq, desc); action = desc->action; if (unlikely(!action || (desc->status & IRQ_DISABLED))) Index: linux-2.6-tip/arch/arm/oprofile/op_model_mpcore.c =================================================================== --- linux-2.6-tip.orig/arch/arm/oprofile/op_model_mpcore.c +++ linux-2.6-tip/arch/arm/oprofile/op_model_mpcore.c @@ -263,7 +263,7 @@ static void em_route_irq(int irq, unsign const struct cpumask *mask = cpumask_of(cpu); spin_lock_irq(&desc->lock); - desc->affinity = *mask; + cpumask_copy(desc->affinity, mask); desc->chip->set_affinity(irq, mask); spin_unlock_irq(&desc->lock); } Index: linux-2.6-tip/arch/avr32/include/asm/hardirq.h =================================================================== --- linux-2.6-tip.orig/arch/avr32/include/asm/hardirq.h +++ linux-2.6-tip/arch/avr32/include/asm/hardirq.h @@ -20,15 +20,4 @@ void ack_bad_irq(unsigned int irq); #endif /* __ASSEMBLY__ */ -#define HARDIRQ_BITS 12 - -/* - * The hardirq mask has to be large enough to have - * space for potentially all IRQ sources in the system - * nesting on a single CPU: - */ -#if (1 << HARDIRQ_BITS) < NR_IRQS -# error HARDIRQ_BITS is too low! -#endif - #endif /* __ASM_AVR32_HARDIRQ_H */ Index: linux-2.6-tip/arch/avr32/include/asm/swab.h =================================================================== --- linux-2.6-tip.orig/arch/avr32/include/asm/swab.h +++ linux-2.6-tip/arch/avr32/include/asm/swab.h @@ -4,7 +4,7 @@ #ifndef __ASM_AVR32_SWAB_H #define __ASM_AVR32_SWAB_H -#include +#include #include #define __SWAB_64_THRU_32__ Index: linux-2.6-tip/arch/avr32/kernel/irq.c =================================================================== --- linux-2.6-tip.orig/arch/avr32/kernel/irq.c +++ linux-2.6-tip/arch/avr32/kernel/irq.c @@ -58,7 +58,7 @@ int show_interrupts(struct seq_file *p, seq_printf(p, "%3d: ", i); for_each_online_cpu(cpu) - seq_printf(p, "%10u ", kstat_cpu(cpu).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, cpu)); seq_printf(p, " %8s", irq_desc[i].chip->name ? : "-"); seq_printf(p, " %s", action->name); for (action = action->next; action; action = action->next) Index: linux-2.6-tip/arch/blackfin/include/asm/swab.h =================================================================== --- linux-2.6-tip.orig/arch/blackfin/include/asm/swab.h +++ linux-2.6-tip/arch/blackfin/include/asm/swab.h @@ -1,7 +1,7 @@ #ifndef _BLACKFIN_SWAB_H #define _BLACKFIN_SWAB_H -#include +#include #include #if defined(__GNUC__) && !defined(__STRICT_ANSI__) || defined(__KERNEL__) Index: linux-2.6-tip/arch/blackfin/kernel/irqchip.c =================================================================== --- linux-2.6-tip.orig/arch/blackfin/kernel/irqchip.c +++ linux-2.6-tip/arch/blackfin/kernel/irqchip.c @@ -70,6 +70,11 @@ static struct irq_desc bad_irq_desc = { #endif }; +#ifdef CONFIG_CPUMASK_OFFSTACK +/* We are not allocating a variable-sized bad_irq_desc.affinity */ +#error "Blackfin architecture does not support CONFIG_CPUMASK_OFFSTACK." +#endif + int show_interrupts(struct seq_file *p, void *v) { int i = *(loff_t *) v, j; @@ -83,7 +88,7 @@ int show_interrupts(struct seq_file *p, goto skip; seq_printf(p, "%3d: ", i); for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); seq_printf(p, " %8s", irq_desc[i].chip->name); seq_printf(p, " %s", action->name); for (action = action->next; action; action = action->next) Index: linux-2.6-tip/arch/cris/kernel/irq.c =================================================================== --- linux-2.6-tip.orig/arch/cris/kernel/irq.c +++ linux-2.6-tip/arch/cris/kernel/irq.c @@ -66,7 +66,7 @@ int show_interrupts(struct seq_file *p, seq_printf(p, "%10u ", kstat_irqs(i)); #else for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #endif seq_printf(p, " %14s", irq_desc[i].chip->typename); seq_printf(p, " %s", action->name); Index: linux-2.6-tip/arch/frv/kernel/irq.c =================================================================== --- linux-2.6-tip.orig/arch/frv/kernel/irq.c +++ linux-2.6-tip/arch/frv/kernel/irq.c @@ -74,7 +74,7 @@ int show_interrupts(struct seq_file *p, if (action) { seq_printf(p, "%3d: ", i); for_each_present_cpu(cpu) - seq_printf(p, "%10u ", kstat_cpu(cpu).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, cpu)); seq_printf(p, " %10s", irq_desc[i].chip->name ? : "-"); seq_printf(p, " %s", action->name); for (action = action->next; Index: linux-2.6-tip/arch/h8300/include/asm/swab.h =================================================================== --- linux-2.6-tip.orig/arch/h8300/include/asm/swab.h +++ linux-2.6-tip/arch/h8300/include/asm/swab.h @@ -1,7 +1,7 @@ #ifndef _H8300_SWAB_H #define _H8300_SWAB_H -#include +#include #if defined(__GNUC__) && !defined(__STRICT_ANSI__) || defined(__KERNEL__) # define __SWAB_64_THRU_32__ Index: linux-2.6-tip/arch/h8300/kernel/irq.c =================================================================== --- linux-2.6-tip.orig/arch/h8300/kernel/irq.c +++ linux-2.6-tip/arch/h8300/kernel/irq.c @@ -183,7 +183,7 @@ asmlinkage void do_IRQ(int irq) #if defined(CONFIG_PROC_FS) int show_interrupts(struct seq_file *p, void *v) { - int i = *(loff_t *) v, j; + int i = *(loff_t *) v; struct irqaction * action; unsigned long flags; @@ -196,7 +196,7 @@ int show_interrupts(struct seq_file *p, if (!action) goto unlock; seq_printf(p, "%3d: ",i); - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs(i)); seq_printf(p, " %14s", irq_desc[i].chip->name); seq_printf(p, "-%-8s", irq_desc[i].name); seq_printf(p, " %s", action->name); Index: linux-2.6-tip/arch/ia64/Kconfig =================================================================== --- linux-2.6-tip.orig/arch/ia64/Kconfig +++ linux-2.6-tip/arch/ia64/Kconfig @@ -22,6 +22,9 @@ config IA64 select HAVE_OPROFILE select HAVE_KPROBES select HAVE_KRETPROBES + select HAVE_FTRACE_MCOUNT_RECORD + select HAVE_DYNAMIC_FTRACE if (!ITANIUM) + select HAVE_FUNCTION_TRACER select HAVE_DMA_ATTRS select HAVE_KVM select HAVE_ARCH_TRACEHOOK Index: linux-2.6-tip/arch/ia64/dig/Makefile =================================================================== --- linux-2.6-tip.orig/arch/ia64/dig/Makefile +++ linux-2.6-tip/arch/ia64/dig/Makefile @@ -7,8 +7,8 @@ obj-y := setup.o ifeq ($(CONFIG_DMAR), y) -obj-$(CONFIG_IA64_GENERIC) += machvec.o machvec_vtd.o dig_vtd_iommu.o +obj-$(CONFIG_IA64_GENERIC) += machvec.o machvec_vtd.o else obj-$(CONFIG_IA64_GENERIC) += machvec.o endif -obj-$(CONFIG_IA64_DIG_VTD) += dig_vtd_iommu.o + Index: linux-2.6-tip/arch/ia64/dig/dig_vtd_iommu.c =================================================================== --- linux-2.6-tip.orig/arch/ia64/dig/dig_vtd_iommu.c +++ /dev/null @@ -1,59 +0,0 @@ -#include -#include -#include -#include - -void * -vtd_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle, - gfp_t flags) -{ - return intel_alloc_coherent(dev, size, dma_handle, flags); -} -EXPORT_SYMBOL_GPL(vtd_alloc_coherent); - -void -vtd_free_coherent(struct device *dev, size_t size, void *vaddr, - dma_addr_t dma_handle) -{ - intel_free_coherent(dev, size, vaddr, dma_handle); -} -EXPORT_SYMBOL_GPL(vtd_free_coherent); - -dma_addr_t -vtd_map_single_attrs(struct device *dev, void *addr, size_t size, - int dir, struct dma_attrs *attrs) -{ - return intel_map_single(dev, (phys_addr_t)addr, size, dir); -} -EXPORT_SYMBOL_GPL(vtd_map_single_attrs); - -void -vtd_unmap_single_attrs(struct device *dev, dma_addr_t iova, size_t size, - int dir, struct dma_attrs *attrs) -{ - intel_unmap_single(dev, iova, size, dir); -} -EXPORT_SYMBOL_GPL(vtd_unmap_single_attrs); - -int -vtd_map_sg_attrs(struct device *dev, struct scatterlist *sglist, int nents, - int dir, struct dma_attrs *attrs) -{ - return intel_map_sg(dev, sglist, nents, dir); -} -EXPORT_SYMBOL_GPL(vtd_map_sg_attrs); - -void -vtd_unmap_sg_attrs(struct device *dev, struct scatterlist *sglist, - int nents, int dir, struct dma_attrs *attrs) -{ - intel_unmap_sg(dev, sglist, nents, dir); -} -EXPORT_SYMBOL_GPL(vtd_unmap_sg_attrs); - -int -vtd_dma_mapping_error(struct device *dev, dma_addr_t dma_addr) -{ - return 0; -} -EXPORT_SYMBOL_GPL(vtd_dma_mapping_error); Index: linux-2.6-tip/arch/ia64/hp/common/hwsw_iommu.c =================================================================== --- linux-2.6-tip.orig/arch/ia64/hp/common/hwsw_iommu.c +++ linux-2.6-tip/arch/ia64/hp/common/hwsw_iommu.c @@ -13,48 +13,33 @@ */ #include +#include #include - #include +extern struct dma_map_ops sba_dma_ops, swiotlb_dma_ops; + /* swiotlb declarations & definitions: */ extern int swiotlb_late_init_with_default_size (size_t size); -/* hwiommu declarations & definitions: */ - -extern ia64_mv_dma_alloc_coherent sba_alloc_coherent; -extern ia64_mv_dma_free_coherent sba_free_coherent; -extern ia64_mv_dma_map_single_attrs sba_map_single_attrs; -extern ia64_mv_dma_unmap_single_attrs sba_unmap_single_attrs; -extern ia64_mv_dma_map_sg_attrs sba_map_sg_attrs; -extern ia64_mv_dma_unmap_sg_attrs sba_unmap_sg_attrs; -extern ia64_mv_dma_supported sba_dma_supported; -extern ia64_mv_dma_mapping_error sba_dma_mapping_error; - -#define hwiommu_alloc_coherent sba_alloc_coherent -#define hwiommu_free_coherent sba_free_coherent -#define hwiommu_map_single_attrs sba_map_single_attrs -#define hwiommu_unmap_single_attrs sba_unmap_single_attrs -#define hwiommu_map_sg_attrs sba_map_sg_attrs -#define hwiommu_unmap_sg_attrs sba_unmap_sg_attrs -#define hwiommu_dma_supported sba_dma_supported -#define hwiommu_dma_mapping_error sba_dma_mapping_error -#define hwiommu_sync_single_for_cpu machvec_dma_sync_single -#define hwiommu_sync_sg_for_cpu machvec_dma_sync_sg -#define hwiommu_sync_single_for_device machvec_dma_sync_single -#define hwiommu_sync_sg_for_device machvec_dma_sync_sg - - /* * Note: we need to make the determination of whether or not to use * the sw I/O TLB based purely on the device structure. Anything else * would be unreliable or would be too intrusive. */ -static inline int -use_swiotlb (struct device *dev) +static inline int use_swiotlb(struct device *dev) +{ + return dev && dev->dma_mask && + !sba_dma_ops.dma_supported(dev, *dev->dma_mask); +} + +struct dma_map_ops *hwsw_dma_get_ops(struct device *dev) { - return dev && dev->dma_mask && !hwiommu_dma_supported(dev, *dev->dma_mask); + if (use_swiotlb(dev)) + return &swiotlb_dma_ops; + return &sba_dma_ops; } +EXPORT_SYMBOL(hwsw_dma_get_ops); void __init hwsw_init (void) @@ -71,125 +56,3 @@ hwsw_init (void) #endif } } - -void * -hwsw_alloc_coherent (struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t flags) -{ - if (use_swiotlb(dev)) - return swiotlb_alloc_coherent(dev, size, dma_handle, flags); - else - return hwiommu_alloc_coherent(dev, size, dma_handle, flags); -} - -void -hwsw_free_coherent (struct device *dev, size_t size, void *vaddr, dma_addr_t dma_handle) -{ - if (use_swiotlb(dev)) - swiotlb_free_coherent(dev, size, vaddr, dma_handle); - else - hwiommu_free_coherent(dev, size, vaddr, dma_handle); -} - -dma_addr_t -hwsw_map_single_attrs(struct device *dev, void *addr, size_t size, int dir, - struct dma_attrs *attrs) -{ - if (use_swiotlb(dev)) - return swiotlb_map_single_attrs(dev, addr, size, dir, attrs); - else - return hwiommu_map_single_attrs(dev, addr, size, dir, attrs); -} -EXPORT_SYMBOL(hwsw_map_single_attrs); - -void -hwsw_unmap_single_attrs(struct device *dev, dma_addr_t iova, size_t size, - int dir, struct dma_attrs *attrs) -{ - if (use_swiotlb(dev)) - return swiotlb_unmap_single_attrs(dev, iova, size, dir, attrs); - else - return hwiommu_unmap_single_attrs(dev, iova, size, dir, attrs); -} -EXPORT_SYMBOL(hwsw_unmap_single_attrs); - -int -hwsw_map_sg_attrs(struct device *dev, struct scatterlist *sglist, int nents, - int dir, struct dma_attrs *attrs) -{ - if (use_swiotlb(dev)) - return swiotlb_map_sg_attrs(dev, sglist, nents, dir, attrs); - else - return hwiommu_map_sg_attrs(dev, sglist, nents, dir, attrs); -} -EXPORT_SYMBOL(hwsw_map_sg_attrs); - -void -hwsw_unmap_sg_attrs(struct device *dev, struct scatterlist *sglist, int nents, - int dir, struct dma_attrs *attrs) -{ - if (use_swiotlb(dev)) - return swiotlb_unmap_sg_attrs(dev, sglist, nents, dir, attrs); - else - return hwiommu_unmap_sg_attrs(dev, sglist, nents, dir, attrs); -} -EXPORT_SYMBOL(hwsw_unmap_sg_attrs); - -void -hwsw_sync_single_for_cpu (struct device *dev, dma_addr_t addr, size_t size, int dir) -{ - if (use_swiotlb(dev)) - swiotlb_sync_single_for_cpu(dev, addr, size, dir); - else - hwiommu_sync_single_for_cpu(dev, addr, size, dir); -} - -void -hwsw_sync_sg_for_cpu (struct device *dev, struct scatterlist *sg, int nelems, int dir) -{ - if (use_swiotlb(dev)) - swiotlb_sync_sg_for_cpu(dev, sg, nelems, dir); - else - hwiommu_sync_sg_for_cpu(dev, sg, nelems, dir); -} - -void -hwsw_sync_single_for_device (struct device *dev, dma_addr_t addr, size_t size, int dir) -{ - if (use_swiotlb(dev)) - swiotlb_sync_single_for_device(dev, addr, size, dir); - else - hwiommu_sync_single_for_device(dev, addr, size, dir); -} - -void -hwsw_sync_sg_for_device (struct device *dev, struct scatterlist *sg, int nelems, int dir) -{ - if (use_swiotlb(dev)) - swiotlb_sync_sg_for_device(dev, sg, nelems, dir); - else - hwiommu_sync_sg_for_device(dev, sg, nelems, dir); -} - -int -hwsw_dma_supported (struct device *dev, u64 mask) -{ - if (hwiommu_dma_supported(dev, mask)) - return 1; - return swiotlb_dma_supported(dev, mask); -} - -int -hwsw_dma_mapping_error(struct device *dev, dma_addr_t dma_addr) -{ - return hwiommu_dma_mapping_error(dev, dma_addr) || - swiotlb_dma_mapping_error(dev, dma_addr); -} - -EXPORT_SYMBOL(hwsw_dma_mapping_error); -EXPORT_SYMBOL(hwsw_dma_supported); -EXPORT_SYMBOL(hwsw_alloc_coherent); -EXPORT_SYMBOL(hwsw_free_coherent); -EXPORT_SYMBOL(hwsw_sync_single_for_cpu); -EXPORT_SYMBOL(hwsw_sync_single_for_device); -EXPORT_SYMBOL(hwsw_sync_sg_for_cpu); -EXPORT_SYMBOL(hwsw_sync_sg_for_device); Index: linux-2.6-tip/arch/ia64/hp/common/sba_iommu.c =================================================================== --- linux-2.6-tip.orig/arch/ia64/hp/common/sba_iommu.c +++ linux-2.6-tip/arch/ia64/hp/common/sba_iommu.c @@ -36,6 +36,7 @@ #include /* hweight64() */ #include #include +#include #include /* ia64_get_itc() */ #include @@ -908,11 +909,13 @@ sba_mark_invalid(struct ioc *ioc, dma_ad * * See Documentation/PCI/PCI-DMA-mapping.txt */ -dma_addr_t -sba_map_single_attrs(struct device *dev, void *addr, size_t size, int dir, - struct dma_attrs *attrs) +static dma_addr_t sba_map_page(struct device *dev, struct page *page, + unsigned long poff, size_t size, + enum dma_data_direction dir, + struct dma_attrs *attrs) { struct ioc *ioc; + void *addr = page_address(page) + poff; dma_addr_t iovp; dma_addr_t offset; u64 *pdir_start; @@ -990,7 +993,14 @@ sba_map_single_attrs(struct device *dev, #endif return SBA_IOVA(ioc, iovp, offset); } -EXPORT_SYMBOL(sba_map_single_attrs); + +static dma_addr_t sba_map_single_attrs(struct device *dev, void *addr, + size_t size, enum dma_data_direction dir, + struct dma_attrs *attrs) +{ + return sba_map_page(dev, virt_to_page(addr), + (unsigned long)addr & ~PAGE_MASK, size, dir, attrs); +} #ifdef ENABLE_MARK_CLEAN static SBA_INLINE void @@ -1026,8 +1036,8 @@ sba_mark_clean(struct ioc *ioc, dma_addr * * See Documentation/PCI/PCI-DMA-mapping.txt */ -void sba_unmap_single_attrs(struct device *dev, dma_addr_t iova, size_t size, - int dir, struct dma_attrs *attrs) +static void sba_unmap_page(struct device *dev, dma_addr_t iova, size_t size, + enum dma_data_direction dir, struct dma_attrs *attrs) { struct ioc *ioc; #if DELAYED_RESOURCE_CNT > 0 @@ -1094,7 +1104,12 @@ void sba_unmap_single_attrs(struct devic spin_unlock_irqrestore(&ioc->res_lock, flags); #endif /* DELAYED_RESOURCE_CNT == 0 */ } -EXPORT_SYMBOL(sba_unmap_single_attrs); + +void sba_unmap_single_attrs(struct device *dev, dma_addr_t iova, size_t size, + enum dma_data_direction dir, struct dma_attrs *attrs) +{ + sba_unmap_page(dev, iova, size, dir, attrs); +} /** * sba_alloc_coherent - allocate/map shared mem for DMA @@ -1104,7 +1119,7 @@ EXPORT_SYMBOL(sba_unmap_single_attrs); * * See Documentation/PCI/PCI-DMA-mapping.txt */ -void * +static void * sba_alloc_coherent (struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t flags) { struct ioc *ioc; @@ -1167,7 +1182,8 @@ sba_alloc_coherent (struct device *dev, * * See Documentation/PCI/PCI-DMA-mapping.txt */ -void sba_free_coherent (struct device *dev, size_t size, void *vaddr, dma_addr_t dma_handle) +static void sba_free_coherent (struct device *dev, size_t size, void *vaddr, + dma_addr_t dma_handle) { sba_unmap_single_attrs(dev, dma_handle, size, 0, NULL); free_pages((unsigned long) vaddr, get_order(size)); @@ -1422,8 +1438,9 @@ sba_coalesce_chunks(struct ioc *ioc, str * * See Documentation/PCI/PCI-DMA-mapping.txt */ -int sba_map_sg_attrs(struct device *dev, struct scatterlist *sglist, int nents, - int dir, struct dma_attrs *attrs) +static int sba_map_sg_attrs(struct device *dev, struct scatterlist *sglist, + int nents, enum dma_data_direction dir, + struct dma_attrs *attrs) { struct ioc *ioc; int coalesced, filled = 0; @@ -1502,7 +1519,6 @@ int sba_map_sg_attrs(struct device *dev, return filled; } -EXPORT_SYMBOL(sba_map_sg_attrs); /** * sba_unmap_sg_attrs - unmap Scatter/Gather list @@ -1514,8 +1530,9 @@ EXPORT_SYMBOL(sba_map_sg_attrs); * * See Documentation/PCI/PCI-DMA-mapping.txt */ -void sba_unmap_sg_attrs(struct device *dev, struct scatterlist *sglist, - int nents, int dir, struct dma_attrs *attrs) +static void sba_unmap_sg_attrs(struct device *dev, struct scatterlist *sglist, + int nents, enum dma_data_direction dir, + struct dma_attrs *attrs) { #ifdef ASSERT_PDIR_SANITY struct ioc *ioc; @@ -1551,7 +1568,6 @@ void sba_unmap_sg_attrs(struct device *d #endif } -EXPORT_SYMBOL(sba_unmap_sg_attrs); /************************************************************** * @@ -2064,6 +2080,8 @@ static struct acpi_driver acpi_sba_ioc_d }, }; +extern struct dma_map_ops swiotlb_dma_ops; + static int __init sba_init(void) { @@ -2077,6 +2095,7 @@ sba_init(void) * a successful kdump kernel boot is to use the swiotlb. */ if (is_kdump_kernel()) { + dma_ops = &swiotlb_dma_ops; if (swiotlb_late_init_with_default_size(64 * (1<<20)) != 0) panic("Unable to initialize software I/O TLB:" " Try machvec=dig boot option"); @@ -2092,6 +2111,7 @@ sba_init(void) * If we didn't find something sba_iommu can claim, we * need to setup the swiotlb and switch to the dig machvec. */ + dma_ops = &swiotlb_dma_ops; if (swiotlb_late_init_with_default_size(64 * (1<<20)) != 0) panic("Unable to find SBA IOMMU or initialize " "software I/O TLB: Try machvec=dig boot option"); @@ -2138,15 +2158,13 @@ nosbagart(char *str) return 1; } -int -sba_dma_supported (struct device *dev, u64 mask) +static int sba_dma_supported (struct device *dev, u64 mask) { /* make sure it's at least 32bit capable */ return ((mask & 0xFFFFFFFFUL) == 0xFFFFFFFFUL); } -int -sba_dma_mapping_error(struct device *dev, dma_addr_t dma_addr) +static int sba_dma_mapping_error(struct device *dev, dma_addr_t dma_addr) { return 0; } @@ -2176,7 +2194,22 @@ sba_page_override(char *str) __setup("sbapagesize=",sba_page_override); -EXPORT_SYMBOL(sba_dma_mapping_error); -EXPORT_SYMBOL(sba_dma_supported); -EXPORT_SYMBOL(sba_alloc_coherent); -EXPORT_SYMBOL(sba_free_coherent); +struct dma_map_ops sba_dma_ops = { + .alloc_coherent = sba_alloc_coherent, + .free_coherent = sba_free_coherent, + .map_page = sba_map_page, + .unmap_page = sba_unmap_page, + .map_sg = sba_map_sg_attrs, + .unmap_sg = sba_unmap_sg_attrs, + .sync_single_for_cpu = machvec_dma_sync_single, + .sync_sg_for_cpu = machvec_dma_sync_sg, + .sync_single_for_device = machvec_dma_sync_single, + .sync_sg_for_device = machvec_dma_sync_sg, + .dma_supported = sba_dma_supported, + .mapping_error = sba_dma_mapping_error, +}; + +void sba_dma_init(void) +{ + dma_ops = &sba_dma_ops; +} Index: linux-2.6-tip/arch/ia64/include/asm/dma-mapping.h =================================================================== --- linux-2.6-tip.orig/arch/ia64/include/asm/dma-mapping.h +++ linux-2.6-tip/arch/ia64/include/asm/dma-mapping.h @@ -11,99 +11,128 @@ #define ARCH_HAS_DMA_GET_REQUIRED_MASK -struct dma_mapping_ops { - int (*mapping_error)(struct device *dev, - dma_addr_t dma_addr); - void* (*alloc_coherent)(struct device *dev, size_t size, - dma_addr_t *dma_handle, gfp_t gfp); - void (*free_coherent)(struct device *dev, size_t size, - void *vaddr, dma_addr_t dma_handle); - dma_addr_t (*map_single)(struct device *hwdev, unsigned long ptr, - size_t size, int direction); - void (*unmap_single)(struct device *dev, dma_addr_t addr, - size_t size, int direction); - void (*sync_single_for_cpu)(struct device *hwdev, - dma_addr_t dma_handle, size_t size, - int direction); - void (*sync_single_for_device)(struct device *hwdev, - dma_addr_t dma_handle, size_t size, - int direction); - void (*sync_single_range_for_cpu)(struct device *hwdev, - dma_addr_t dma_handle, unsigned long offset, - size_t size, int direction); - void (*sync_single_range_for_device)(struct device *hwdev, - dma_addr_t dma_handle, unsigned long offset, - size_t size, int direction); - void (*sync_sg_for_cpu)(struct device *hwdev, - struct scatterlist *sg, int nelems, - int direction); - void (*sync_sg_for_device)(struct device *hwdev, - struct scatterlist *sg, int nelems, - int direction); - int (*map_sg)(struct device *hwdev, struct scatterlist *sg, - int nents, int direction); - void (*unmap_sg)(struct device *hwdev, - struct scatterlist *sg, int nents, - int direction); - int (*dma_supported_op)(struct device *hwdev, u64 mask); - int is_phys; -}; - -extern struct dma_mapping_ops *dma_ops; +extern struct dma_map_ops *dma_ops; extern struct ia64_machine_vector ia64_mv; extern void set_iommu_machvec(void); -#define dma_alloc_coherent(dev, size, handle, gfp) \ - platform_dma_alloc_coherent(dev, size, handle, (gfp) | GFP_DMA) +extern void machvec_dma_sync_single(struct device *, dma_addr_t, size_t, + enum dma_data_direction); +extern void machvec_dma_sync_sg(struct device *, struct scatterlist *, int, + enum dma_data_direction); -/* coherent mem. is cheap */ -static inline void * -dma_alloc_noncoherent(struct device *dev, size_t size, dma_addr_t *dma_handle, - gfp_t flag) +static inline void *dma_alloc_coherent(struct device *dev, size_t size, + dma_addr_t *daddr, gfp_t gfp) { - return dma_alloc_coherent(dev, size, dma_handle, flag); + struct dma_map_ops *ops = platform_dma_get_ops(dev); + return ops->alloc_coherent(dev, size, daddr, gfp); } -#define dma_free_coherent platform_dma_free_coherent -static inline void -dma_free_noncoherent(struct device *dev, size_t size, void *cpu_addr, - dma_addr_t dma_handle) + +static inline void dma_free_coherent(struct device *dev, size_t size, + void *caddr, dma_addr_t daddr) { - dma_free_coherent(dev, size, cpu_addr, dma_handle); + struct dma_map_ops *ops = platform_dma_get_ops(dev); + ops->free_coherent(dev, size, caddr, daddr); } -#define dma_map_single_attrs platform_dma_map_single_attrs -static inline dma_addr_t dma_map_single(struct device *dev, void *cpu_addr, - size_t size, int dir) + +#define dma_alloc_noncoherent(d, s, h, f) dma_alloc_coherent(d, s, h, f) +#define dma_free_noncoherent(d, s, v, h) dma_free_coherent(d, s, v, h) + +static inline dma_addr_t dma_map_single_attrs(struct device *dev, + void *caddr, size_t size, + enum dma_data_direction dir, + struct dma_attrs *attrs) { - return dma_map_single_attrs(dev, cpu_addr, size, dir, NULL); + struct dma_map_ops *ops = platform_dma_get_ops(dev); + return ops->map_page(dev, virt_to_page(caddr), + (unsigned long)caddr & ~PAGE_MASK, size, + dir, attrs); } -#define dma_map_sg_attrs platform_dma_map_sg_attrs -static inline int dma_map_sg(struct device *dev, struct scatterlist *sgl, - int nents, int dir) + +static inline void dma_unmap_single_attrs(struct device *dev, dma_addr_t daddr, + size_t size, + enum dma_data_direction dir, + struct dma_attrs *attrs) { - return dma_map_sg_attrs(dev, sgl, nents, dir, NULL); + struct dma_map_ops *ops = platform_dma_get_ops(dev); + ops->unmap_page(dev, daddr, size, dir, attrs); } -#define dma_unmap_single_attrs platform_dma_unmap_single_attrs -static inline void dma_unmap_single(struct device *dev, dma_addr_t cpu_addr, - size_t size, int dir) + +#define dma_map_single(d, a, s, r) dma_map_single_attrs(d, a, s, r, NULL) +#define dma_unmap_single(d, a, s, r) dma_unmap_single_attrs(d, a, s, r, NULL) + +static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sgl, + int nents, enum dma_data_direction dir, + struct dma_attrs *attrs) { - return dma_unmap_single_attrs(dev, cpu_addr, size, dir, NULL); + struct dma_map_ops *ops = platform_dma_get_ops(dev); + return ops->map_sg(dev, sgl, nents, dir, attrs); } -#define dma_unmap_sg_attrs platform_dma_unmap_sg_attrs -static inline void dma_unmap_sg(struct device *dev, struct scatterlist *sgl, - int nents, int dir) + +static inline void dma_unmap_sg_attrs(struct device *dev, + struct scatterlist *sgl, int nents, + enum dma_data_direction dir, + struct dma_attrs *attrs) { - return dma_unmap_sg_attrs(dev, sgl, nents, dir, NULL); + struct dma_map_ops *ops = platform_dma_get_ops(dev); + ops->unmap_sg(dev, sgl, nents, dir, attrs); } -#define dma_sync_single_for_cpu platform_dma_sync_single_for_cpu -#define dma_sync_sg_for_cpu platform_dma_sync_sg_for_cpu -#define dma_sync_single_for_device platform_dma_sync_single_for_device -#define dma_sync_sg_for_device platform_dma_sync_sg_for_device -#define dma_mapping_error platform_dma_mapping_error -#define dma_map_page(dev, pg, off, size, dir) \ - dma_map_single(dev, page_address(pg) + (off), (size), (dir)) -#define dma_unmap_page(dev, dma_addr, size, dir) \ - dma_unmap_single(dev, dma_addr, size, dir) +#define dma_map_sg(d, s, n, r) dma_map_sg_attrs(d, s, n, r, NULL) +#define dma_unmap_sg(d, s, n, r) dma_unmap_sg_attrs(d, s, n, r, NULL) + +static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t daddr, + size_t size, + enum dma_data_direction dir) +{ + struct dma_map_ops *ops = platform_dma_get_ops(dev); + ops->sync_single_for_cpu(dev, daddr, size, dir); +} + +static inline void dma_sync_sg_for_cpu(struct device *dev, + struct scatterlist *sgl, + int nents, enum dma_data_direction dir) +{ + struct dma_map_ops *ops = platform_dma_get_ops(dev); + ops->sync_sg_for_cpu(dev, sgl, nents, dir); +} + +static inline void dma_sync_single_for_device(struct device *dev, + dma_addr_t daddr, + size_t size, + enum dma_data_direction dir) +{ + struct dma_map_ops *ops = platform_dma_get_ops(dev); + ops->sync_single_for_device(dev, daddr, size, dir); +} + +static inline void dma_sync_sg_for_device(struct device *dev, + struct scatterlist *sgl, + int nents, + enum dma_data_direction dir) +{ + struct dma_map_ops *ops = platform_dma_get_ops(dev); + ops->sync_sg_for_device(dev, sgl, nents, dir); +} + +static inline int dma_mapping_error(struct device *dev, dma_addr_t daddr) +{ + struct dma_map_ops *ops = platform_dma_get_ops(dev); + return ops->mapping_error(dev, daddr); +} + +static inline dma_addr_t dma_map_page(struct device *dev, struct page *page, + size_t offset, size_t size, + enum dma_data_direction dir) +{ + struct dma_map_ops *ops = platform_dma_get_ops(dev); + return ops->map_page(dev, page, offset, size, dir, NULL); +} + +static inline void dma_unmap_page(struct device *dev, dma_addr_t addr, + size_t size, enum dma_data_direction dir) +{ + dma_unmap_single(dev, addr, size, dir); +} /* * Rest of this file is part of the "Advanced DMA API". Use at your own risk. @@ -115,7 +144,11 @@ static inline void dma_unmap_sg(struct d #define dma_sync_single_range_for_device(dev, dma_handle, offset, size, dir) \ dma_sync_single_for_device(dev, dma_handle, size, dir) -#define dma_supported platform_dma_supported +static inline int dma_supported(struct device *dev, u64 mask) +{ + struct dma_map_ops *ops = platform_dma_get_ops(dev); + return ops->dma_supported(dev, mask); +} static inline int dma_set_mask (struct device *dev, u64 mask) @@ -141,11 +174,4 @@ dma_cache_sync (struct device *dev, void #define dma_is_consistent(d, h) (1) /* all we do is coherent memory... */ -static inline struct dma_mapping_ops *get_dma_ops(struct device *dev) -{ - return dma_ops; -} - - - #endif /* _ASM_IA64_DMA_MAPPING_H */ Index: linux-2.6-tip/arch/ia64/include/asm/fpu.h =================================================================== --- linux-2.6-tip.orig/arch/ia64/include/asm/fpu.h +++ linux-2.6-tip/arch/ia64/include/asm/fpu.h @@ -6,8 +6,6 @@ * David Mosberger-Tang */ -#include - /* floating point status register: */ #define FPSR_TRAP_VD (1 << 0) /* invalid op trap disabled */ #define FPSR_TRAP_DD (1 << 1) /* denormal trap disabled */ Index: linux-2.6-tip/arch/ia64/include/asm/ftrace.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/ia64/include/asm/ftrace.h @@ -0,0 +1,28 @@ +#ifndef _ASM_IA64_FTRACE_H +#define _ASM_IA64_FTRACE_H + +#ifdef CONFIG_FUNCTION_TRACER +#define MCOUNT_INSN_SIZE 32 /* sizeof mcount call */ + +#ifndef __ASSEMBLY__ +extern void _mcount(unsigned long pfs, unsigned long r1, unsigned long b0, unsigned long r0); +#define mcount _mcount + +#include +/* In IA64, MCOUNT_ADDR is set in link time, so it's not a constant at compile time */ +#define MCOUNT_ADDR (((struct fnptr *)mcount)->ip) +#define FTRACE_ADDR (((struct fnptr *)ftrace_caller)->ip) + +static inline unsigned long ftrace_call_adjust(unsigned long addr) +{ + /* second bundle, insn 2 */ + return addr - 0x12; +} + +struct dyn_arch_ftrace { +}; +#endif + +#endif /* CONFIG_FUNCTION_TRACER */ + +#endif /* _ASM_IA64_FTRACE_H */ Index: linux-2.6-tip/arch/ia64/include/asm/gcc_intrin.h =================================================================== --- linux-2.6-tip.orig/arch/ia64/include/asm/gcc_intrin.h +++ linux-2.6-tip/arch/ia64/include/asm/gcc_intrin.h @@ -6,6 +6,7 @@ * Copyright (C) 2002,2003 Suresh Siddha */ +#include #include /* define this macro to get some asm stmts included in 'c' files */ Index: linux-2.6-tip/arch/ia64/include/asm/hardirq.h =================================================================== --- linux-2.6-tip.orig/arch/ia64/include/asm/hardirq.h +++ linux-2.6-tip/arch/ia64/include/asm/hardirq.h @@ -20,16 +20,6 @@ #define local_softirq_pending() (local_cpu_data->softirq_pending) -#define HARDIRQ_BITS 14 - -/* - * The hardirq mask has to be large enough to have space for potentially all IRQ sources - * in the system nesting on a single CPU: - */ -#if (1 << HARDIRQ_BITS) < NR_IRQS -# error HARDIRQ_BITS is too low! -#endif - extern void __iomem *ipi_base_addr; void ack_bad_irq(unsigned int irq); Index: linux-2.6-tip/arch/ia64/include/asm/intrinsics.h =================================================================== --- linux-2.6-tip.orig/arch/ia64/include/asm/intrinsics.h +++ linux-2.6-tip/arch/ia64/include/asm/intrinsics.h @@ -10,6 +10,7 @@ #ifndef __ASSEMBLY__ +#include /* include compiler specific intrinsics */ #include #ifdef __INTEL_COMPILER Index: linux-2.6-tip/arch/ia64/include/asm/kvm.h =================================================================== --- linux-2.6-tip.orig/arch/ia64/include/asm/kvm.h +++ linux-2.6-tip/arch/ia64/include/asm/kvm.h @@ -21,8 +21,7 @@ * */ -#include - +#include #include /* Select x86 specific features in */ Index: linux-2.6-tip/arch/ia64/include/asm/machvec.h =================================================================== --- linux-2.6-tip.orig/arch/ia64/include/asm/machvec.h +++ linux-2.6-tip/arch/ia64/include/asm/machvec.h @@ -11,7 +11,6 @@ #define _ASM_IA64_MACHVEC_H #include -#include /* forward declarations: */ struct device; @@ -45,24 +44,8 @@ typedef void ia64_mv_kernel_launch_event /* DMA-mapping interface: */ typedef void ia64_mv_dma_init (void); -typedef void *ia64_mv_dma_alloc_coherent (struct device *, size_t, dma_addr_t *, gfp_t); -typedef void ia64_mv_dma_free_coherent (struct device *, size_t, void *, dma_addr_t); -typedef dma_addr_t ia64_mv_dma_map_single (struct device *, void *, size_t, int); -typedef void ia64_mv_dma_unmap_single (struct device *, dma_addr_t, size_t, int); -typedef int ia64_mv_dma_map_sg (struct device *, struct scatterlist *, int, int); -typedef void ia64_mv_dma_unmap_sg (struct device *, struct scatterlist *, int, int); -typedef void ia64_mv_dma_sync_single_for_cpu (struct device *, dma_addr_t, size_t, int); -typedef void ia64_mv_dma_sync_sg_for_cpu (struct device *, struct scatterlist *, int, int); -typedef void ia64_mv_dma_sync_single_for_device (struct device *, dma_addr_t, size_t, int); -typedef void ia64_mv_dma_sync_sg_for_device (struct device *, struct scatterlist *, int, int); -typedef int ia64_mv_dma_mapping_error(struct device *, dma_addr_t dma_addr); -typedef int ia64_mv_dma_supported (struct device *, u64); - -typedef dma_addr_t ia64_mv_dma_map_single_attrs (struct device *, void *, size_t, int, struct dma_attrs *); -typedef void ia64_mv_dma_unmap_single_attrs (struct device *, dma_addr_t, size_t, int, struct dma_attrs *); -typedef int ia64_mv_dma_map_sg_attrs (struct device *, struct scatterlist *, int, int, struct dma_attrs *); -typedef void ia64_mv_dma_unmap_sg_attrs (struct device *, struct scatterlist *, int, int, struct dma_attrs *); typedef u64 ia64_mv_dma_get_required_mask (struct device *); +typedef struct dma_map_ops *ia64_mv_dma_get_ops(struct device *); /* * WARNING: The legacy I/O space is _architected_. Platforms are @@ -114,8 +97,6 @@ machvec_noop_bus (struct pci_bus *bus) extern void machvec_setup (char **); extern void machvec_timer_interrupt (int, void *); -extern void machvec_dma_sync_single (struct device *, dma_addr_t, size_t, int); -extern void machvec_dma_sync_sg (struct device *, struct scatterlist *, int, int); extern void machvec_tlb_migrate_finish (struct mm_struct *); # if defined (CONFIG_IA64_HP_SIM) @@ -148,19 +129,8 @@ extern void machvec_tlb_migrate_finish ( # define platform_global_tlb_purge ia64_mv.global_tlb_purge # define platform_tlb_migrate_finish ia64_mv.tlb_migrate_finish # define platform_dma_init ia64_mv.dma_init -# define platform_dma_alloc_coherent ia64_mv.dma_alloc_coherent -# define platform_dma_free_coherent ia64_mv.dma_free_coherent -# define platform_dma_map_single_attrs ia64_mv.dma_map_single_attrs -# define platform_dma_unmap_single_attrs ia64_mv.dma_unmap_single_attrs -# define platform_dma_map_sg_attrs ia64_mv.dma_map_sg_attrs -# define platform_dma_unmap_sg_attrs ia64_mv.dma_unmap_sg_attrs -# define platform_dma_sync_single_for_cpu ia64_mv.dma_sync_single_for_cpu -# define platform_dma_sync_sg_for_cpu ia64_mv.dma_sync_sg_for_cpu -# define platform_dma_sync_single_for_device ia64_mv.dma_sync_single_for_device -# define platform_dma_sync_sg_for_device ia64_mv.dma_sync_sg_for_device -# define platform_dma_mapping_error ia64_mv.dma_mapping_error -# define platform_dma_supported ia64_mv.dma_supported # define platform_dma_get_required_mask ia64_mv.dma_get_required_mask +# define platform_dma_get_ops ia64_mv.dma_get_ops # define platform_irq_to_vector ia64_mv.irq_to_vector # define platform_local_vector_to_irq ia64_mv.local_vector_to_irq # define platform_pci_get_legacy_mem ia64_mv.pci_get_legacy_mem @@ -203,19 +173,8 @@ struct ia64_machine_vector { ia64_mv_global_tlb_purge_t *global_tlb_purge; ia64_mv_tlb_migrate_finish_t *tlb_migrate_finish; ia64_mv_dma_init *dma_init; - ia64_mv_dma_alloc_coherent *dma_alloc_coherent; - ia64_mv_dma_free_coherent *dma_free_coherent; - ia64_mv_dma_map_single_attrs *dma_map_single_attrs; - ia64_mv_dma_unmap_single_attrs *dma_unmap_single_attrs; - ia64_mv_dma_map_sg_attrs *dma_map_sg_attrs; - ia64_mv_dma_unmap_sg_attrs *dma_unmap_sg_attrs; - ia64_mv_dma_sync_single_for_cpu *dma_sync_single_for_cpu; - ia64_mv_dma_sync_sg_for_cpu *dma_sync_sg_for_cpu; - ia64_mv_dma_sync_single_for_device *dma_sync_single_for_device; - ia64_mv_dma_sync_sg_for_device *dma_sync_sg_for_device; - ia64_mv_dma_mapping_error *dma_mapping_error; - ia64_mv_dma_supported *dma_supported; ia64_mv_dma_get_required_mask *dma_get_required_mask; + ia64_mv_dma_get_ops *dma_get_ops; ia64_mv_irq_to_vector *irq_to_vector; ia64_mv_local_vector_to_irq *local_vector_to_irq; ia64_mv_pci_get_legacy_mem_t *pci_get_legacy_mem; @@ -254,19 +213,8 @@ struct ia64_machine_vector { platform_global_tlb_purge, \ platform_tlb_migrate_finish, \ platform_dma_init, \ - platform_dma_alloc_coherent, \ - platform_dma_free_coherent, \ - platform_dma_map_single_attrs, \ - platform_dma_unmap_single_attrs, \ - platform_dma_map_sg_attrs, \ - platform_dma_unmap_sg_attrs, \ - platform_dma_sync_single_for_cpu, \ - platform_dma_sync_sg_for_cpu, \ - platform_dma_sync_single_for_device, \ - platform_dma_sync_sg_for_device, \ - platform_dma_mapping_error, \ - platform_dma_supported, \ platform_dma_get_required_mask, \ + platform_dma_get_ops, \ platform_irq_to_vector, \ platform_local_vector_to_irq, \ platform_pci_get_legacy_mem, \ @@ -302,6 +250,9 @@ extern void machvec_init_from_cmdline(co # error Unknown configuration. Update arch/ia64/include/asm/machvec.h. # endif /* CONFIG_IA64_GENERIC */ +extern void swiotlb_dma_init(void); +extern struct dma_map_ops *dma_get_ops(struct device *); + /* * Define default versions so we can extend machvec for new platforms without having * to update the machvec files for all existing platforms. @@ -332,43 +283,10 @@ extern void machvec_init_from_cmdline(co # define platform_kernel_launch_event machvec_noop #endif #ifndef platform_dma_init -# define platform_dma_init swiotlb_init -#endif -#ifndef platform_dma_alloc_coherent -# define platform_dma_alloc_coherent swiotlb_alloc_coherent -#endif -#ifndef platform_dma_free_coherent -# define platform_dma_free_coherent swiotlb_free_coherent -#endif -#ifndef platform_dma_map_single_attrs -# define platform_dma_map_single_attrs swiotlb_map_single_attrs -#endif -#ifndef platform_dma_unmap_single_attrs -# define platform_dma_unmap_single_attrs swiotlb_unmap_single_attrs -#endif -#ifndef platform_dma_map_sg_attrs -# define platform_dma_map_sg_attrs swiotlb_map_sg_attrs -#endif -#ifndef platform_dma_unmap_sg_attrs -# define platform_dma_unmap_sg_attrs swiotlb_unmap_sg_attrs -#endif -#ifndef platform_dma_sync_single_for_cpu -# define platform_dma_sync_single_for_cpu swiotlb_sync_single_for_cpu -#endif -#ifndef platform_dma_sync_sg_for_cpu -# define platform_dma_sync_sg_for_cpu swiotlb_sync_sg_for_cpu -#endif -#ifndef platform_dma_sync_single_for_device -# define platform_dma_sync_single_for_device swiotlb_sync_single_for_device -#endif -#ifndef platform_dma_sync_sg_for_device -# define platform_dma_sync_sg_for_device swiotlb_sync_sg_for_device -#endif -#ifndef platform_dma_mapping_error -# define platform_dma_mapping_error swiotlb_dma_mapping_error +# define platform_dma_init swiotlb_dma_init #endif -#ifndef platform_dma_supported -# define platform_dma_supported swiotlb_dma_supported +#ifndef platform_dma_get_ops +# define platform_dma_get_ops dma_get_ops #endif #ifndef platform_dma_get_required_mask # define platform_dma_get_required_mask ia64_dma_get_required_mask Index: linux-2.6-tip/arch/ia64/include/asm/machvec_dig_vtd.h =================================================================== --- linux-2.6-tip.orig/arch/ia64/include/asm/machvec_dig_vtd.h +++ linux-2.6-tip/arch/ia64/include/asm/machvec_dig_vtd.h @@ -2,14 +2,6 @@ #define _ASM_IA64_MACHVEC_DIG_VTD_h extern ia64_mv_setup_t dig_setup; -extern ia64_mv_dma_alloc_coherent vtd_alloc_coherent; -extern ia64_mv_dma_free_coherent vtd_free_coherent; -extern ia64_mv_dma_map_single_attrs vtd_map_single_attrs; -extern ia64_mv_dma_unmap_single_attrs vtd_unmap_single_attrs; -extern ia64_mv_dma_map_sg_attrs vtd_map_sg_attrs; -extern ia64_mv_dma_unmap_sg_attrs vtd_unmap_sg_attrs; -extern ia64_mv_dma_supported iommu_dma_supported; -extern ia64_mv_dma_mapping_error vtd_dma_mapping_error; extern ia64_mv_dma_init pci_iommu_alloc; /* @@ -22,17 +14,5 @@ extern ia64_mv_dma_init pci_iommu_allo #define platform_name "dig_vtd" #define platform_setup dig_setup #define platform_dma_init pci_iommu_alloc -#define platform_dma_alloc_coherent vtd_alloc_coherent -#define platform_dma_free_coherent vtd_free_coherent -#define platform_dma_map_single_attrs vtd_map_single_attrs -#define platform_dma_unmap_single_attrs vtd_unmap_single_attrs -#define platform_dma_map_sg_attrs vtd_map_sg_attrs -#define platform_dma_unmap_sg_attrs vtd_unmap_sg_attrs -#define platform_dma_sync_single_for_cpu machvec_dma_sync_single -#define platform_dma_sync_sg_for_cpu machvec_dma_sync_sg -#define platform_dma_sync_single_for_device machvec_dma_sync_single -#define platform_dma_sync_sg_for_device machvec_dma_sync_sg -#define platform_dma_supported iommu_dma_supported -#define platform_dma_mapping_error vtd_dma_mapping_error #endif /* _ASM_IA64_MACHVEC_DIG_VTD_h */ Index: linux-2.6-tip/arch/ia64/include/asm/machvec_hpzx1.h =================================================================== --- linux-2.6-tip.orig/arch/ia64/include/asm/machvec_hpzx1.h +++ linux-2.6-tip/arch/ia64/include/asm/machvec_hpzx1.h @@ -2,14 +2,7 @@ #define _ASM_IA64_MACHVEC_HPZX1_h extern ia64_mv_setup_t dig_setup; -extern ia64_mv_dma_alloc_coherent sba_alloc_coherent; -extern ia64_mv_dma_free_coherent sba_free_coherent; -extern ia64_mv_dma_map_single_attrs sba_map_single_attrs; -extern ia64_mv_dma_unmap_single_attrs sba_unmap_single_attrs; -extern ia64_mv_dma_map_sg_attrs sba_map_sg_attrs; -extern ia64_mv_dma_unmap_sg_attrs sba_unmap_sg_attrs; -extern ia64_mv_dma_supported sba_dma_supported; -extern ia64_mv_dma_mapping_error sba_dma_mapping_error; +extern ia64_mv_dma_init sba_dma_init; /* * This stuff has dual use! @@ -20,18 +13,6 @@ extern ia64_mv_dma_mapping_error sba_dma */ #define platform_name "hpzx1" #define platform_setup dig_setup -#define platform_dma_init machvec_noop -#define platform_dma_alloc_coherent sba_alloc_coherent -#define platform_dma_free_coherent sba_free_coherent -#define platform_dma_map_single_attrs sba_map_single_attrs -#define platform_dma_unmap_single_attrs sba_unmap_single_attrs -#define platform_dma_map_sg_attrs sba_map_sg_attrs -#define platform_dma_unmap_sg_attrs sba_unmap_sg_attrs -#define platform_dma_sync_single_for_cpu machvec_dma_sync_single -#define platform_dma_sync_sg_for_cpu machvec_dma_sync_sg -#define platform_dma_sync_single_for_device machvec_dma_sync_single -#define platform_dma_sync_sg_for_device machvec_dma_sync_sg -#define platform_dma_supported sba_dma_supported -#define platform_dma_mapping_error sba_dma_mapping_error +#define platform_dma_init sba_dma_init #endif /* _ASM_IA64_MACHVEC_HPZX1_h */ Index: linux-2.6-tip/arch/ia64/include/asm/machvec_hpzx1_swiotlb.h =================================================================== --- linux-2.6-tip.orig/arch/ia64/include/asm/machvec_hpzx1_swiotlb.h +++ linux-2.6-tip/arch/ia64/include/asm/machvec_hpzx1_swiotlb.h @@ -2,18 +2,7 @@ #define _ASM_IA64_MACHVEC_HPZX1_SWIOTLB_h extern ia64_mv_setup_t dig_setup; -extern ia64_mv_dma_alloc_coherent hwsw_alloc_coherent; -extern ia64_mv_dma_free_coherent hwsw_free_coherent; -extern ia64_mv_dma_map_single_attrs hwsw_map_single_attrs; -extern ia64_mv_dma_unmap_single_attrs hwsw_unmap_single_attrs; -extern ia64_mv_dma_map_sg_attrs hwsw_map_sg_attrs; -extern ia64_mv_dma_unmap_sg_attrs hwsw_unmap_sg_attrs; -extern ia64_mv_dma_supported hwsw_dma_supported; -extern ia64_mv_dma_mapping_error hwsw_dma_mapping_error; -extern ia64_mv_dma_sync_single_for_cpu hwsw_sync_single_for_cpu; -extern ia64_mv_dma_sync_sg_for_cpu hwsw_sync_sg_for_cpu; -extern ia64_mv_dma_sync_single_for_device hwsw_sync_single_for_device; -extern ia64_mv_dma_sync_sg_for_device hwsw_sync_sg_for_device; +extern ia64_mv_dma_get_ops hwsw_dma_get_ops; /* * This stuff has dual use! @@ -23,20 +12,8 @@ extern ia64_mv_dma_sync_sg_for_device h * the macros are used directly. */ #define platform_name "hpzx1_swiotlb" - #define platform_setup dig_setup #define platform_dma_init machvec_noop -#define platform_dma_alloc_coherent hwsw_alloc_coherent -#define platform_dma_free_coherent hwsw_free_coherent -#define platform_dma_map_single_attrs hwsw_map_single_attrs -#define platform_dma_unmap_single_attrs hwsw_unmap_single_attrs -#define platform_dma_map_sg_attrs hwsw_map_sg_attrs -#define platform_dma_unmap_sg_attrs hwsw_unmap_sg_attrs -#define platform_dma_supported hwsw_dma_supported -#define platform_dma_mapping_error hwsw_dma_mapping_error -#define platform_dma_sync_single_for_cpu hwsw_sync_single_for_cpu -#define platform_dma_sync_sg_for_cpu hwsw_sync_sg_for_cpu -#define platform_dma_sync_single_for_device hwsw_sync_single_for_device -#define platform_dma_sync_sg_for_device hwsw_sync_sg_for_device +#define platform_dma_get_ops hwsw_dma_get_ops #endif /* _ASM_IA64_MACHVEC_HPZX1_SWIOTLB_h */ Index: linux-2.6-tip/arch/ia64/include/asm/machvec_sn2.h =================================================================== --- linux-2.6-tip.orig/arch/ia64/include/asm/machvec_sn2.h +++ linux-2.6-tip/arch/ia64/include/asm/machvec_sn2.h @@ -55,19 +55,8 @@ extern ia64_mv_readb_t __sn_readb_relaxe extern ia64_mv_readw_t __sn_readw_relaxed; extern ia64_mv_readl_t __sn_readl_relaxed; extern ia64_mv_readq_t __sn_readq_relaxed; -extern ia64_mv_dma_alloc_coherent sn_dma_alloc_coherent; -extern ia64_mv_dma_free_coherent sn_dma_free_coherent; -extern ia64_mv_dma_map_single_attrs sn_dma_map_single_attrs; -extern ia64_mv_dma_unmap_single_attrs sn_dma_unmap_single_attrs; -extern ia64_mv_dma_map_sg_attrs sn_dma_map_sg_attrs; -extern ia64_mv_dma_unmap_sg_attrs sn_dma_unmap_sg_attrs; -extern ia64_mv_dma_sync_single_for_cpu sn_dma_sync_single_for_cpu; -extern ia64_mv_dma_sync_sg_for_cpu sn_dma_sync_sg_for_cpu; -extern ia64_mv_dma_sync_single_for_device sn_dma_sync_single_for_device; -extern ia64_mv_dma_sync_sg_for_device sn_dma_sync_sg_for_device; -extern ia64_mv_dma_mapping_error sn_dma_mapping_error; -extern ia64_mv_dma_supported sn_dma_supported; extern ia64_mv_dma_get_required_mask sn_dma_get_required_mask; +extern ia64_mv_dma_init sn_dma_init; extern ia64_mv_migrate_t sn_migrate; extern ia64_mv_kernel_launch_event_t sn_kernel_launch_event; extern ia64_mv_setup_msi_irq_t sn_setup_msi_irq; @@ -111,20 +100,8 @@ extern ia64_mv_pci_fixup_bus_t sn_pci_f #define platform_pci_get_legacy_mem sn_pci_get_legacy_mem #define platform_pci_legacy_read sn_pci_legacy_read #define platform_pci_legacy_write sn_pci_legacy_write -#define platform_dma_init machvec_noop -#define platform_dma_alloc_coherent sn_dma_alloc_coherent -#define platform_dma_free_coherent sn_dma_free_coherent -#define platform_dma_map_single_attrs sn_dma_map_single_attrs -#define platform_dma_unmap_single_attrs sn_dma_unmap_single_attrs -#define platform_dma_map_sg_attrs sn_dma_map_sg_attrs -#define platform_dma_unmap_sg_attrs sn_dma_unmap_sg_attrs -#define platform_dma_sync_single_for_cpu sn_dma_sync_single_for_cpu -#define platform_dma_sync_sg_for_cpu sn_dma_sync_sg_for_cpu -#define platform_dma_sync_single_for_device sn_dma_sync_single_for_device -#define platform_dma_sync_sg_for_device sn_dma_sync_sg_for_device -#define platform_dma_mapping_error sn_dma_mapping_error -#define platform_dma_supported sn_dma_supported #define platform_dma_get_required_mask sn_dma_get_required_mask +#define platform_dma_init sn_dma_init #define platform_migrate sn_migrate #define platform_kernel_launch_event sn_kernel_launch_event #ifdef CONFIG_PCI_MSI Index: linux-2.6-tip/arch/ia64/include/asm/percpu.h =================================================================== --- linux-2.6-tip.orig/arch/ia64/include/asm/percpu.h +++ linux-2.6-tip/arch/ia64/include/asm/percpu.h @@ -27,12 +27,12 @@ extern void *per_cpu_init(void); #else /* ! SMP */ -#define PER_CPU_ATTRIBUTES __attribute__((__section__(".data.percpu"))) - #define per_cpu_init() (__phys_per_cpu_start) #endif /* SMP */ +#define PER_CPU_BASE_SECTION ".data.percpu" + /* * Be extremely careful when taking the address of this variable! Due to virtual * remapping, it is different from the canonical address returned by __get_cpu_var(var)! Index: linux-2.6-tip/arch/ia64/include/asm/swab.h =================================================================== --- linux-2.6-tip.orig/arch/ia64/include/asm/swab.h +++ linux-2.6-tip/arch/ia64/include/asm/swab.h @@ -6,7 +6,7 @@ * David Mosberger-Tang , Hewlett-Packard Co. */ -#include +#include #include #include Index: linux-2.6-tip/arch/ia64/include/asm/topology.h =================================================================== --- linux-2.6-tip.orig/arch/ia64/include/asm/topology.h +++ linux-2.6-tip/arch/ia64/include/asm/topology.h @@ -84,7 +84,7 @@ void build_cpu_to_node_map(void); .child = NULL, \ .groups = NULL, \ .min_interval = 8, \ - .max_interval = 8*(min(num_online_cpus(), 32)), \ + .max_interval = 8*(min(num_online_cpus(), 32U)), \ .busy_factor = 64, \ .imbalance_pct = 125, \ .cache_nice_tries = 2, \ Index: linux-2.6-tip/arch/ia64/include/asm/uv/uv.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/ia64/include/asm/uv/uv.h @@ -0,0 +1,13 @@ +#ifndef _ASM_IA64_UV_UV_H +#define _ASM_IA64_UV_UV_H + +#include +#include + +static inline int is_uv_system(void) +{ + /* temporary support for running on hardware simulator */ + return IS_MEDUSA() || ia64_platform_is("uv"); +} + +#endif /* _ASM_IA64_UV_UV_H */ Index: linux-2.6-tip/arch/ia64/kernel/Makefile =================================================================== --- linux-2.6-tip.orig/arch/ia64/kernel/Makefile +++ linux-2.6-tip/arch/ia64/kernel/Makefile @@ -2,12 +2,16 @@ # Makefile for the linux kernel. # +ifdef CONFIG_DYNAMIC_FTRACE +CFLAGS_REMOVE_ftrace.o = -pg +endif + extra-y := head.o init_task.o vmlinux.lds obj-y := acpi.o entry.o efi.o efi_stub.o gate-data.o fsys.o ia64_ksyms.o irq.o irq_ia64.o \ irq_lsapic.o ivt.o machvec.o pal.o patch.o process.o perfmon.o ptrace.o sal.o \ salinfo.o setup.o signal.o sys_ia64.o time.o traps.o unaligned.o \ - unwind.o mca.o mca_asm.o topology.o + unwind.o mca.o mca_asm.o topology.o dma-mapping.o obj-$(CONFIG_IA64_BRL_EMU) += brl_emu.o obj-$(CONFIG_IA64_GENERIC) += acpi-ext.o @@ -28,6 +32,7 @@ obj-$(CONFIG_IA64_CYCLONE) += cyclone.o obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_IA64_MCA_RECOVERY) += mca_recovery.o obj-$(CONFIG_KPROBES) += kprobes.o jprobes.o +obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o crash.o obj-$(CONFIG_CRASH_DUMP) += crash_dump.o obj-$(CONFIG_IA64_UNCACHED_ALLOCATOR) += uncached.o @@ -43,9 +48,7 @@ ifneq ($(CONFIG_IA64_ESI),) obj-y += esi_stub.o # must be in kernel proper endif obj-$(CONFIG_DMAR) += pci-dma.o -ifeq ($(CONFIG_DMAR), y) obj-$(CONFIG_SWIOTLB) += pci-swiotlb.o -endif # The gate DSO image is built using a special linker script. targets += gate.so gate-syms.o Index: linux-2.6-tip/arch/ia64/kernel/acpi.c =================================================================== --- linux-2.6-tip.orig/arch/ia64/kernel/acpi.c +++ linux-2.6-tip/arch/ia64/kernel/acpi.c @@ -199,6 +199,10 @@ char *__init __acpi_map_table(unsigned l return __va(phys_addr); } +void __init __acpi_unmap_table(char *map, unsigned long size) +{ +} + /* -------------------------------------------------------------------------- Boot-time Table Parsing -------------------------------------------------------------------------- */ Index: linux-2.6-tip/arch/ia64/kernel/dma-mapping.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/ia64/kernel/dma-mapping.c @@ -0,0 +1,13 @@ +#include + +/* Set this to 1 if there is a HW IOMMU in the system */ +int iommu_detected __read_mostly; + +struct dma_map_ops *dma_ops; +EXPORT_SYMBOL(dma_ops); + +struct dma_map_ops *dma_get_ops(struct device *dev) +{ + return dma_ops; +} +EXPORT_SYMBOL(dma_get_ops); Index: linux-2.6-tip/arch/ia64/kernel/entry.S =================================================================== --- linux-2.6-tip.orig/arch/ia64/kernel/entry.S +++ linux-2.6-tip/arch/ia64/kernel/entry.S @@ -47,6 +47,7 @@ #include #include #include +#include #include "minstate.h" @@ -1404,6 +1405,105 @@ GLOBAL_ENTRY(unw_init_running) br.ret.sptk.many rp END(unw_init_running) +#ifdef CONFIG_FUNCTION_TRACER +#ifdef CONFIG_DYNAMIC_FTRACE +GLOBAL_ENTRY(_mcount) + br ftrace_stub +END(_mcount) + +.here: + br.ret.sptk.many b0 + +GLOBAL_ENTRY(ftrace_caller) + alloc out0 = ar.pfs, 8, 0, 4, 0 + mov out3 = r0 + ;; + mov out2 = b0 + add r3 = 0x20, r3 + mov out1 = r1; + br.call.sptk.many b0 = ftrace_patch_gp + //this might be called from module, so we must patch gp +ftrace_patch_gp: + movl gp=__gp + mov b0 = r3 + ;; +.global ftrace_call; +ftrace_call: +{ + .mlx + nop.m 0x0 + movl r3 = .here;; +} + alloc loc0 = ar.pfs, 4, 4, 2, 0 + ;; + mov loc1 = b0 + mov out0 = b0 + mov loc2 = r8 + mov loc3 = r15 + ;; + adds out0 = -MCOUNT_INSN_SIZE, out0 + mov out1 = in2 + mov b6 = r3 + + br.call.sptk.many b0 = b6 + ;; + mov ar.pfs = loc0 + mov b0 = loc1 + mov r8 = loc2 + mov r15 = loc3 + br ftrace_stub + ;; +END(ftrace_caller) + +#else +GLOBAL_ENTRY(_mcount) + movl r2 = ftrace_stub + movl r3 = ftrace_trace_function;; + ld8 r3 = [r3];; + ld8 r3 = [r3];; + cmp.eq p7,p0 = r2, r3 +(p7) br.sptk.many ftrace_stub + ;; + + alloc loc0 = ar.pfs, 4, 4, 2, 0 + ;; + mov loc1 = b0 + mov out0 = b0 + mov loc2 = r8 + mov loc3 = r15 + ;; + adds out0 = -MCOUNT_INSN_SIZE, out0 + mov out1 = in2 + mov b6 = r3 + + br.call.sptk.many b0 = b6 + ;; + mov ar.pfs = loc0 + mov b0 = loc1 + mov r8 = loc2 + mov r15 = loc3 + br ftrace_stub + ;; +END(_mcount) +#endif + +GLOBAL_ENTRY(ftrace_stub) + mov r3 = b0 + movl r2 = _mcount_ret_helper + ;; + mov b6 = r2 + mov b7 = r3 + br.ret.sptk.many b6 + +_mcount_ret_helper: + mov b0 = r42 + mov r1 = r41 + mov ar.pfs = r40 + br b7 +END(ftrace_stub) + +#endif /* CONFIG_FUNCTION_TRACER */ + .rodata .align 8 .globl sys_call_table Index: linux-2.6-tip/arch/ia64/kernel/ftrace.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/ia64/kernel/ftrace.c @@ -0,0 +1,206 @@ +/* + * Dynamic function tracing support. + * + * Copyright (C) 2008 Shaohua Li + * + * For licencing details, see COPYING. + * + * Defines low-level handling of mcount calls when the kernel + * is compiled with the -pg flag. When using dynamic ftrace, the + * mcount call-sites get patched lazily with NOP till they are + * enabled. All code mutation routines here take effect atomically. + */ + +#include +#include + +#include +#include + +/* In IA64, each function will be added below two bundles with -pg option */ +static unsigned char __attribute__((aligned(8))) +ftrace_orig_code[MCOUNT_INSN_SIZE] = { + 0x02, 0x40, 0x31, 0x10, 0x80, 0x05, /* alloc r40=ar.pfs,12,8,0 */ + 0xb0, 0x02, 0x00, 0x00, 0x42, 0x40, /* mov r43=r0;; */ + 0x05, 0x00, 0xc4, 0x00, /* mov r42=b0 */ + 0x11, 0x48, 0x01, 0x02, 0x00, 0x21, /* mov r41=r1 */ + 0x00, 0x00, 0x00, 0x02, 0x00, 0x00, /* nop.i 0x0 */ + 0x08, 0x00, 0x00, 0x50 /* br.call.sptk.many b0 = _mcount;; */ +}; + +struct ftrace_orig_insn { + u64 dummy1, dummy2, dummy3; + u64 dummy4:64-41+13; + u64 imm20:20; + u64 dummy5:3; + u64 sign:1; + u64 dummy6:4; +}; + +/* mcount stub will be converted below for nop */ +static unsigned char ftrace_nop_code[MCOUNT_INSN_SIZE] = { + 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, /* [MII] nop.m 0x0 */ + 0x30, 0x00, 0x00, 0x60, 0x00, 0x00, /* mov r3=ip */ + 0x00, 0x00, 0x04, 0x00, /* nop.i 0x0 */ + 0x05, 0x00, 0x00, 0x00, 0x01, 0x00, /* [MLX] nop.m 0x0 */ + 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* nop.x 0x0;; */ + 0x00, 0x00, 0x04, 0x00 +}; + +static unsigned char *ftrace_nop_replace(void) +{ + return ftrace_nop_code; +} + +/* + * mcount stub will be converted below for call + * Note: Just the last instruction is changed against nop + * */ +static unsigned char __attribute__((aligned(8))) +ftrace_call_code[MCOUNT_INSN_SIZE] = { + 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, /* [MII] nop.m 0x0 */ + 0x30, 0x00, 0x00, 0x60, 0x00, 0x00, /* mov r3=ip */ + 0x00, 0x00, 0x04, 0x00, /* nop.i 0x0 */ + 0x05, 0x00, 0x00, 0x00, 0x01, 0x00, /* [MLX] nop.m 0x0 */ + 0xff, 0xff, 0xff, 0xff, 0x7f, 0x00, /* brl.many .;;*/ + 0xf8, 0xff, 0xff, 0xc8 +}; + +struct ftrace_call_insn { + u64 dummy1, dummy2; + u64 dummy3:48; + u64 imm39_l:16; + u64 imm39_h:23; + u64 dummy4:13; + u64 imm20:20; + u64 dummy5:3; + u64 i:1; + u64 dummy6:4; +}; + +static unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr) +{ + struct ftrace_call_insn *code = (void *)ftrace_call_code; + unsigned long offset = addr - (ip + 0x10); + + code->imm39_l = offset >> 24; + code->imm39_h = offset >> 40; + code->imm20 = offset >> 4; + code->i = offset >> 63; + return ftrace_call_code; +} + +static int +ftrace_modify_code(unsigned long ip, unsigned char *old_code, + unsigned char *new_code, int do_check) +{ + unsigned char replaced[MCOUNT_INSN_SIZE]; + + /* + * Note: Due to modules and __init, code can + * disappear and change, we need to protect against faulting + * as well as code changing. We do this by using the + * probe_kernel_* functions. + * + * No real locking needed, this code is run through + * kstop_machine, or before SMP starts. + */ + + if (!do_check) + goto skip_check; + + /* read the text we want to modify */ + if (probe_kernel_read(replaced, (void *)ip, MCOUNT_INSN_SIZE)) + return -EFAULT; + + /* Make sure it is what we expect it to be */ + if (memcmp(replaced, old_code, MCOUNT_INSN_SIZE) != 0) + return -EINVAL; + +skip_check: + /* replace the text with the new text */ + if (probe_kernel_write(((void *)ip), new_code, MCOUNT_INSN_SIZE)) + return -EPERM; + flush_icache_range(ip, ip + MCOUNT_INSN_SIZE); + + return 0; +} + +static int ftrace_make_nop_check(struct dyn_ftrace *rec, unsigned long addr) +{ + unsigned char __attribute__((aligned(8))) replaced[MCOUNT_INSN_SIZE]; + unsigned long ip = rec->ip; + + if (probe_kernel_read(replaced, (void *)ip, MCOUNT_INSN_SIZE)) + return -EFAULT; + if (rec->flags & FTRACE_FL_CONVERTED) { + struct ftrace_call_insn *call_insn, *tmp_call; + + call_insn = (void *)ftrace_call_code; + tmp_call = (void *)replaced; + call_insn->imm39_l = tmp_call->imm39_l; + call_insn->imm39_h = tmp_call->imm39_h; + call_insn->imm20 = tmp_call->imm20; + call_insn->i = tmp_call->i; + if (memcmp(replaced, ftrace_call_code, MCOUNT_INSN_SIZE) != 0) + return -EINVAL; + return 0; + } else { + struct ftrace_orig_insn *call_insn, *tmp_call; + + call_insn = (void *)ftrace_orig_code; + tmp_call = (void *)replaced; + call_insn->sign = tmp_call->sign; + call_insn->imm20 = tmp_call->imm20; + if (memcmp(replaced, ftrace_orig_code, MCOUNT_INSN_SIZE) != 0) + return -EINVAL; + return 0; + } +} + +int ftrace_make_nop(struct module *mod, + struct dyn_ftrace *rec, unsigned long addr) +{ + int ret; + char *new; + + ret = ftrace_make_nop_check(rec, addr); + if (ret) + return ret; + new = ftrace_nop_replace(); + return ftrace_modify_code(rec->ip, NULL, new, 0); +} + +int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr) +{ + unsigned long ip = rec->ip; + unsigned char *old, *new; + + old= ftrace_nop_replace(); + new = ftrace_call_replace(ip, addr); + return ftrace_modify_code(ip, old, new, 1); +} + +/* in IA64, _mcount can't directly call ftrace_stub. Only jump is ok */ +int ftrace_update_ftrace_func(ftrace_func_t func) +{ + unsigned long ip; + unsigned long addr = ((struct fnptr *)ftrace_call)->ip; + + if (func == ftrace_stub) + return 0; + ip = ((struct fnptr *)func)->ip; + + ia64_patch_imm64(addr + 2, ip); + + flush_icache_range(addr, addr + 16); + return 0; +} + +/* run from kstop_machine */ +int __init ftrace_dyn_arch_init(void *data) +{ + *(unsigned long *)data = 0; + + return 0; +} Index: linux-2.6-tip/arch/ia64/kernel/ia64_ksyms.c =================================================================== --- linux-2.6-tip.orig/arch/ia64/kernel/ia64_ksyms.c +++ linux-2.6-tip/arch/ia64/kernel/ia64_ksyms.c @@ -112,3 +112,9 @@ EXPORT_SYMBOL_GPL(esi_call_phys); #endif extern char ia64_ivt[]; EXPORT_SYMBOL(ia64_ivt); + +#include +#ifdef CONFIG_FUNCTION_TRACER +/* mcount is defined in assembly */ +EXPORT_SYMBOL(_mcount); +#endif Index: linux-2.6-tip/arch/ia64/kernel/iosapic.c =================================================================== --- linux-2.6-tip.orig/arch/ia64/kernel/iosapic.c +++ linux-2.6-tip/arch/ia64/kernel/iosapic.c @@ -880,7 +880,7 @@ iosapic_unregister_intr (unsigned int gs if (iosapic_intr_info[irq].count == 0) { #ifdef CONFIG_SMP /* Clear affinity */ - cpus_setall(idesc->affinity); + cpumask_setall(idesc->affinity); #endif /* Clear the interrupt information */ iosapic_intr_info[irq].dest = 0; Index: linux-2.6-tip/arch/ia64/kernel/irq.c =================================================================== --- linux-2.6-tip.orig/arch/ia64/kernel/irq.c +++ linux-2.6-tip/arch/ia64/kernel/irq.c @@ -80,7 +80,7 @@ int show_interrupts(struct seq_file *p, seq_printf(p, "%10u ", kstat_irqs(i)); #else for_each_online_cpu(j) { - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); } #endif seq_printf(p, " %14s", irq_desc[i].chip->name); @@ -103,7 +103,7 @@ static char irq_redir [NR_IRQS]; // = { void set_irq_affinity_info (unsigned int irq, int hwid, int redir) { if (irq < NR_IRQS) { - cpumask_copy(&irq_desc[irq].affinity, + cpumask_copy(irq_desc[irq].affinity, cpumask_of(cpu_logical_id(hwid))); irq_redir[irq] = (char) (redir & 0xff); } @@ -148,7 +148,7 @@ static void migrate_irqs(void) if (desc->status == IRQ_PER_CPU) continue; - if (cpumask_any_and(&irq_desc[irq].affinity, cpu_online_mask) + if (cpumask_any_and(irq_desc[irq].affinity, cpu_online_mask) >= nr_cpu_ids) { /* * Save it for phase 2 processing Index: linux-2.6-tip/arch/ia64/kernel/irq_ia64.c =================================================================== --- linux-2.6-tip.orig/arch/ia64/kernel/irq_ia64.c +++ linux-2.6-tip/arch/ia64/kernel/irq_ia64.c @@ -493,11 +493,13 @@ ia64_handle_irq (ia64_vector vector, str saved_tpr = ia64_getreg(_IA64_REG_CR_TPR); ia64_srlz_d(); while (vector != IA64_SPURIOUS_INT_VECTOR) { + struct irq_desc *desc = irq_to_desc(vector); + if (unlikely(IS_LOCAL_TLB_FLUSH(vector))) { smp_local_flush_tlb(); - kstat_this_cpu.irqs[vector]++; + kstat_incr_irqs_this_cpu(vector, desc); } else if (unlikely(IS_RESCHEDULE(vector))) - kstat_this_cpu.irqs[vector]++; + kstat_incr_irqs_this_cpu(vector, desc); else { int irq = local_vector_to_irq(vector); @@ -551,11 +553,13 @@ void ia64_process_pending_intr(void) * Perform normal interrupt style processing */ while (vector != IA64_SPURIOUS_INT_VECTOR) { + struct irq_desc *desc = irq_to_desc(vector); + if (unlikely(IS_LOCAL_TLB_FLUSH(vector))) { smp_local_flush_tlb(); - kstat_this_cpu.irqs[vector]++; + kstat_incr_irqs_this_cpu(vector, desc); } else if (unlikely(IS_RESCHEDULE(vector))) - kstat_this_cpu.irqs[vector]++; + kstat_incr_irqs_this_cpu(vector, desc); else { struct pt_regs *old_regs = set_irq_regs(NULL); int irq = local_vector_to_irq(vector); Index: linux-2.6-tip/arch/ia64/kernel/kprobes.c =================================================================== --- linux-2.6-tip.orig/arch/ia64/kernel/kprobes.c +++ linux-2.6-tip/arch/ia64/kernel/kprobes.c @@ -870,7 +870,7 @@ static int __kprobes pre_kprobes_handler return 1; ss_probe: -#if !defined(CONFIG_PREEMPT) || defined(CONFIG_FREEZER) +#if !defined(CONFIG_PREEMPT) || defined(CONFIG_PM) if (p->ainsn.inst_flag == INST_FLAG_BOOSTABLE && !p->post_handler) { /* Boost up -- we can execute copied instructions directly */ ia64_psr(regs)->ri = p->ainsn.slot; Index: linux-2.6-tip/arch/ia64/kernel/machvec.c =================================================================== --- linux-2.6-tip.orig/arch/ia64/kernel/machvec.c +++ linux-2.6-tip/arch/ia64/kernel/machvec.c @@ -1,5 +1,5 @@ #include - +#include #include #include @@ -75,14 +75,16 @@ machvec_timer_interrupt (int irq, void * EXPORT_SYMBOL(machvec_timer_interrupt); void -machvec_dma_sync_single (struct device *hwdev, dma_addr_t dma_handle, size_t size, int dir) +machvec_dma_sync_single(struct device *hwdev, dma_addr_t dma_handle, size_t size, + enum dma_data_direction dir) { mb(); } EXPORT_SYMBOL(machvec_dma_sync_single); void -machvec_dma_sync_sg (struct device *hwdev, struct scatterlist *sg, int n, int dir) +machvec_dma_sync_sg(struct device *hwdev, struct scatterlist *sg, int n, + enum dma_data_direction dir) { mb(); } Index: linux-2.6-tip/arch/ia64/kernel/msi_ia64.c =================================================================== --- linux-2.6-tip.orig/arch/ia64/kernel/msi_ia64.c +++ linux-2.6-tip/arch/ia64/kernel/msi_ia64.c @@ -75,7 +75,7 @@ static void ia64_set_msi_irq_affinity(un msg.data = data; write_msi_msg(irq, &msg); - irq_desc[irq].affinity = cpumask_of_cpu(cpu); + cpumask_copy(irq_desc[irq].affinity, cpumask_of(cpu)); } #endif /* CONFIG_SMP */ @@ -187,7 +187,7 @@ static void dmar_msi_set_affinity(unsign msg.address_lo |= MSI_ADDR_DESTID_CPU(cpu_physical_id(cpu)); dmar_msi_write(irq, &msg); - irq_desc[irq].affinity = *mask; + cpumask_copy(irq_desc[irq].affinity, mask); } #endif /* CONFIG_SMP */ Index: linux-2.6-tip/arch/ia64/kernel/pci-dma.c =================================================================== --- linux-2.6-tip.orig/arch/ia64/kernel/pci-dma.c +++ linux-2.6-tip/arch/ia64/kernel/pci-dma.c @@ -32,9 +32,6 @@ int force_iommu __read_mostly = 1; int force_iommu __read_mostly; #endif -/* Set this to 1 if there is a HW IOMMU in the system */ -int iommu_detected __read_mostly; - /* Dummy device used for NULL arguments (normally ISA). Better would be probably a smaller DMA mask, but this is bug-to-bug compatible to i386. */ @@ -44,18 +41,7 @@ struct device fallback_dev = { .dma_mask = &fallback_dev.coherent_dma_mask, }; -void __init pci_iommu_alloc(void) -{ - /* - * The order of these functions is important for - * fall-back/fail-over reasons - */ - detect_intel_iommu(); - -#ifdef CONFIG_SWIOTLB - pci_swiotlb_init(); -#endif -} +extern struct dma_map_ops intel_dma_ops; static int __init pci_iommu_init(void) { @@ -79,15 +65,12 @@ iommu_dma_init(void) return; } -struct dma_mapping_ops *dma_ops; -EXPORT_SYMBOL(dma_ops); - int iommu_dma_supported(struct device *dev, u64 mask) { - struct dma_mapping_ops *ops = get_dma_ops(dev); + struct dma_map_ops *ops = platform_dma_get_ops(dev); - if (ops->dma_supported_op) - return ops->dma_supported_op(dev, mask); + if (ops->dma_supported) + return ops->dma_supported(dev, mask); /* Copied from i386. Doesn't make much sense, because it will only work for pci_alloc_coherent. @@ -116,4 +99,25 @@ int iommu_dma_supported(struct device *d } EXPORT_SYMBOL(iommu_dma_supported); +void __init pci_iommu_alloc(void) +{ + dma_ops = &intel_dma_ops; + + dma_ops->sync_single_for_cpu = machvec_dma_sync_single; + dma_ops->sync_sg_for_cpu = machvec_dma_sync_sg; + dma_ops->sync_single_for_device = machvec_dma_sync_single; + dma_ops->sync_sg_for_device = machvec_dma_sync_sg; + dma_ops->dma_supported = iommu_dma_supported; + + /* + * The order of these functions is important for + * fall-back/fail-over reasons + */ + detect_intel_iommu(); + +#ifdef CONFIG_SWIOTLB + pci_swiotlb_init(); +#endif +} + #endif Index: linux-2.6-tip/arch/ia64/kernel/pci-swiotlb.c =================================================================== --- linux-2.6-tip.orig/arch/ia64/kernel/pci-swiotlb.c +++ linux-2.6-tip/arch/ia64/kernel/pci-swiotlb.c @@ -13,23 +13,37 @@ int swiotlb __read_mostly; EXPORT_SYMBOL(swiotlb); -struct dma_mapping_ops swiotlb_dma_ops = { - .mapping_error = swiotlb_dma_mapping_error, - .alloc_coherent = swiotlb_alloc_coherent, +static void *ia64_swiotlb_alloc_coherent(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t gfp) +{ + if (dev->coherent_dma_mask != DMA_64BIT_MASK) + gfp |= GFP_DMA; + return swiotlb_alloc_coherent(dev, size, dma_handle, gfp); +} + +struct dma_map_ops swiotlb_dma_ops = { + .alloc_coherent = ia64_swiotlb_alloc_coherent, .free_coherent = swiotlb_free_coherent, - .map_single = swiotlb_map_single, - .unmap_single = swiotlb_unmap_single, + .map_page = swiotlb_map_page, + .unmap_page = swiotlb_unmap_page, + .map_sg = swiotlb_map_sg_attrs, + .unmap_sg = swiotlb_unmap_sg_attrs, .sync_single_for_cpu = swiotlb_sync_single_for_cpu, .sync_single_for_device = swiotlb_sync_single_for_device, .sync_single_range_for_cpu = swiotlb_sync_single_range_for_cpu, .sync_single_range_for_device = swiotlb_sync_single_range_for_device, .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu, .sync_sg_for_device = swiotlb_sync_sg_for_device, - .map_sg = swiotlb_map_sg, - .unmap_sg = swiotlb_unmap_sg, - .dma_supported_op = swiotlb_dma_supported, + .dma_supported = swiotlb_dma_supported, + .mapping_error = swiotlb_dma_mapping_error, }; +void __init swiotlb_dma_init(void) +{ + dma_ops = &swiotlb_dma_ops; + swiotlb_init(); +} + void __init pci_swiotlb_init(void) { if (!iommu_detected) { Index: linux-2.6-tip/arch/ia64/kernel/vmlinux.lds.S =================================================================== --- linux-2.6-tip.orig/arch/ia64/kernel/vmlinux.lds.S +++ linux-2.6-tip/arch/ia64/kernel/vmlinux.lds.S @@ -219,6 +219,7 @@ SECTIONS .data.percpu PERCPU_ADDR : AT(__phys_per_cpu_start - LOAD_OFFSET) { __per_cpu_start = .; + *(.data.percpu.page_aligned) *(.data.percpu) *(.data.percpu.shared_aligned) __per_cpu_end = .; Index: linux-2.6-tip/arch/ia64/sn/kernel/msi_sn.c =================================================================== --- linux-2.6-tip.orig/arch/ia64/sn/kernel/msi_sn.c +++ linux-2.6-tip/arch/ia64/sn/kernel/msi_sn.c @@ -205,7 +205,7 @@ static void sn_set_msi_irq_affinity(unsi msg.address_lo = (u32)(bus_addr & 0x00000000ffffffff); write_msi_msg(irq, &msg); - irq_desc[irq].affinity = *cpu_mask; + cpumask_copy(irq_desc[irq].affinity, cpu_mask); } #endif /* CONFIG_SMP */ Index: linux-2.6-tip/arch/ia64/sn/pci/pci_dma.c =================================================================== --- linux-2.6-tip.orig/arch/ia64/sn/pci/pci_dma.c +++ linux-2.6-tip/arch/ia64/sn/pci/pci_dma.c @@ -10,7 +10,7 @@ */ #include -#include +#include #include #include #include @@ -31,7 +31,7 @@ * this function. Of course, SN only supports devices that have 32 or more * address bits when using the PMU. */ -int sn_dma_supported(struct device *dev, u64 mask) +static int sn_dma_supported(struct device *dev, u64 mask) { BUG_ON(dev->bus != &pci_bus_type); @@ -39,7 +39,6 @@ int sn_dma_supported(struct device *dev, return 0; return 1; } -EXPORT_SYMBOL(sn_dma_supported); /** * sn_dma_set_mask - set the DMA mask @@ -75,8 +74,8 @@ EXPORT_SYMBOL(sn_dma_set_mask); * queue for a SCSI controller). See Documentation/DMA-API.txt for * more information. */ -void *sn_dma_alloc_coherent(struct device *dev, size_t size, - dma_addr_t * dma_handle, gfp_t flags) +static void *sn_dma_alloc_coherent(struct device *dev, size_t size, + dma_addr_t * dma_handle, gfp_t flags) { void *cpuaddr; unsigned long phys_addr; @@ -124,7 +123,6 @@ void *sn_dma_alloc_coherent(struct devic return cpuaddr; } -EXPORT_SYMBOL(sn_dma_alloc_coherent); /** * sn_pci_free_coherent - free memory associated with coherent DMAable region @@ -136,8 +134,8 @@ EXPORT_SYMBOL(sn_dma_alloc_coherent); * Frees the memory allocated by dma_alloc_coherent(), potentially unmapping * any associated IOMMU mappings. */ -void sn_dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, - dma_addr_t dma_handle) +static void sn_dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t dma_handle) { struct pci_dev *pdev = to_pci_dev(dev); struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev); @@ -147,7 +145,6 @@ void sn_dma_free_coherent(struct device provider->dma_unmap(pdev, dma_handle, 0); free_pages((unsigned long)cpu_addr, get_order(size)); } -EXPORT_SYMBOL(sn_dma_free_coherent); /** * sn_dma_map_single_attrs - map a single page for DMA @@ -173,10 +170,12 @@ EXPORT_SYMBOL(sn_dma_free_coherent); * TODO: simplify our interface; * figure out how to save dmamap handle so can use two step. */ -dma_addr_t sn_dma_map_single_attrs(struct device *dev, void *cpu_addr, - size_t size, int direction, - struct dma_attrs *attrs) +static dma_addr_t sn_dma_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction dir, + struct dma_attrs *attrs) { + void *cpu_addr = page_address(page) + offset; dma_addr_t dma_addr; unsigned long phys_addr; struct pci_dev *pdev = to_pci_dev(dev); @@ -201,7 +200,6 @@ dma_addr_t sn_dma_map_single_attrs(struc } return dma_addr; } -EXPORT_SYMBOL(sn_dma_map_single_attrs); /** * sn_dma_unmap_single_attrs - unamp a DMA mapped page @@ -215,21 +213,20 @@ EXPORT_SYMBOL(sn_dma_map_single_attrs); * by @dma_handle into the coherence domain. On SN, we're always cache * coherent, so we just need to free any ATEs associated with this mapping. */ -void sn_dma_unmap_single_attrs(struct device *dev, dma_addr_t dma_addr, - size_t size, int direction, - struct dma_attrs *attrs) +static void sn_dma_unmap_page(struct device *dev, dma_addr_t dma_addr, + size_t size, enum dma_data_direction dir, + struct dma_attrs *attrs) { struct pci_dev *pdev = to_pci_dev(dev); struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev); BUG_ON(dev->bus != &pci_bus_type); - provider->dma_unmap(pdev, dma_addr, direction); + provider->dma_unmap(pdev, dma_addr, dir); } -EXPORT_SYMBOL(sn_dma_unmap_single_attrs); /** - * sn_dma_unmap_sg_attrs - unmap a DMA scatterlist + * sn_dma_unmap_sg - unmap a DMA scatterlist * @dev: device to unmap * @sg: scatterlist to unmap * @nhwentries: number of scatterlist entries @@ -238,9 +235,9 @@ EXPORT_SYMBOL(sn_dma_unmap_single_attrs) * * Unmap a set of streaming mode DMA translations. */ -void sn_dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sgl, - int nhwentries, int direction, - struct dma_attrs *attrs) +static void sn_dma_unmap_sg(struct device *dev, struct scatterlist *sgl, + int nhwentries, enum dma_data_direction dir, + struct dma_attrs *attrs) { int i; struct pci_dev *pdev = to_pci_dev(dev); @@ -250,15 +247,14 @@ void sn_dma_unmap_sg_attrs(struct device BUG_ON(dev->bus != &pci_bus_type); for_each_sg(sgl, sg, nhwentries, i) { - provider->dma_unmap(pdev, sg->dma_address, direction); + provider->dma_unmap(pdev, sg->dma_address, dir); sg->dma_address = (dma_addr_t) NULL; sg->dma_length = 0; } } -EXPORT_SYMBOL(sn_dma_unmap_sg_attrs); /** - * sn_dma_map_sg_attrs - map a scatterlist for DMA + * sn_dma_map_sg - map a scatterlist for DMA * @dev: device to map for * @sg: scatterlist to map * @nhwentries: number of entries @@ -272,8 +268,9 @@ EXPORT_SYMBOL(sn_dma_unmap_sg_attrs); * * Maps each entry of @sg for DMA. */ -int sn_dma_map_sg_attrs(struct device *dev, struct scatterlist *sgl, - int nhwentries, int direction, struct dma_attrs *attrs) +static int sn_dma_map_sg(struct device *dev, struct scatterlist *sgl, + int nhwentries, enum dma_data_direction dir, + struct dma_attrs *attrs) { unsigned long phys_addr; struct scatterlist *saved_sg = sgl, *sg; @@ -310,8 +307,7 @@ int sn_dma_map_sg_attrs(struct device *d * Free any successfully allocated entries. */ if (i > 0) - sn_dma_unmap_sg_attrs(dev, saved_sg, i, - direction, attrs); + sn_dma_unmap_sg(dev, saved_sg, i, dir, attrs); return 0; } @@ -320,41 +316,36 @@ int sn_dma_map_sg_attrs(struct device *d return nhwentries; } -EXPORT_SYMBOL(sn_dma_map_sg_attrs); -void sn_dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle, - size_t size, int direction) +static void sn_dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle, + size_t size, enum dma_data_direction dir) { BUG_ON(dev->bus != &pci_bus_type); } -EXPORT_SYMBOL(sn_dma_sync_single_for_cpu); -void sn_dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle, - size_t size, int direction) +static void sn_dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle, + size_t size, + enum dma_data_direction dir) { BUG_ON(dev->bus != &pci_bus_type); } -EXPORT_SYMBOL(sn_dma_sync_single_for_device); -void sn_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, - int nelems, int direction) +static void sn_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, + int nelems, enum dma_data_direction dir) { BUG_ON(dev->bus != &pci_bus_type); } -EXPORT_SYMBOL(sn_dma_sync_sg_for_cpu); -void sn_dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, - int nelems, int direction) +static void sn_dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, + int nelems, enum dma_data_direction dir) { BUG_ON(dev->bus != &pci_bus_type); } -EXPORT_SYMBOL(sn_dma_sync_sg_for_device); -int sn_dma_mapping_error(struct device *dev, dma_addr_t dma_addr) +static int sn_dma_mapping_error(struct device *dev, dma_addr_t dma_addr) { return 0; } -EXPORT_SYMBOL(sn_dma_mapping_error); u64 sn_dma_get_required_mask(struct device *dev) { @@ -471,3 +462,23 @@ int sn_pci_legacy_write(struct pci_bus * out: return ret; } + +static struct dma_map_ops sn_dma_ops = { + .alloc_coherent = sn_dma_alloc_coherent, + .free_coherent = sn_dma_free_coherent, + .map_page = sn_dma_map_page, + .unmap_page = sn_dma_unmap_page, + .map_sg = sn_dma_map_sg, + .unmap_sg = sn_dma_unmap_sg, + .sync_single_for_cpu = sn_dma_sync_single_for_cpu, + .sync_sg_for_cpu = sn_dma_sync_sg_for_cpu, + .sync_single_for_device = sn_dma_sync_single_for_device, + .sync_sg_for_device = sn_dma_sync_sg_for_device, + .mapping_error = sn_dma_mapping_error, + .dma_supported = sn_dma_supported, +}; + +void sn_dma_init(void) +{ + dma_ops = &sn_dma_ops; +} Index: linux-2.6-tip/arch/m32r/kernel/irq.c =================================================================== --- linux-2.6-tip.orig/arch/m32r/kernel/irq.c +++ linux-2.6-tip/arch/m32r/kernel/irq.c @@ -49,7 +49,7 @@ int show_interrupts(struct seq_file *p, seq_printf(p, "%10u ", kstat_irqs(i)); #else for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #endif seq_printf(p, " %14s", irq_desc[i].chip->typename); seq_printf(p, " %s", action->name); Index: linux-2.6-tip/arch/mips/include/asm/irq.h =================================================================== --- linux-2.6-tip.orig/arch/mips/include/asm/irq.h +++ linux-2.6-tip/arch/mips/include/asm/irq.h @@ -66,7 +66,7 @@ extern void smtc_forward_irq(unsigned in */ #define IRQ_AFFINITY_HOOK(irq) \ do { \ - if (!cpu_isset(smp_processor_id(), irq_desc[irq].affinity)) { \ + if (!cpumask_test_cpu(smp_processor_id(), irq_desc[irq].affinity)) {\ smtc_forward_irq(irq); \ irq_exit(); \ return; \ Index: linux-2.6-tip/arch/mips/include/asm/sigcontext.h =================================================================== --- linux-2.6-tip.orig/arch/mips/include/asm/sigcontext.h +++ linux-2.6-tip/arch/mips/include/asm/sigcontext.h @@ -9,6 +9,7 @@ #ifndef _ASM_SIGCONTEXT_H #define _ASM_SIGCONTEXT_H +#include #include #if _MIPS_SIM == _MIPS_SIM_ABI32 Index: linux-2.6-tip/arch/mips/include/asm/swab.h =================================================================== --- linux-2.6-tip.orig/arch/mips/include/asm/swab.h +++ linux-2.6-tip/arch/mips/include/asm/swab.h @@ -9,7 +9,7 @@ #define _ASM_SWAB_H #include -#include +#include #define __SWAB_64_THRU_32__ Index: linux-2.6-tip/arch/mips/kernel/irq-gic.c =================================================================== --- linux-2.6-tip.orig/arch/mips/kernel/irq-gic.c +++ linux-2.6-tip/arch/mips/kernel/irq-gic.c @@ -187,7 +187,7 @@ static void gic_set_affinity(unsigned in set_bit(irq, pcpu_masks[first_cpu(tmp)].pcpu_mask); } - irq_desc[irq].affinity = *cpumask; + cpumask_copy(irq_desc[irq].affinity, cpumask); spin_unlock_irqrestore(&gic_lock, flags); } Index: linux-2.6-tip/arch/mips/kernel/irq.c =================================================================== --- linux-2.6-tip.orig/arch/mips/kernel/irq.c +++ linux-2.6-tip/arch/mips/kernel/irq.c @@ -108,7 +108,7 @@ int show_interrupts(struct seq_file *p, seq_printf(p, "%10u ", kstat_irqs(i)); #else for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #endif seq_printf(p, " %14s", irq_desc[i].chip->name); seq_printf(p, "-%-8s", irq_desc[i].name); Index: linux-2.6-tip/arch/mips/kernel/smtc.c =================================================================== --- linux-2.6-tip.orig/arch/mips/kernel/smtc.c +++ linux-2.6-tip/arch/mips/kernel/smtc.c @@ -686,7 +686,7 @@ void smtc_forward_irq(unsigned int irq) * and efficiency, we just pick the easiest one to find. */ - target = first_cpu(irq_desc[irq].affinity); + target = cpumask_first(irq_desc[irq].affinity); /* * We depend on the platform code to have correctly processed @@ -921,11 +921,13 @@ void ipi_decode(struct smtc_ipi *pipi) struct clock_event_device *cd; void *arg_copy = pipi->arg; int type_copy = pipi->type; + int irq = MIPS_CPU_IRQ_BASE + 1; + smtc_ipi_nq(&freeIPIq, pipi); switch (type_copy) { case SMTC_CLOCK_TICK: irq_enter(); - kstat_this_cpu.irqs[MIPS_CPU_IRQ_BASE + 1]++; + kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq)); cd = &per_cpu(mips_clockevent_device, cpu); cd->event_handler(cd); irq_exit(); Index: linux-2.6-tip/arch/mips/mti-malta/malta-smtc.c =================================================================== --- linux-2.6-tip.orig/arch/mips/mti-malta/malta-smtc.c +++ linux-2.6-tip/arch/mips/mti-malta/malta-smtc.c @@ -116,7 +116,7 @@ struct plat_smp_ops msmtc_smp_ops = { void plat_set_irq_affinity(unsigned int irq, const struct cpumask *affinity) { - cpumask_t tmask = *affinity; + cpumask_t tmask; int cpu = 0; void smtc_set_irq_affinity(unsigned int irq, cpumask_t aff); @@ -139,11 +139,12 @@ void plat_set_irq_affinity(unsigned int * be made to forward to an offline "CPU". */ + cpumask_copy(&tmask, affinity); for_each_cpu(cpu, affinity) { if ((cpu_data[cpu].vpe_id != 0) || !cpu_online(cpu)) cpu_clear(cpu, tmask); } - irq_desc[irq].affinity = tmask; + cpumask_copy(irq_desc[irq].affinity, &tmask); if (cpus_empty(tmask)) /* Index: linux-2.6-tip/arch/mips/sgi-ip22/ip22-int.c =================================================================== --- linux-2.6-tip.orig/arch/mips/sgi-ip22/ip22-int.c +++ linux-2.6-tip/arch/mips/sgi-ip22/ip22-int.c @@ -155,7 +155,7 @@ static void indy_buserror_irq(void) int irq = SGI_BUSERR_IRQ; irq_enter(); - kstat_this_cpu.irqs[irq]++; + kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq)); ip22_be_interrupt(irq); irq_exit(); } Index: linux-2.6-tip/arch/mips/sgi-ip22/ip22-time.c =================================================================== --- linux-2.6-tip.orig/arch/mips/sgi-ip22/ip22-time.c +++ linux-2.6-tip/arch/mips/sgi-ip22/ip22-time.c @@ -122,7 +122,7 @@ void indy_8254timer_irq(void) char c; irq_enter(); - kstat_this_cpu.irqs[irq]++; + kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq)); printk(KERN_ALERT "Oops, got 8254 interrupt.\n"); ArcRead(0, &c, 1, &cnt); ArcEnterInteractiveMode(); Index: linux-2.6-tip/arch/mips/sibyte/bcm1480/smp.c =================================================================== --- linux-2.6-tip.orig/arch/mips/sibyte/bcm1480/smp.c +++ linux-2.6-tip/arch/mips/sibyte/bcm1480/smp.c @@ -178,9 +178,10 @@ struct plat_smp_ops bcm1480_smp_ops = { void bcm1480_mailbox_interrupt(void) { int cpu = smp_processor_id(); + int irq = K_BCM1480_INT_MBOX_0_0; unsigned int action; - kstat_this_cpu.irqs[K_BCM1480_INT_MBOX_0_0]++; + kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq)); /* Load the mailbox register to figure out what we're supposed to do */ action = (__raw_readq(mailbox_0_regs[cpu]) >> 48) & 0xffff; Index: linux-2.6-tip/arch/mips/sibyte/sb1250/smp.c =================================================================== --- linux-2.6-tip.orig/arch/mips/sibyte/sb1250/smp.c +++ linux-2.6-tip/arch/mips/sibyte/sb1250/smp.c @@ -166,9 +166,10 @@ struct plat_smp_ops sb_smp_ops = { void sb1250_mailbox_interrupt(void) { int cpu = smp_processor_id(); + int irq = K_INT_MBOX_0; unsigned int action; - kstat_this_cpu.irqs[K_INT_MBOX_0]++; + kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq)); /* Load the mailbox register to figure out what we're supposed to do */ action = (____raw_readq(mailbox_regs[cpu]) >> 48) & 0xffff; Index: linux-2.6-tip/arch/mn10300/kernel/irq.c =================================================================== --- linux-2.6-tip.orig/arch/mn10300/kernel/irq.c +++ linux-2.6-tip/arch/mn10300/kernel/irq.c @@ -221,7 +221,7 @@ int show_interrupts(struct seq_file *p, if (action) { seq_printf(p, "%3d: ", i); for_each_present_cpu(cpu) - seq_printf(p, "%10u ", kstat_cpu(cpu).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, cpu)); seq_printf(p, " %14s.%u", irq_desc[i].chip->name, (GxICR(i) & GxICR_LEVEL) >> GxICR_LEVEL_SHIFT); Index: linux-2.6-tip/arch/mn10300/kernel/mn10300-watchdog.c =================================================================== --- linux-2.6-tip.orig/arch/mn10300/kernel/mn10300-watchdog.c +++ linux-2.6-tip/arch/mn10300/kernel/mn10300-watchdog.c @@ -130,6 +130,7 @@ void watchdog_interrupt(struct pt_regs * * the stack NMI-atomically, it's safe to use smp_processor_id(). */ int sum, cpu = smp_processor_id(); + int irq = NMIIRQ; u8 wdt, tmp; wdt = WDCTR & ~WDCTR_WDCNE; @@ -138,7 +139,7 @@ void watchdog_interrupt(struct pt_regs * NMICR = NMICR_WDIF; nmi_count(cpu)++; - kstat_this_cpu.irqs[NMIIRQ]++; + kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq)); sum = irq_stat[cpu].__irq_count; if (last_irq_sums[cpu] == sum) { Index: linux-2.6-tip/arch/parisc/include/asm/pdc.h =================================================================== --- linux-2.6-tip.orig/arch/parisc/include/asm/pdc.h +++ linux-2.6-tip/arch/parisc/include/asm/pdc.h @@ -336,10 +336,11 @@ #define NUM_PDC_RESULT 32 #if !defined(__ASSEMBLY__) -#ifdef __KERNEL__ #include +#ifdef __KERNEL__ + extern int pdc_type; /* Values for pdc_type */ Index: linux-2.6-tip/arch/parisc/include/asm/swab.h =================================================================== --- linux-2.6-tip.orig/arch/parisc/include/asm/swab.h +++ linux-2.6-tip/arch/parisc/include/asm/swab.h @@ -1,7 +1,7 @@ #ifndef _PARISC_SWAB_H #define _PARISC_SWAB_H -#include +#include #include #define __SWAB_64_THRU_32__ Index: linux-2.6-tip/arch/parisc/kernel/irq.c =================================================================== --- linux-2.6-tip.orig/arch/parisc/kernel/irq.c +++ linux-2.6-tip/arch/parisc/kernel/irq.c @@ -120,7 +120,7 @@ int cpu_check_affinity(unsigned int irq, if (CHECK_IRQ_PER_CPU(irq)) { /* Bad linux design decision. The mask has already * been set; we must reset it */ - irq_desc[irq].affinity = CPU_MASK_ALL; + cpumask_setall(irq_desc[irq].affinity); return -EINVAL; } @@ -136,7 +136,7 @@ static void cpu_set_affinity_irq(unsigne if (cpu_check_affinity(irq, dest)) return; - irq_desc[irq].affinity = *dest; + cpumask_copy(irq_desc[irq].affinity, dest); } #endif @@ -183,7 +183,7 @@ int show_interrupts(struct seq_file *p, seq_printf(p, "%3d: ", i); #ifdef CONFIG_SMP for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #else seq_printf(p, "%10u ", kstat_irqs(i)); #endif @@ -295,7 +295,7 @@ int txn_alloc_irq(unsigned int bits_wide unsigned long txn_affinity_addr(unsigned int irq, int cpu) { #ifdef CONFIG_SMP - irq_desc[irq].affinity = cpumask_of_cpu(cpu); + cpumask_copy(irq_desc[irq].affinity, cpumask_of(cpu)); #endif return per_cpu(cpu_data, cpu).txn_addr; @@ -352,7 +352,7 @@ void do_cpu_irq_mask(struct pt_regs *reg irq = eirr_to_irq(eirr_val); #ifdef CONFIG_SMP - dest = irq_desc[irq].affinity; + cpumask_copy(&dest, irq_desc[irq].affinity); if (CHECK_IRQ_PER_CPU(irq_desc[irq].status) && !cpu_isset(smp_processor_id(), dest)) { int cpu = first_cpu(dest); Index: linux-2.6-tip/arch/powerpc/include/asm/bootx.h =================================================================== --- linux-2.6-tip.orig/arch/powerpc/include/asm/bootx.h +++ linux-2.6-tip/arch/powerpc/include/asm/bootx.h @@ -9,7 +9,7 @@ #ifndef __ASM_BOOTX_H__ #define __ASM_BOOTX_H__ -#include +#include #ifdef macintosh #include Index: linux-2.6-tip/arch/powerpc/include/asm/elf.h =================================================================== --- linux-2.6-tip.orig/arch/powerpc/include/asm/elf.h +++ linux-2.6-tip/arch/powerpc/include/asm/elf.h @@ -7,7 +7,7 @@ #include #endif -#include +#include #include #include #include Index: linux-2.6-tip/arch/powerpc/include/asm/hw_irq.h =================================================================== --- linux-2.6-tip.orig/arch/powerpc/include/asm/hw_irq.h +++ linux-2.6-tip/arch/powerpc/include/asm/hw_irq.h @@ -131,5 +131,36 @@ static inline int irqs_disabled_flags(un */ struct hw_interrupt_type; +#ifdef CONFIG_PERF_COUNTERS +static inline unsigned long get_perf_counter_pending(void) +{ + unsigned long x; + + asm volatile("lbz %0,%1(13)" + : "=r" (x) + : "i" (offsetof(struct paca_struct, perf_counter_pending))); + return x; +} + +static inline void set_perf_counter_pending(int x) +{ + asm volatile("stb %0,%1(13)" : : + "r" (x), + "i" (offsetof(struct paca_struct, perf_counter_pending))); +} + +extern void perf_counter_do_pending(void); + +#else + +static inline unsigned long get_perf_counter_pending(void) +{ + return 0; +} + +static inline void set_perf_counter_pending(int x) {} +static inline void perf_counter_do_pending(void) {} +#endif /* CONFIG_PERF_COUNTERS */ + #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_HW_IRQ_H */ Index: linux-2.6-tip/arch/powerpc/include/asm/kvm.h =================================================================== --- linux-2.6-tip.orig/arch/powerpc/include/asm/kvm.h +++ linux-2.6-tip/arch/powerpc/include/asm/kvm.h @@ -20,7 +20,7 @@ #ifndef __LINUX_KVM_POWERPC_H #define __LINUX_KVM_POWERPC_H -#include +#include struct kvm_regs { __u64 pc; Index: linux-2.6-tip/arch/powerpc/include/asm/paca.h =================================================================== --- linux-2.6-tip.orig/arch/powerpc/include/asm/paca.h +++ linux-2.6-tip/arch/powerpc/include/asm/paca.h @@ -99,6 +99,7 @@ struct paca_struct { u8 soft_enabled; /* irq soft-enable flag */ u8 hard_enabled; /* set if irqs are enabled in MSR */ u8 io_sync; /* writel() needs spin_unlock sync */ + u8 perf_counter_pending; /* PM interrupt while soft-disabled */ /* Stuff for accurate time accounting */ u64 user_time; /* accumulated usermode TB ticks */ Index: linux-2.6-tip/arch/powerpc/include/asm/perf_counter.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/powerpc/include/asm/perf_counter.h @@ -0,0 +1,72 @@ +/* + * Performance counter support - PowerPC-specific definitions. + * + * Copyright 2008-2009 Paul Mackerras, IBM Corporation. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include + +#define MAX_HWCOUNTERS 8 +#define MAX_EVENT_ALTERNATIVES 8 + +/* + * This struct provides the constants and functions needed to + * describe the PMU on a particular POWER-family CPU. + */ +struct power_pmu { + int n_counter; + int max_alternatives; + u64 add_fields; + u64 test_adder; + int (*compute_mmcr)(unsigned int events[], int n_ev, + unsigned int hwc[], u64 mmcr[]); + int (*get_constraint)(unsigned int event, u64 *mskp, u64 *valp); + int (*get_alternatives)(unsigned int event, unsigned int alt[]); + void (*disable_pmc)(unsigned int pmc, u64 mmcr[]); + int n_generic; + int *generic_events; +}; + +extern struct power_pmu *ppmu; + +/* + * The power_pmu.get_constraint function returns a 64-bit value and + * a 64-bit mask that express the constraints between this event and + * other events. + * + * The value and mask are divided up into (non-overlapping) bitfields + * of three different types: + * + * Select field: this expresses the constraint that some set of bits + * in MMCR* needs to be set to a specific value for this event. For a + * select field, the mask contains 1s in every bit of the field, and + * the value contains a unique value for each possible setting of the + * MMCR* bits. The constraint checking code will ensure that two events + * that set the same field in their masks have the same value in their + * value dwords. + * + * Add field: this expresses the constraint that there can be at most + * N events in a particular class. A field of k bits can be used for + * N <= 2^(k-1) - 1. The mask has the most significant bit of the field + * set (and the other bits 0), and the value has only the least significant + * bit of the field set. In addition, the 'add_fields' and 'test_adder' + * in the struct power_pmu for this processor come into play. The + * add_fields value contains 1 in the LSB of the field, and the + * test_adder contains 2^(k-1) - 1 - N in the field. + * + * NAND field: this expresses the constraint that you may not have events + * in all of a set of classes. (For example, on PPC970, you can't select + * events from the FPU, ISU and IDU simultaneously, although any two are + * possible.) For N classes, the field is N+1 bits wide, and each class + * is assigned one bit from the least-significant N bits. The mask has + * only the most-significant bit set, and the value has only the bit + * for the event's class set. The test_adder has the least significant + * bit set in the field. + * + * If an event is not subject to the constraint expressed by a particular + * field, then it will have 0 in both the mask and value for that field. + */ Index: linux-2.6-tip/arch/powerpc/include/asm/ps3fb.h =================================================================== --- linux-2.6-tip.orig/arch/powerpc/include/asm/ps3fb.h +++ linux-2.6-tip/arch/powerpc/include/asm/ps3fb.h @@ -19,6 +19,7 @@ #ifndef _ASM_POWERPC_PS3FB_H_ #define _ASM_POWERPC_PS3FB_H_ +#include #include /* ioctl */ Index: linux-2.6-tip/arch/powerpc/include/asm/spu_info.h =================================================================== --- linux-2.6-tip.orig/arch/powerpc/include/asm/spu_info.h +++ linux-2.6-tip/arch/powerpc/include/asm/spu_info.h @@ -23,9 +23,10 @@ #ifndef _SPU_INFO_H #define _SPU_INFO_H +#include + #ifdef __KERNEL__ #include -#include #else struct mfc_cq_sr { __u64 mfc_cq_data0_RW; Index: linux-2.6-tip/arch/powerpc/include/asm/swab.h =================================================================== --- linux-2.6-tip.orig/arch/powerpc/include/asm/swab.h +++ linux-2.6-tip/arch/powerpc/include/asm/swab.h @@ -8,7 +8,7 @@ * 2 of the License, or (at your option) any later version. */ -#include +#include #include #ifdef __GNUC__ Index: linux-2.6-tip/arch/powerpc/include/asm/systbl.h =================================================================== --- linux-2.6-tip.orig/arch/powerpc/include/asm/systbl.h +++ linux-2.6-tip/arch/powerpc/include/asm/systbl.h @@ -322,3 +322,4 @@ SYSCALL_SPU(epoll_create1) SYSCALL_SPU(dup3) SYSCALL_SPU(pipe2) SYSCALL(inotify_init1) +SYSCALL(perf_counter_open) Index: linux-2.6-tip/arch/powerpc/include/asm/unistd.h =================================================================== --- linux-2.6-tip.orig/arch/powerpc/include/asm/unistd.h +++ linux-2.6-tip/arch/powerpc/include/asm/unistd.h @@ -341,10 +341,11 @@ #define __NR_dup3 316 #define __NR_pipe2 317 #define __NR_inotify_init1 318 +#define __NR_perf_counter_open 319 #ifdef __KERNEL__ -#define __NR_syscalls 319 +#define __NR_syscalls 320 #define __NR__exit __NR_exit #define NR_syscalls __NR_syscalls Index: linux-2.6-tip/arch/powerpc/kernel/Makefile =================================================================== --- linux-2.6-tip.orig/arch/powerpc/kernel/Makefile +++ linux-2.6-tip/arch/powerpc/kernel/Makefile @@ -94,6 +94,7 @@ obj-$(CONFIG_AUDIT) += audit.o obj64-$(CONFIG_AUDIT) += compat_audit.o obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o +obj-$(CONFIG_PERF_COUNTERS) += perf_counter.o ppc970-pmu.o power6-pmu.o obj-$(CONFIG_8XX_MINIMAL_FPEMU) += softemu8xx.o Index: linux-2.6-tip/arch/powerpc/kernel/asm-offsets.c =================================================================== --- linux-2.6-tip.orig/arch/powerpc/kernel/asm-offsets.c +++ linux-2.6-tip/arch/powerpc/kernel/asm-offsets.c @@ -131,6 +131,7 @@ int main(void) DEFINE(PACAKMSR, offsetof(struct paca_struct, kernel_msr)); DEFINE(PACASOFTIRQEN, offsetof(struct paca_struct, soft_enabled)); DEFINE(PACAHARDIRQEN, offsetof(struct paca_struct, hard_enabled)); + DEFINE(PACAPERFPEND, offsetof(struct paca_struct, perf_counter_pending)); DEFINE(PACASLBCACHE, offsetof(struct paca_struct, slb_cache)); DEFINE(PACASLBCACHEPTR, offsetof(struct paca_struct, slb_cache_ptr)); DEFINE(PACACONTEXTID, offsetof(struct paca_struct, context.id)); Index: linux-2.6-tip/arch/powerpc/kernel/entry_64.S =================================================================== --- linux-2.6-tip.orig/arch/powerpc/kernel/entry_64.S +++ linux-2.6-tip/arch/powerpc/kernel/entry_64.S @@ -526,6 +526,15 @@ ALT_FW_FTR_SECTION_END_IFCLR(FW_FEATURE_ 2: TRACE_AND_RESTORE_IRQ(r5); +#ifdef CONFIG_PERF_COUNTERS + /* check paca->perf_counter_pending if we're enabling ints */ + lbz r3,PACAPERFPEND(r13) + and. r3,r3,r5 + beq 27f + bl .perf_counter_do_pending +27: +#endif /* CONFIG_PERF_COUNTERS */ + /* extract EE bit and use it to restore paca->hard_enabled */ ld r3,_MSR(r1) rldicl r4,r3,49,63 /* r0 = (r3 >> 15) & 1 */ Index: linux-2.6-tip/arch/powerpc/kernel/irq.c =================================================================== --- linux-2.6-tip.orig/arch/powerpc/kernel/irq.c +++ linux-2.6-tip/arch/powerpc/kernel/irq.c @@ -104,6 +104,13 @@ static inline notrace void set_soft_enab : : "r" (enable), "i" (offsetof(struct paca_struct, soft_enabled))); } +#ifdef CONFIG_PERF_COUNTERS +notrace void __weak perf_counter_do_pending(void) +{ + set_perf_counter_pending(0); +} +#endif + notrace void raw_local_irq_restore(unsigned long en) { /* @@ -135,6 +142,9 @@ notrace void raw_local_irq_restore(unsig iseries_handle_interrupts(); } + if (get_perf_counter_pending()) + perf_counter_do_pending(); + /* * if (get_paca()->hard_enabled) return; * But again we need to take care that gcc gets hard_enabled directly @@ -190,7 +200,7 @@ int show_interrupts(struct seq_file *p, seq_printf(p, "%3d: ", i); #ifdef CONFIG_SMP for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #else seq_printf(p, "%10u ", kstat_irqs(i)); #endif /* CONFIG_SMP */ @@ -231,7 +241,7 @@ void fixup_irqs(cpumask_t map) if (irq_desc[irq].status & IRQ_PER_CPU) continue; - cpus_and(mask, irq_desc[irq].affinity, map); + cpumask_and(&mask, irq_desc[irq].affinity, &map); if (any_online_cpu(mask) == NR_CPUS) { printk("Breaking affinity for irq %i\n", irq); mask = map; Index: linux-2.6-tip/arch/powerpc/kernel/perf_counter.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/powerpc/kernel/perf_counter.c @@ -0,0 +1,847 @@ +/* + * Performance counter support - powerpc architecture code + * + * Copyright 2008-2009 Paul Mackerras, IBM Corporation. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct cpu_hw_counters { + int n_counters; + int n_percpu; + int disabled; + int n_added; + struct perf_counter *counter[MAX_HWCOUNTERS]; + unsigned int events[MAX_HWCOUNTERS]; + u64 mmcr[3]; + u8 pmcs_enabled; +}; +DEFINE_PER_CPU(struct cpu_hw_counters, cpu_hw_counters); + +struct power_pmu *ppmu; + +void perf_counter_print_debug(void) +{ +} + +/* + * Read one performance monitor counter (PMC). + */ +static unsigned long read_pmc(int idx) +{ + unsigned long val; + + switch (idx) { + case 1: + val = mfspr(SPRN_PMC1); + break; + case 2: + val = mfspr(SPRN_PMC2); + break; + case 3: + val = mfspr(SPRN_PMC3); + break; + case 4: + val = mfspr(SPRN_PMC4); + break; + case 5: + val = mfspr(SPRN_PMC5); + break; + case 6: + val = mfspr(SPRN_PMC6); + break; + case 7: + val = mfspr(SPRN_PMC7); + break; + case 8: + val = mfspr(SPRN_PMC8); + break; + default: + printk(KERN_ERR "oops trying to read PMC%d\n", idx); + val = 0; + } + return val; +} + +/* + * Write one PMC. + */ +static void write_pmc(int idx, unsigned long val) +{ + switch (idx) { + case 1: + mtspr(SPRN_PMC1, val); + break; + case 2: + mtspr(SPRN_PMC2, val); + break; + case 3: + mtspr(SPRN_PMC3, val); + break; + case 4: + mtspr(SPRN_PMC4, val); + break; + case 5: + mtspr(SPRN_PMC5, val); + break; + case 6: + mtspr(SPRN_PMC6, val); + break; + case 7: + mtspr(SPRN_PMC7, val); + break; + case 8: + mtspr(SPRN_PMC8, val); + break; + default: + printk(KERN_ERR "oops trying to write PMC%d\n", idx); + } +} + +/* + * Check if a set of events can all go on the PMU at once. + * If they can't, this will look at alternative codes for the events + * and see if any combination of alternative codes is feasible. + * The feasible set is returned in event[]. + */ +static int power_check_constraints(unsigned int event[], int n_ev) +{ + u64 mask, value, nv; + unsigned int alternatives[MAX_HWCOUNTERS][MAX_EVENT_ALTERNATIVES]; + u64 amasks[MAX_HWCOUNTERS][MAX_EVENT_ALTERNATIVES]; + u64 avalues[MAX_HWCOUNTERS][MAX_EVENT_ALTERNATIVES]; + u64 smasks[MAX_HWCOUNTERS], svalues[MAX_HWCOUNTERS]; + int n_alt[MAX_HWCOUNTERS], choice[MAX_HWCOUNTERS]; + int i, j; + u64 addf = ppmu->add_fields; + u64 tadd = ppmu->test_adder; + + if (n_ev > ppmu->n_counter) + return -1; + + /* First see if the events will go on as-is */ + for (i = 0; i < n_ev; ++i) { + alternatives[i][0] = event[i]; + if (ppmu->get_constraint(event[i], &amasks[i][0], + &avalues[i][0])) + return -1; + choice[i] = 0; + } + value = mask = 0; + for (i = 0; i < n_ev; ++i) { + nv = (value | avalues[i][0]) + (value & avalues[i][0] & addf); + if ((((nv + tadd) ^ value) & mask) != 0 || + (((nv + tadd) ^ avalues[i][0]) & amasks[i][0]) != 0) + break; + value = nv; + mask |= amasks[i][0]; + } + if (i == n_ev) + return 0; /* all OK */ + + /* doesn't work, gather alternatives... */ + if (!ppmu->get_alternatives) + return -1; + for (i = 0; i < n_ev; ++i) { + n_alt[i] = ppmu->get_alternatives(event[i], alternatives[i]); + for (j = 1; j < n_alt[i]; ++j) + ppmu->get_constraint(alternatives[i][j], + &amasks[i][j], &avalues[i][j]); + } + + /* enumerate all possibilities and see if any will work */ + i = 0; + j = -1; + value = mask = nv = 0; + while (i < n_ev) { + if (j >= 0) { + /* we're backtracking, restore context */ + value = svalues[i]; + mask = smasks[i]; + j = choice[i]; + } + /* + * See if any alternative k for event i, + * where k > j, will satisfy the constraints. + */ + while (++j < n_alt[i]) { + nv = (value | avalues[i][j]) + + (value & avalues[i][j] & addf); + if ((((nv + tadd) ^ value) & mask) == 0 && + (((nv + tadd) ^ avalues[i][j]) + & amasks[i][j]) == 0) + break; + } + if (j >= n_alt[i]) { + /* + * No feasible alternative, backtrack + * to event i-1 and continue enumerating its + * alternatives from where we got up to. + */ + if (--i < 0) + return -1; + } else { + /* + * Found a feasible alternative for event i, + * remember where we got up to with this event, + * go on to the next event, and start with + * the first alternative for it. + */ + choice[i] = j; + svalues[i] = value; + smasks[i] = mask; + value = nv; + mask |= amasks[i][j]; + ++i; + j = -1; + } + } + + /* OK, we have a feasible combination, tell the caller the solution */ + for (i = 0; i < n_ev; ++i) + event[i] = alternatives[i][choice[i]]; + return 0; +} + +/* + * Check if newly-added counters have consistent settings for + * exclude_{user,kernel,hv} with each other and any previously + * added counters. + */ +static int check_excludes(struct perf_counter **ctrs, int n_prev, int n_new) +{ + int eu, ek, eh; + int i, n; + struct perf_counter *counter; + + n = n_prev + n_new; + if (n <= 1) + return 0; + + eu = ctrs[0]->hw_event.exclude_user; + ek = ctrs[0]->hw_event.exclude_kernel; + eh = ctrs[0]->hw_event.exclude_hv; + if (n_prev == 0) + n_prev = 1; + for (i = n_prev; i < n; ++i) { + counter = ctrs[i]; + if (counter->hw_event.exclude_user != eu || + counter->hw_event.exclude_kernel != ek || + counter->hw_event.exclude_hv != eh) + return -EAGAIN; + } + return 0; +} + +static void power_perf_read(struct perf_counter *counter) +{ + long val, delta, prev; + + if (!counter->hw.idx) + return; + /* + * Performance monitor interrupts come even when interrupts + * are soft-disabled, as long as interrupts are hard-enabled. + * Therefore we treat them like NMIs. + */ + do { + prev = atomic64_read(&counter->hw.prev_count); + barrier(); + val = read_pmc(counter->hw.idx); + } while (atomic64_cmpxchg(&counter->hw.prev_count, prev, val) != prev); + + /* The counters are only 32 bits wide */ + delta = (val - prev) & 0xfffffffful; + atomic64_add(delta, &counter->count); + atomic64_sub(delta, &counter->hw.period_left); +} + +/* + * Disable all counters to prevent PMU interrupts and to allow + * counters to be added or removed. + */ +u64 hw_perf_save_disable(void) +{ + struct cpu_hw_counters *cpuhw; + unsigned long ret; + unsigned long flags; + + local_irq_save(flags); + cpuhw = &__get_cpu_var(cpu_hw_counters); + + ret = cpuhw->disabled; + if (!ret) { + cpuhw->disabled = 1; + cpuhw->n_added = 0; + + /* + * Check if we ever enabled the PMU on this cpu. + */ + if (!cpuhw->pmcs_enabled) { + if (ppc_md.enable_pmcs) + ppc_md.enable_pmcs(); + cpuhw->pmcs_enabled = 1; + } + + /* + * Set the 'freeze counters' bit. + * The barrier is to make sure the mtspr has been + * executed and the PMU has frozen the counters + * before we return. + */ + mtspr(SPRN_MMCR0, mfspr(SPRN_MMCR0) | MMCR0_FC); + mb(); + } + local_irq_restore(flags); + return ret; +} + +/* + * Re-enable all counters if disable == 0. + * If we were previously disabled and counters were added, then + * put the new config on the PMU. + */ +void hw_perf_restore(u64 disable) +{ + struct perf_counter *counter; + struct cpu_hw_counters *cpuhw; + unsigned long flags; + long i; + unsigned long val; + s64 left; + unsigned int hwc_index[MAX_HWCOUNTERS]; + + if (disable) + return; + local_irq_save(flags); + cpuhw = &__get_cpu_var(cpu_hw_counters); + cpuhw->disabled = 0; + + /* + * If we didn't change anything, or only removed counters, + * no need to recalculate MMCR* settings and reset the PMCs. + * Just reenable the PMU with the current MMCR* settings + * (possibly updated for removal of counters). + */ + if (!cpuhw->n_added) { + mtspr(SPRN_MMCRA, cpuhw->mmcr[2]); + mtspr(SPRN_MMCR1, cpuhw->mmcr[1]); + mtspr(SPRN_MMCR0, cpuhw->mmcr[0]); + if (cpuhw->n_counters == 0) + get_lppaca()->pmcregs_in_use = 0; + goto out; + } + + /* + * Compute MMCR* values for the new set of counters + */ + if (ppmu->compute_mmcr(cpuhw->events, cpuhw->n_counters, hwc_index, + cpuhw->mmcr)) { + /* shouldn't ever get here */ + printk(KERN_ERR "oops compute_mmcr failed\n"); + goto out; + } + + /* + * Add in MMCR0 freeze bits corresponding to the + * hw_event.exclude_* bits for the first counter. + * We have already checked that all counters have the + * same values for these bits as the first counter. + */ + counter = cpuhw->counter[0]; + if (counter->hw_event.exclude_user) + cpuhw->mmcr[0] |= MMCR0_FCP; + if (counter->hw_event.exclude_kernel) + cpuhw->mmcr[0] |= MMCR0_FCS; + if (counter->hw_event.exclude_hv) + cpuhw->mmcr[0] |= MMCR0_FCHV; + + /* + * Write the new configuration to MMCR* with the freeze + * bit set and set the hardware counters to their initial values. + * Then unfreeze the counters. + */ + get_lppaca()->pmcregs_in_use = 1; + mtspr(SPRN_MMCRA, cpuhw->mmcr[2]); + mtspr(SPRN_MMCR1, cpuhw->mmcr[1]); + mtspr(SPRN_MMCR0, (cpuhw->mmcr[0] & ~(MMCR0_PMC1CE | MMCR0_PMCjCE)) + | MMCR0_FC); + + /* + * Read off any pre-existing counters that need to move + * to another PMC. + */ + for (i = 0; i < cpuhw->n_counters; ++i) { + counter = cpuhw->counter[i]; + if (counter->hw.idx && counter->hw.idx != hwc_index[i] + 1) { + power_perf_read(counter); + write_pmc(counter->hw.idx, 0); + counter->hw.idx = 0; + } + } + + /* + * Initialize the PMCs for all the new and moved counters. + */ + for (i = 0; i < cpuhw->n_counters; ++i) { + counter = cpuhw->counter[i]; + if (counter->hw.idx) + continue; + val = 0; + if (counter->hw_event.irq_period) { + left = atomic64_read(&counter->hw.period_left); + if (left < 0x80000000L) + val = 0x80000000L - left; + } + atomic64_set(&counter->hw.prev_count, val); + counter->hw.idx = hwc_index[i] + 1; + write_pmc(counter->hw.idx, val); + } + mb(); + cpuhw->mmcr[0] |= MMCR0_PMXE | MMCR0_FCECE; + mtspr(SPRN_MMCR0, cpuhw->mmcr[0]); + + out: + local_irq_restore(flags); +} + +static int collect_events(struct perf_counter *group, int max_count, + struct perf_counter *ctrs[], unsigned int *events) +{ + int n = 0; + struct perf_counter *counter; + + if (!is_software_counter(group)) { + if (n >= max_count) + return -1; + ctrs[n] = group; + events[n++] = group->hw.config; + } + list_for_each_entry(counter, &group->sibling_list, list_entry) { + if (!is_software_counter(counter) && + counter->state != PERF_COUNTER_STATE_OFF) { + if (n >= max_count) + return -1; + ctrs[n] = counter; + events[n++] = counter->hw.config; + } + } + return n; +} + +static void counter_sched_in(struct perf_counter *counter, int cpu) +{ + counter->state = PERF_COUNTER_STATE_ACTIVE; + counter->oncpu = cpu; + if (is_software_counter(counter)) + counter->hw_ops->enable(counter); +} + +/* + * Called to enable a whole group of counters. + * Returns 1 if the group was enabled, or -EAGAIN if it could not be. + * Assumes the caller has disabled interrupts and has + * frozen the PMU with hw_perf_save_disable. + */ +int hw_perf_group_sched_in(struct perf_counter *group_leader, + struct perf_cpu_context *cpuctx, + struct perf_counter_context *ctx, int cpu) +{ + struct cpu_hw_counters *cpuhw; + long i, n, n0; + struct perf_counter *sub; + + cpuhw = &__get_cpu_var(cpu_hw_counters); + n0 = cpuhw->n_counters; + n = collect_events(group_leader, ppmu->n_counter - n0, + &cpuhw->counter[n0], &cpuhw->events[n0]); + if (n < 0) + return -EAGAIN; + if (check_excludes(cpuhw->counter, n0, n)) + return -EAGAIN; + if (power_check_constraints(cpuhw->events, n + n0)) + return -EAGAIN; + cpuhw->n_counters = n0 + n; + cpuhw->n_added += n; + + /* + * OK, this group can go on; update counter states etc., + * and enable any software counters + */ + for (i = n0; i < n0 + n; ++i) + cpuhw->counter[i]->hw.config = cpuhw->events[i]; + cpuctx->active_oncpu += n; + n = 1; + counter_sched_in(group_leader, cpu); + list_for_each_entry(sub, &group_leader->sibling_list, list_entry) { + if (sub->state != PERF_COUNTER_STATE_OFF) { + counter_sched_in(sub, cpu); + ++n; + } + } + ctx->nr_active += n; + + return 1; +} + +/* + * Add a counter to the PMU. + * If all counters are not already frozen, then we disable and + * re-enable the PMU in order to get hw_perf_restore to do the + * actual work of reconfiguring the PMU. + */ +static int power_perf_enable(struct perf_counter *counter) +{ + struct cpu_hw_counters *cpuhw; + unsigned long flags; + u64 pmudis; + int n0; + int ret = -EAGAIN; + + local_irq_save(flags); + pmudis = hw_perf_save_disable(); + + /* + * Add the counter to the list (if there is room) + * and check whether the total set is still feasible. + */ + cpuhw = &__get_cpu_var(cpu_hw_counters); + n0 = cpuhw->n_counters; + if (n0 >= ppmu->n_counter) + goto out; + cpuhw->counter[n0] = counter; + cpuhw->events[n0] = counter->hw.config; + if (check_excludes(cpuhw->counter, n0, 1)) + goto out; + if (power_check_constraints(cpuhw->events, n0 + 1)) + goto out; + + counter->hw.config = cpuhw->events[n0]; + ++cpuhw->n_counters; + ++cpuhw->n_added; + + ret = 0; + out: + hw_perf_restore(pmudis); + local_irq_restore(flags); + return ret; +} + +/* + * Remove a counter from the PMU. + */ +static void power_perf_disable(struct perf_counter *counter) +{ + struct cpu_hw_counters *cpuhw; + long i; + u64 pmudis; + unsigned long flags; + + local_irq_save(flags); + pmudis = hw_perf_save_disable(); + + power_perf_read(counter); + + cpuhw = &__get_cpu_var(cpu_hw_counters); + for (i = 0; i < cpuhw->n_counters; ++i) { + if (counter == cpuhw->counter[i]) { + while (++i < cpuhw->n_counters) + cpuhw->counter[i-1] = cpuhw->counter[i]; + --cpuhw->n_counters; + ppmu->disable_pmc(counter->hw.idx - 1, cpuhw->mmcr); + write_pmc(counter->hw.idx, 0); + counter->hw.idx = 0; + break; + } + } + if (cpuhw->n_counters == 0) { + /* disable exceptions if no counters are running */ + cpuhw->mmcr[0] &= ~(MMCR0_PMXE | MMCR0_FCECE); + } + + hw_perf_restore(pmudis); + local_irq_restore(flags); +} + +struct hw_perf_counter_ops power_perf_ops = { + .enable = power_perf_enable, + .disable = power_perf_disable, + .read = power_perf_read +}; + +const struct hw_perf_counter_ops * +hw_perf_counter_init(struct perf_counter *counter) +{ + unsigned long ev; + struct perf_counter *ctrs[MAX_HWCOUNTERS]; + unsigned int events[MAX_HWCOUNTERS]; + int n; + + if (!ppmu) + return NULL; + if ((s64)counter->hw_event.irq_period < 0) + return NULL; + ev = counter->hw_event.type; + if (!counter->hw_event.raw) { + if (ev >= ppmu->n_generic || + ppmu->generic_events[ev] == 0) + return NULL; + ev = ppmu->generic_events[ev]; + } + counter->hw.config_base = ev; + counter->hw.idx = 0; + + /* + * If we are not running on a hypervisor, force the + * exclude_hv bit to 0 so that we don't care what + * the user set it to. This also means that we don't + * set the MMCR0_FCHV bit, which unconditionally freezes + * the counters on the PPC970 variants used in Apple G5 + * machines (since MSR.HV is always 1 on those machines). + */ + if (!firmware_has_feature(FW_FEATURE_LPAR)) + counter->hw_event.exclude_hv = 0; + + /* + * If this is in a group, check if it can go on with all the + * other hardware counters in the group. We assume the counter + * hasn't been linked into its leader's sibling list at this point. + */ + n = 0; + if (counter->group_leader != counter) { + n = collect_events(counter->group_leader, ppmu->n_counter - 1, + ctrs, events); + if (n < 0) + return NULL; + } + events[n] = ev; + if (check_excludes(ctrs, n, 1)) + return NULL; + if (power_check_constraints(events, n + 1)) + return NULL; + + counter->hw.config = events[n]; + atomic64_set(&counter->hw.period_left, counter->hw_event.irq_period); + return &power_perf_ops; +} + +/* + * Handle wakeups. + */ +void perf_counter_do_pending(void) +{ + int i; + struct cpu_hw_counters *cpuhw = &__get_cpu_var(cpu_hw_counters); + struct perf_counter *counter; + + set_perf_counter_pending(0); + for (i = 0; i < cpuhw->n_counters; ++i) { + counter = cpuhw->counter[i]; + if (counter && counter->wakeup_pending) { + counter->wakeup_pending = 0; + wake_up(&counter->waitq); + } + } +} + +/* + * Record data for an irq counter. + * This function was lifted from the x86 code; maybe it should + * go in the core? + */ +static void perf_store_irq_data(struct perf_counter *counter, u64 data) +{ + struct perf_data *irqdata = counter->irqdata; + + if (irqdata->len > PERF_DATA_BUFLEN - sizeof(u64)) { + irqdata->overrun++; + } else { + u64 *p = (u64 *) &irqdata->data[irqdata->len]; + + *p = data; + irqdata->len += sizeof(u64); + } +} + +/* + * Record all the values of the counters in a group + */ +static void perf_handle_group(struct perf_counter *counter) +{ + struct perf_counter *leader, *sub; + + leader = counter->group_leader; + list_for_each_entry(sub, &leader->sibling_list, list_entry) { + if (sub != counter) + sub->hw_ops->read(sub); + perf_store_irq_data(counter, sub->hw_event.type); + perf_store_irq_data(counter, atomic64_read(&sub->count)); + } +} + +/* + * A counter has overflowed; update its count and record + * things if requested. Note that interrupts are hard-disabled + * here so there is no possibility of being interrupted. + */ +static void record_and_restart(struct perf_counter *counter, long val, + struct pt_regs *regs) +{ + s64 prev, delta, left; + int record = 0; + + /* we don't have to worry about interrupts here */ + prev = atomic64_read(&counter->hw.prev_count); + delta = (val - prev) & 0xfffffffful; + atomic64_add(delta, &counter->count); + + /* + * See if the total period for this counter has expired, + * and update for the next period. + */ + val = 0; + left = atomic64_read(&counter->hw.period_left) - delta; + if (counter->hw_event.irq_period) { + if (left <= 0) { + left += counter->hw_event.irq_period; + if (left <= 0) + left = counter->hw_event.irq_period; + record = 1; + } + if (left < 0x80000000L) + val = 0x80000000L - left; + } + write_pmc(counter->hw.idx, val); + atomic64_set(&counter->hw.prev_count, val); + atomic64_set(&counter->hw.period_left, left); + + /* + * Finally record data if requested. + */ + if (record) { + switch (counter->hw_event.record_type) { + case PERF_RECORD_SIMPLE: + break; + case PERF_RECORD_IRQ: + perf_store_irq_data(counter, instruction_pointer(regs)); + counter->wakeup_pending = 1; + break; + case PERF_RECORD_GROUP: + perf_handle_group(counter); + counter->wakeup_pending = 1; + break; + } + } +} + +/* + * Performance monitor interrupt stuff + */ +static void perf_counter_interrupt(struct pt_regs *regs) +{ + int i; + struct cpu_hw_counters *cpuhw = &__get_cpu_var(cpu_hw_counters); + struct perf_counter *counter; + long val; + int need_wakeup = 0, found = 0; + + for (i = 0; i < cpuhw->n_counters; ++i) { + counter = cpuhw->counter[i]; + val = read_pmc(counter->hw.idx); + if ((int)val < 0) { + /* counter has overflowed */ + found = 1; + record_and_restart(counter, val, regs); + if (counter->wakeup_pending) + need_wakeup = 1; + } + } + + /* + * In case we didn't find and reset the counter that caused + * the interrupt, scan all counters and reset any that are + * negative, to avoid getting continual interrupts. + * Any that we processed in the previous loop will not be negative. + */ + if (!found) { + for (i = 0; i < ppmu->n_counter; ++i) { + val = read_pmc(i + 1); + if ((int)val < 0) + write_pmc(i + 1, 0); + } + } + + /* + * Reset MMCR0 to its normal value. This will set PMXE and + * clear FC (freeze counters) and PMAO (perf mon alert occurred) + * and thus allow interrupts to occur again. + * XXX might want to use MSR.PM to keep the counters frozen until + * we get back out of this interrupt. + */ + mtspr(SPRN_MMCR0, cpuhw->mmcr[0]); + + /* + * If we need a wakeup, check whether interrupts were soft-enabled + * when we took the interrupt. If they were, we can wake stuff up + * immediately; otherwise we'll have to set a flag and do the + * wakeup when interrupts get soft-enabled. + */ + if (need_wakeup) { + if (regs->softe) { + irq_enter(); + perf_counter_do_pending(); + irq_exit(); + } else { + set_perf_counter_pending(1); + } + } +} + +void hw_perf_counter_setup(int cpu) +{ + struct cpu_hw_counters *cpuhw = &per_cpu(cpu_hw_counters, cpu); + + memset(cpuhw, 0, sizeof(*cpuhw)); + cpuhw->mmcr[0] = MMCR0_FC; +} + +extern struct power_pmu ppc970_pmu; +extern struct power_pmu power6_pmu; + +static int init_perf_counters(void) +{ + unsigned long pvr; + + if (reserve_pmc_hardware(perf_counter_interrupt)) { + printk(KERN_ERR "Couldn't init performance monitor subsystem\n"); + return -EBUSY; + } + + /* XXX should get this from cputable */ + pvr = mfspr(SPRN_PVR); + switch (PVR_VER(pvr)) { + case PV_970: + case PV_970FX: + case PV_970MP: + ppmu = &ppc970_pmu; + break; + case 0x3e: + ppmu = &power6_pmu; + break; + } + return 0; +} + +arch_initcall(init_perf_counters); Index: linux-2.6-tip/arch/powerpc/kernel/power6-pmu.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/powerpc/kernel/power6-pmu.c @@ -0,0 +1,283 @@ +/* + * Performance counter support for POWER6 processors. + * + * Copyright 2008-2009 Paul Mackerras, IBM Corporation. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include +#include +#include + +/* + * Bits in event code for POWER6 + */ +#define PM_PMC_SH 20 /* PMC number (1-based) for direct events */ +#define PM_PMC_MSK 0x7 +#define PM_PMC_MSKS (PM_PMC_MSK << PM_PMC_SH) +#define PM_UNIT_SH 16 /* Unit event comes (TTMxSEL encoding) */ +#define PM_UNIT_MSK 0xf +#define PM_UNIT_MSKS (PM_UNIT_MSK << PM_UNIT_SH) +#define PM_LLAV 0x8000 /* Load lookahead match value */ +#define PM_LLA 0x4000 /* Load lookahead match enable */ +#define PM_BYTE_SH 12 /* Byte of event bus to use */ +#define PM_BYTE_MSK 3 +#define PM_SUBUNIT_SH 8 /* Subunit event comes from (NEST_SEL enc.) */ +#define PM_SUBUNIT_MSK 7 +#define PM_SUBUNIT_MSKS (PM_SUBUNIT_MSK << PM_SUBUNIT_SH) +#define PM_PMCSEL_MSK 0xff /* PMCxSEL value */ +#define PM_BUSEVENT_MSK 0xf3700 + +/* + * Bits in MMCR1 for POWER6 + */ +#define MMCR1_TTM0SEL_SH 60 +#define MMCR1_TTMSEL_SH(n) (MMCR1_TTM0SEL_SH - (n) * 4) +#define MMCR1_TTMSEL_MSK 0xf +#define MMCR1_TTMSEL(m, n) (((m) >> MMCR1_TTMSEL_SH(n)) & MMCR1_TTMSEL_MSK) +#define MMCR1_NESTSEL_SH 45 +#define MMCR1_NESTSEL_MSK 0x7 +#define MMCR1_NESTSEL(m) (((m) >> MMCR1_NESTSEL_SH) & MMCR1_NESTSEL_MSK) +#define MMCR1_PMC1_LLA ((u64)1 << 44) +#define MMCR1_PMC1_LLA_VALUE ((u64)1 << 39) +#define MMCR1_PMC1_ADDR_SEL ((u64)1 << 35) +#define MMCR1_PMC1SEL_SH 24 +#define MMCR1_PMCSEL_SH(n) (MMCR1_PMC1SEL_SH - (n) * 8) +#define MMCR1_PMCSEL_MSK 0xff + +/* + * Assign PMC numbers and compute MMCR1 value for a set of events + */ +static int p6_compute_mmcr(unsigned int event[], int n_ev, + unsigned int hwc[], u64 mmcr[]) +{ + u64 mmcr1 = 0; + int i; + unsigned int pmc, ev, b, u, s, psel; + unsigned int ttmset = 0; + unsigned int pmc_inuse = 0; + + if (n_ev > 4) + return -1; + for (i = 0; i < n_ev; ++i) { + pmc = (event[i] >> PM_PMC_SH) & PM_PMC_MSK; + if (pmc) { + if (pmc_inuse & (1 << (pmc - 1))) + return -1; /* collision! */ + pmc_inuse |= 1 << (pmc - 1); + } + } + for (i = 0; i < n_ev; ++i) { + ev = event[i]; + pmc = (ev >> PM_PMC_SH) & PM_PMC_MSK; + if (pmc) { + --pmc; + } else { + /* can go on any PMC; find a free one */ + for (pmc = 0; pmc < 4; ++pmc) + if (!(pmc_inuse & (1 << pmc))) + break; + pmc_inuse |= 1 << pmc; + } + hwc[i] = pmc; + psel = ev & PM_PMCSEL_MSK; + if (ev & PM_BUSEVENT_MSK) { + /* this event uses the event bus */ + b = (ev >> PM_BYTE_SH) & PM_BYTE_MSK; + u = (ev >> PM_UNIT_SH) & PM_UNIT_MSK; + /* check for conflict on this byte of event bus */ + if ((ttmset & (1 << b)) && MMCR1_TTMSEL(mmcr1, b) != u) + return -1; + mmcr1 |= (u64)u << MMCR1_TTMSEL_SH(b); + ttmset |= 1 << b; + if (u == 5) { + /* Nest events have a further mux */ + s = (ev >> PM_SUBUNIT_SH) & PM_SUBUNIT_MSK; + if ((ttmset & 0x10) && + MMCR1_NESTSEL(mmcr1) != s) + return -1; + ttmset |= 0x10; + mmcr1 |= (u64)s << MMCR1_NESTSEL_SH; + } + if (0x30 <= psel && psel <= 0x3d) { + /* these need the PMCx_ADDR_SEL bits */ + if (b >= 2) + mmcr1 |= MMCR1_PMC1_ADDR_SEL >> pmc; + } + /* bus select values are different for PMC3/4 */ + if (pmc >= 2 && (psel & 0x90) == 0x80) + psel ^= 0x20; + } + if (ev & PM_LLA) { + mmcr1 |= MMCR1_PMC1_LLA >> pmc; + if (ev & PM_LLAV) + mmcr1 |= MMCR1_PMC1_LLA_VALUE >> pmc; + } + mmcr1 |= (u64)psel << MMCR1_PMCSEL_SH(pmc); + } + mmcr[0] = 0; + if (pmc_inuse & 1) + mmcr[0] = MMCR0_PMC1CE; + if (pmc_inuse & 0xe) + mmcr[0] |= MMCR0_PMCjCE; + mmcr[1] = mmcr1; + mmcr[2] = 0; + return 0; +} + +/* + * Layout of constraint bits: + * + * 0-1 add field: number of uses of PMC1 (max 1) + * 2-3, 4-5, 6-7: ditto for PMC2, 3, 4 + * 8-10 select field: nest (subunit) event selector + * 16-19 select field: unit on byte 0 of event bus + * 20-23, 24-27, 28-31 ditto for bytes 1, 2, 3 + */ +static int p6_get_constraint(unsigned int event, u64 *maskp, u64 *valp) +{ + int pmc, byte, sh; + unsigned int mask = 0, value = 0; + + pmc = (event >> PM_PMC_SH) & PM_PMC_MSK; + if (pmc) { + if (pmc > 4) + return -1; + sh = (pmc - 1) * 2; + mask |= 2 << sh; + value |= 1 << sh; + } + if (event & PM_BUSEVENT_MSK) { + byte = (event >> PM_BYTE_SH) & PM_BYTE_MSK; + sh = byte * 4; + mask |= PM_UNIT_MSKS << sh; + value |= (event & PM_UNIT_MSKS) << sh; + if ((event & PM_UNIT_MSKS) == (5 << PM_UNIT_SH)) { + mask |= PM_SUBUNIT_MSKS; + value |= event & PM_SUBUNIT_MSKS; + } + } + *maskp = mask; + *valp = value; + return 0; +} + +#define MAX_ALT 4 /* at most 4 alternatives for any event */ + +static const unsigned int event_alternatives[][MAX_ALT] = { + { 0x0130e8, 0x2000f6, 0x3000fc }, /* PM_PTEG_RELOAD_VALID */ + { 0x080080, 0x10000d, 0x30000c, 0x4000f0 }, /* PM_LD_MISS_L1 */ + { 0x080088, 0x200054, 0x3000f0 }, /* PM_ST_MISS_L1 */ + { 0x10000a, 0x2000f4 }, /* PM_RUN_CYC */ + { 0x10000b, 0x2000f5 }, /* PM_RUN_COUNT */ + { 0x10000e, 0x400010 }, /* PM_PURR */ + { 0x100010, 0x4000f8 }, /* PM_FLUSH */ + { 0x10001a, 0x200010 }, /* PM_MRK_INST_DISP */ + { 0x100026, 0x3000f8 }, /* PM_TB_BIT_TRANS */ + { 0x100054, 0x2000f0 }, /* PM_ST_FIN */ + { 0x100056, 0x2000fc }, /* PM_L1_ICACHE_MISS */ + { 0x1000f0, 0x40000a }, /* PM_INST_IMC_MATCH_CMPL */ + { 0x1000f8, 0x200008 }, /* PM_GCT_EMPTY_CYC */ + { 0x1000fc, 0x400006 }, /* PM_LSU_DERAT_MISS_CYC */ + { 0x20000e, 0x400007 }, /* PM_LSU_DERAT_MISS */ + { 0x200012, 0x300012 }, /* PM_INST_DISP */ + { 0x2000f2, 0x3000f2 }, /* PM_INST_DISP */ + { 0x2000f8, 0x300010 }, /* PM_EXT_INT */ + { 0x2000fe, 0x300056 }, /* PM_DATA_FROM_L2MISS */ + { 0x2d0030, 0x30001a }, /* PM_MRK_FPU_FIN */ + { 0x30000a, 0x400018 }, /* PM_MRK_INST_FIN */ + { 0x3000f6, 0x40000e }, /* PM_L1_DCACHE_RELOAD_VALID */ + { 0x3000fe, 0x400056 }, /* PM_DATA_FROM_L3MISS */ +}; + +/* + * This could be made more efficient with a binary search on + * a presorted list, if necessary + */ +static int find_alternatives_list(unsigned int event) +{ + int i, j; + unsigned int alt; + + for (i = 0; i < ARRAY_SIZE(event_alternatives); ++i) { + if (event < event_alternatives[i][0]) + return -1; + for (j = 0; j < MAX_ALT; ++j) { + alt = event_alternatives[i][j]; + if (!alt || event < alt) + break; + if (event == alt) + return i; + } + } + return -1; +} + +static int p6_get_alternatives(unsigned int event, unsigned int alt[]) +{ + int i, j; + unsigned int aevent, psel, pmc; + unsigned int nalt = 1; + + alt[0] = event; + + /* check the alternatives table */ + i = find_alternatives_list(event); + if (i >= 0) { + /* copy out alternatives from list */ + for (j = 0; j < MAX_ALT; ++j) { + aevent = event_alternatives[i][j]; + if (!aevent) + break; + if (aevent != event) + alt[nalt++] = aevent; + } + + } else { + /* Check for alternative ways of computing sum events */ + /* PMCSEL 0x32 counter N == PMCSEL 0x34 counter 5-N */ + psel = event & (PM_PMCSEL_MSK & ~1); /* ignore edge bit */ + pmc = (event >> PM_PMC_SH) & PM_PMC_MSK; + if (pmc && (psel == 0x32 || psel == 0x34)) + alt[nalt++] = ((event ^ 0x6) & ~PM_PMC_MSKS) | + ((5 - pmc) << PM_PMC_SH); + + /* PMCSEL 0x38 counter N == PMCSEL 0x3a counter N+/-2 */ + if (pmc && (psel == 0x38 || psel == 0x3a)) + alt[nalt++] = ((event ^ 0x2) & ~PM_PMC_MSKS) | + ((pmc > 2? pmc - 2: pmc + 2) << PM_PMC_SH); + } + + return nalt; +} + +static void p6_disable_pmc(unsigned int pmc, u64 mmcr[]) +{ + /* Set PMCxSEL to 0 to disable PMCx */ + mmcr[1] &= ~(0xffUL << MMCR1_PMCSEL_SH(pmc)); +} + +static int power6_generic_events[] = { + [PERF_COUNT_CPU_CYCLES] = 0x1e, + [PERF_COUNT_INSTRUCTIONS] = 2, + [PERF_COUNT_CACHE_REFERENCES] = 0x280030, /* LD_REF_L1 */ + [PERF_COUNT_CACHE_MISSES] = 0x30000c, /* LD_MISS_L1 */ + [PERF_COUNT_BRANCH_INSTRUCTIONS] = 0x410a0, /* BR_PRED */ + [PERF_COUNT_BRANCH_MISSES] = 0x400052, /* BR_MPRED */ +}; + +struct power_pmu power6_pmu = { + .n_counter = 4, + .max_alternatives = MAX_ALT, + .add_fields = 0x55, + .test_adder = 0, + .compute_mmcr = p6_compute_mmcr, + .get_constraint = p6_get_constraint, + .get_alternatives = p6_get_alternatives, + .disable_pmc = p6_disable_pmc, + .n_generic = ARRAY_SIZE(power6_generic_events), + .generic_events = power6_generic_events, +}; Index: linux-2.6-tip/arch/powerpc/kernel/ppc970-pmu.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/powerpc/kernel/ppc970-pmu.c @@ -0,0 +1,375 @@ +/* + * Performance counter support for PPC970-family processors. + * + * Copyright 2008-2009 Paul Mackerras, IBM Corporation. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include +#include +#include + +/* + * Bits in event code for PPC970 + */ +#define PM_PMC_SH 12 /* PMC number (1-based) for direct events */ +#define PM_PMC_MSK 0xf +#define PM_UNIT_SH 8 /* TTMMUX number and setting - unit select */ +#define PM_UNIT_MSK 0xf +#define PM_BYTE_SH 4 /* Byte number of event bus to use */ +#define PM_BYTE_MSK 3 +#define PM_PMCSEL_MSK 0xf + +/* Values in PM_UNIT field */ +#define PM_NONE 0 +#define PM_FPU 1 +#define PM_VPU 2 +#define PM_ISU 3 +#define PM_IFU 4 +#define PM_IDU 5 +#define PM_STS 6 +#define PM_LSU0 7 +#define PM_LSU1U 8 +#define PM_LSU1L 9 +#define PM_LASTUNIT 9 + +/* + * Bits in MMCR0 for PPC970 + */ +#define MMCR0_PMC1SEL_SH 8 +#define MMCR0_PMC2SEL_SH 1 +#define MMCR_PMCSEL_MSK 0x1f + +/* + * Bits in MMCR1 for PPC970 + */ +#define MMCR1_TTM0SEL_SH 62 +#define MMCR1_TTM1SEL_SH 59 +#define MMCR1_TTM3SEL_SH 53 +#define MMCR1_TTMSEL_MSK 3 +#define MMCR1_TD_CP_DBG0SEL_SH 50 +#define MMCR1_TD_CP_DBG1SEL_SH 48 +#define MMCR1_TD_CP_DBG2SEL_SH 46 +#define MMCR1_TD_CP_DBG3SEL_SH 44 +#define MMCR1_PMC1_ADDER_SEL_SH 39 +#define MMCR1_PMC2_ADDER_SEL_SH 38 +#define MMCR1_PMC6_ADDER_SEL_SH 37 +#define MMCR1_PMC5_ADDER_SEL_SH 36 +#define MMCR1_PMC8_ADDER_SEL_SH 35 +#define MMCR1_PMC7_ADDER_SEL_SH 34 +#define MMCR1_PMC3_ADDER_SEL_SH 33 +#define MMCR1_PMC4_ADDER_SEL_SH 32 +#define MMCR1_PMC3SEL_SH 27 +#define MMCR1_PMC4SEL_SH 22 +#define MMCR1_PMC5SEL_SH 17 +#define MMCR1_PMC6SEL_SH 12 +#define MMCR1_PMC7SEL_SH 7 +#define MMCR1_PMC8SEL_SH 2 + +static short mmcr1_adder_bits[8] = { + MMCR1_PMC1_ADDER_SEL_SH, + MMCR1_PMC2_ADDER_SEL_SH, + MMCR1_PMC3_ADDER_SEL_SH, + MMCR1_PMC4_ADDER_SEL_SH, + MMCR1_PMC5_ADDER_SEL_SH, + MMCR1_PMC6_ADDER_SEL_SH, + MMCR1_PMC7_ADDER_SEL_SH, + MMCR1_PMC8_ADDER_SEL_SH +}; + +/* + * Bits in MMCRA + */ + +/* + * Layout of constraint bits: + * 6666555555555544444444443333333333222222222211111111110000000000 + * 3210987654321098765432109876543210987654321098765432109876543210 + * <><>[ >[ >[ >< >< >< >< ><><><><><><><><> + * T0T1 UC PS1 PS2 B0 B1 B2 B3 P1P2P3P4P5P6P7P8 + * + * T0 - TTM0 constraint + * 46-47: TTM0SEL value (0=FPU, 2=IFU, 3=VPU) 0xC000_0000_0000 + * + * T1 - TTM1 constraint + * 44-45: TTM1SEL value (0=IDU, 3=STS) 0x3000_0000_0000 + * + * UC - unit constraint: can't have all three of FPU|IFU|VPU, ISU, IDU|STS + * 43: UC3 error 0x0800_0000_0000 + * 42: FPU|IFU|VPU events needed 0x0400_0000_0000 + * 41: ISU events needed 0x0200_0000_0000 + * 40: IDU|STS events needed 0x0100_0000_0000 + * + * PS1 + * 39: PS1 error 0x0080_0000_0000 + * 36-38: count of events needing PMC1/2/5/6 0x0070_0000_0000 + * + * PS2 + * 35: PS2 error 0x0008_0000_0000 + * 32-34: count of events needing PMC3/4/7/8 0x0007_0000_0000 + * + * B0 + * 28-31: Byte 0 event source 0xf000_0000 + * Encoding as for the event code + * + * B1, B2, B3 + * 24-27, 20-23, 16-19: Byte 1, 2, 3 event sources + * + * P1 + * 15: P1 error 0x8000 + * 14-15: Count of events needing PMC1 + * + * P2..P8 + * 0-13: Count of events needing PMC2..PMC8 + */ + +/* Masks and values for using events from the various units */ +static u64 unit_cons[PM_LASTUNIT+1][2] = { + [PM_FPU] = { 0xc80000000000ull, 0x040000000000ull }, + [PM_VPU] = { 0xc80000000000ull, 0xc40000000000ull }, + [PM_ISU] = { 0x080000000000ull, 0x020000000000ull }, + [PM_IFU] = { 0xc80000000000ull, 0x840000000000ull }, + [PM_IDU] = { 0x380000000000ull, 0x010000000000ull }, + [PM_STS] = { 0x380000000000ull, 0x310000000000ull }, +}; + +static int p970_get_constraint(unsigned int event, u64 *maskp, u64 *valp) +{ + int pmc, byte, unit, sh; + u64 mask = 0, value = 0; + int grp = -1; + + pmc = (event >> PM_PMC_SH) & PM_PMC_MSK; + if (pmc) { + if (pmc > 8) + return -1; + sh = (pmc - 1) * 2; + mask |= 2 << sh; + value |= 1 << sh; + grp = ((pmc - 1) >> 1) & 1; + } + unit = (event >> PM_UNIT_SH) & PM_UNIT_MSK; + if (unit) { + if (unit > PM_LASTUNIT) + return -1; + mask |= unit_cons[unit][0]; + value |= unit_cons[unit][1]; + byte = (event >> PM_BYTE_SH) & PM_BYTE_MSK; + /* + * Bus events on bytes 0 and 2 can be counted + * on PMC1/2/5/6; bytes 1 and 3 on PMC3/4/7/8. + */ + if (!pmc) + grp = byte & 1; + /* Set byte lane select field */ + mask |= 0xfULL << (28 - 4 * byte); + value |= (u64)unit << (28 - 4 * byte); + } + if (grp == 0) { + /* increment PMC1/2/5/6 field */ + mask |= 0x8000000000ull; + value |= 0x1000000000ull; + } else if (grp == 1) { + /* increment PMC3/4/7/8 field */ + mask |= 0x800000000ull; + value |= 0x100000000ull; + } + *maskp = mask; + *valp = value; + return 0; +} + +static int p970_get_alternatives(unsigned int event, unsigned int alt[]) +{ + alt[0] = event; + + /* 2 alternatives for LSU empty */ + if (event == 0x2002 || event == 0x3002) { + alt[1] = event ^ 0x1000; + return 2; + } + + return 1; +} + +static int p970_compute_mmcr(unsigned int event[], int n_ev, + unsigned int hwc[], u64 mmcr[]) +{ + u64 mmcr0 = 0, mmcr1 = 0, mmcra = 0; + unsigned int pmc, unit, byte, psel; + unsigned int ttm, grp; + unsigned int pmc_inuse = 0; + unsigned int pmc_grp_use[2]; + unsigned char busbyte[4]; + unsigned char unituse[16]; + unsigned char unitmap[] = { 0, 0<<3, 3<<3, 1<<3, 2<<3, 0|4, 3|4 }; + unsigned char ttmuse[2]; + unsigned char pmcsel[8]; + int i; + + if (n_ev > 8) + return -1; + + /* First pass to count resource use */ + pmc_grp_use[0] = pmc_grp_use[1] = 0; + memset(busbyte, 0, sizeof(busbyte)); + memset(unituse, 0, sizeof(unituse)); + for (i = 0; i < n_ev; ++i) { + pmc = (event[i] >> PM_PMC_SH) & PM_PMC_MSK; + if (pmc) { + if (pmc_inuse & (1 << (pmc - 1))) + return -1; + pmc_inuse |= 1 << (pmc - 1); + /* count 1/2/5/6 vs 3/4/7/8 use */ + ++pmc_grp_use[((pmc - 1) >> 1) & 1]; + } + unit = (event[i] >> PM_UNIT_SH) & PM_UNIT_MSK; + byte = (event[i] >> PM_BYTE_SH) & PM_BYTE_MSK; + if (unit) { + if (unit > PM_LASTUNIT) + return -1; + if (!pmc) + ++pmc_grp_use[byte & 1]; + if (busbyte[byte] && busbyte[byte] != unit) + return -1; + busbyte[byte] = unit; + unituse[unit] = 1; + } + } + if (pmc_grp_use[0] > 4 || pmc_grp_use[1] > 4) + return -1; + + /* + * Assign resources and set multiplexer selects. + * + * PM_ISU can go either on TTM0 or TTM1, but that's the only + * choice we have to deal with. + */ + if (unituse[PM_ISU] & + (unituse[PM_FPU] | unituse[PM_IFU] | unituse[PM_VPU])) + unitmap[PM_ISU] = 2 | 4; /* move ISU to TTM1 */ + /* Set TTM[01]SEL fields. */ + ttmuse[0] = ttmuse[1] = 0; + for (i = PM_FPU; i <= PM_STS; ++i) { + if (!unituse[i]) + continue; + ttm = unitmap[i]; + ++ttmuse[(ttm >> 2) & 1]; + mmcr1 |= (u64)(ttm & ~4) << MMCR1_TTM1SEL_SH; + } + /* Check only one unit per TTMx */ + if (ttmuse[0] > 1 || ttmuse[1] > 1) + return -1; + + /* Set byte lane select fields and TTM3SEL. */ + for (byte = 0; byte < 4; ++byte) { + unit = busbyte[byte]; + if (!unit) + continue; + if (unit <= PM_STS) + ttm = (unitmap[unit] >> 2) & 1; + else if (unit == PM_LSU0) + ttm = 2; + else { + ttm = 3; + if (unit == PM_LSU1L && byte >= 2) + mmcr1 |= 1ull << (MMCR1_TTM3SEL_SH + 3 - byte); + } + mmcr1 |= (u64)ttm << (MMCR1_TD_CP_DBG0SEL_SH - 2 * byte); + } + + /* Second pass: assign PMCs, set PMCxSEL and PMCx_ADDER_SEL fields */ + memset(pmcsel, 0x8, sizeof(pmcsel)); /* 8 means don't count */ + for (i = 0; i < n_ev; ++i) { + pmc = (event[i] >> PM_PMC_SH) & PM_PMC_MSK; + unit = (event[i] >> PM_UNIT_SH) & PM_UNIT_MSK; + byte = (event[i] >> PM_BYTE_SH) & PM_BYTE_MSK; + psel = event[i] & PM_PMCSEL_MSK; + if (!pmc) { + /* Bus event or any-PMC direct event */ + if (unit) + psel |= 0x10 | ((byte & 2) << 2); + else + psel |= 8; + for (pmc = 0; pmc < 8; ++pmc) { + if (pmc_inuse & (1 << pmc)) + continue; + grp = (pmc >> 1) & 1; + if (unit) { + if (grp == (byte & 1)) + break; + } else if (pmc_grp_use[grp] < 4) { + ++pmc_grp_use[grp]; + break; + } + } + pmc_inuse |= 1 << pmc; + } else { + /* Direct event */ + --pmc; + if (psel == 0 && (byte & 2)) + /* add events on higher-numbered bus */ + mmcr1 |= 1ull << mmcr1_adder_bits[pmc]; + } + pmcsel[pmc] = psel; + hwc[i] = pmc; + } + for (pmc = 0; pmc < 2; ++pmc) + mmcr0 |= pmcsel[pmc] << (MMCR0_PMC1SEL_SH - 7 * pmc); + for (; pmc < 8; ++pmc) + mmcr1 |= (u64)pmcsel[pmc] << (MMCR1_PMC3SEL_SH - 5 * (pmc - 2)); + if (pmc_inuse & 1) + mmcr0 |= MMCR0_PMC1CE; + if (pmc_inuse & 0xfe) + mmcr0 |= MMCR0_PMCjCE; + + mmcra |= 0x2000; /* mark only one IOP per PPC instruction */ + + /* Return MMCRx values */ + mmcr[0] = mmcr0; + mmcr[1] = mmcr1; + mmcr[2] = mmcra; + return 0; +} + +static void p970_disable_pmc(unsigned int pmc, u64 mmcr[]) +{ + int shift, i; + + if (pmc <= 1) { + shift = MMCR0_PMC1SEL_SH - 7 * pmc; + i = 0; + } else { + shift = MMCR1_PMC3SEL_SH - 5 * (pmc - 2); + i = 1; + } + /* + * Setting the PMCxSEL field to 0x08 disables PMC x. + */ + mmcr[i] = (mmcr[i] & ~(0x1fUL << shift)) | (0x08UL << shift); +} + +static int ppc970_generic_events[] = { + [PERF_COUNT_CPU_CYCLES] = 7, + [PERF_COUNT_INSTRUCTIONS] = 1, + [PERF_COUNT_CACHE_REFERENCES] = 0x8810, /* PM_LD_REF_L1 */ + [PERF_COUNT_CACHE_MISSES] = 0x3810, /* PM_LD_MISS_L1 */ + [PERF_COUNT_BRANCH_INSTRUCTIONS] = 0x431, /* PM_BR_ISSUED */ + [PERF_COUNT_BRANCH_MISSES] = 0x327, /* PM_GRP_BR_MPRED */ +}; + +struct power_pmu ppc970_pmu = { + .n_counter = 8, + .max_alternatives = 2, + .add_fields = 0x001100005555ull, + .test_adder = 0x013300000000ull, + .compute_mmcr = p970_compute_mmcr, + .get_constraint = p970_get_constraint, + .get_alternatives = p970_get_alternatives, + .disable_pmc = p970_disable_pmc, + .n_generic = ARRAY_SIZE(ppc970_generic_events), + .generic_events = ppc970_generic_events, +}; Index: linux-2.6-tip/arch/powerpc/kernel/vmlinux.lds.S =================================================================== --- linux-2.6-tip.orig/arch/powerpc/kernel/vmlinux.lds.S +++ linux-2.6-tip/arch/powerpc/kernel/vmlinux.lds.S @@ -184,6 +184,7 @@ SECTIONS . = ALIGN(PAGE_SIZE); .data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) { __per_cpu_start = .; + *(.data.percpu.page_aligned) *(.data.percpu) *(.data.percpu.shared_aligned) __per_cpu_end = .; Index: linux-2.6-tip/arch/powerpc/platforms/Kconfig.cputype =================================================================== --- linux-2.6-tip.orig/arch/powerpc/platforms/Kconfig.cputype +++ linux-2.6-tip/arch/powerpc/platforms/Kconfig.cputype @@ -1,6 +1,7 @@ config PPC64 bool "64-bit kernel" default n + select HAVE_PERF_COUNTERS help This option selects whether a 32-bit or a 64-bit kernel will be built. Index: linux-2.6-tip/arch/powerpc/platforms/cell/interrupt.c =================================================================== --- linux-2.6-tip.orig/arch/powerpc/platforms/cell/interrupt.c +++ linux-2.6-tip/arch/powerpc/platforms/cell/interrupt.c @@ -254,7 +254,7 @@ static void handle_iic_irq(unsigned int goto out_eoi; } - kstat_cpu(cpu).irqs[irq]++; + kstat_incr_irqs_this_cpu(irq, desc); /* Mark the IRQ currently in progress.*/ desc->status |= IRQ_INPROGRESS; Index: linux-2.6-tip/arch/powerpc/platforms/cell/spufs/sched.c =================================================================== --- linux-2.6-tip.orig/arch/powerpc/platforms/cell/spufs/sched.c +++ linux-2.6-tip/arch/powerpc/platforms/cell/spufs/sched.c @@ -508,7 +508,7 @@ static void __spu_add_to_rq(struct spu_c list_add_tail(&ctx->rq, &spu_prio->runq[ctx->prio]); set_bit(ctx->prio, spu_prio->bitmap); if (!spu_prio->nr_waiting++) - __mod_timer(&spusched_timer, jiffies + SPUSCHED_TICK); + mod_timer(&spusched_timer, jiffies + SPUSCHED_TICK); } } Index: linux-2.6-tip/arch/powerpc/platforms/pseries/xics.c =================================================================== --- linux-2.6-tip.orig/arch/powerpc/platforms/pseries/xics.c +++ linux-2.6-tip/arch/powerpc/platforms/pseries/xics.c @@ -153,9 +153,10 @@ static int get_irq_server(unsigned int v { int server; /* For the moment only implement delivery to all cpus or one cpu */ - cpumask_t cpumask = irq_desc[virq].affinity; + cpumask_t cpumask; cpumask_t tmp = CPU_MASK_NONE; + cpumask_copy(&cpumask, irq_desc[virq].affinity); if (!distribute_irqs) return default_server; @@ -869,7 +870,7 @@ void xics_migrate_irqs_away(void) virq, cpu); /* Reset affinity to all cpus */ - irq_desc[virq].affinity = CPU_MASK_ALL; + cpumask_setall(irq_desc[virq].affinity); desc->chip->set_affinity(virq, cpu_all_mask); unlock: spin_unlock_irqrestore(&desc->lock, flags); Index: linux-2.6-tip/arch/powerpc/sysdev/mpic.c =================================================================== --- linux-2.6-tip.orig/arch/powerpc/sysdev/mpic.c +++ linux-2.6-tip/arch/powerpc/sysdev/mpic.c @@ -566,9 +566,10 @@ static void __init mpic_scan_ht_pics(str #ifdef CONFIG_SMP static int irq_choose_cpu(unsigned int virt_irq) { - cpumask_t mask = irq_desc[virt_irq].affinity; + cpumask_t mask; int cpuid; + cpumask_copy(&mask, irq_desc[virt_irq].affinity); if (cpus_equal(mask, CPU_MASK_ALL)) { static int irq_rover; static DEFINE_SPINLOCK(irq_rover_lock); Index: linux-2.6-tip/arch/sh/kernel/irq.c =================================================================== --- linux-2.6-tip.orig/arch/sh/kernel/irq.c +++ linux-2.6-tip/arch/sh/kernel/irq.c @@ -51,7 +51,7 @@ int show_interrupts(struct seq_file *p, goto unlock; seq_printf(p, "%3d: ",i); for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); seq_printf(p, " %14s", irq_desc[i].chip->name); seq_printf(p, "-%-8s", irq_desc[i].name); seq_printf(p, " %s", action->name); Index: linux-2.6-tip/arch/sparc/kernel/irq_64.c =================================================================== --- linux-2.6-tip.orig/arch/sparc/kernel/irq_64.c +++ linux-2.6-tip/arch/sparc/kernel/irq_64.c @@ -185,7 +185,7 @@ int show_interrupts(struct seq_file *p, seq_printf(p, "%10u ", kstat_irqs(i)); #else for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #endif seq_printf(p, " %9s", irq_desc[i].chip->typename); seq_printf(p, " %s", action->name); @@ -252,9 +252,10 @@ struct irq_handler_data { #ifdef CONFIG_SMP static int irq_choose_cpu(unsigned int virt_irq) { - cpumask_t mask = irq_desc[virt_irq].affinity; + cpumask_t mask; int cpuid; + cpumask_copy(&mask, irq_desc[virt_irq].affinity); if (cpus_equal(mask, CPU_MASK_ALL)) { static int irq_rover; static DEFINE_SPINLOCK(irq_rover_lock); @@ -796,7 +797,7 @@ void fixup_irqs(void) !(irq_desc[irq].status & IRQ_PER_CPU)) { if (irq_desc[irq].chip->set_affinity) irq_desc[irq].chip->set_affinity(irq, - &irq_desc[irq].affinity); + irq_desc[irq].affinity); } spin_unlock_irqrestore(&irq_desc[irq].lock, flags); } Index: linux-2.6-tip/arch/sparc/kernel/time_64.c =================================================================== --- linux-2.6-tip.orig/arch/sparc/kernel/time_64.c +++ linux-2.6-tip/arch/sparc/kernel/time_64.c @@ -36,10 +36,10 @@ #include #include #include +#include #include #include -#include #include #include #include @@ -729,7 +729,7 @@ void timer_interrupt(int irq, struct pt_ irq_enter(); - kstat_this_cpu.irqs[0]++; + kstat_incr_irqs_this_cpu(0, irq_to_desc(0)); if (unlikely(!evt->event_handler)) { printk(KERN_WARNING Index: linux-2.6-tip/arch/um/kernel/irq.c =================================================================== --- linux-2.6-tip.orig/arch/um/kernel/irq.c +++ linux-2.6-tip/arch/um/kernel/irq.c @@ -42,7 +42,7 @@ int show_interrupts(struct seq_file *p, seq_printf(p, "%10u ", kstat_irqs(i)); #else for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #endif seq_printf(p, " %14s", irq_desc[i].chip->typename); seq_printf(p, " %s", action->name); Index: linux-2.6-tip/arch/x86/Kconfig =================================================================== --- linux-2.6-tip.orig/arch/x86/Kconfig +++ linux-2.6-tip/arch/x86/Kconfig @@ -5,7 +5,7 @@ mainmenu "Linux Kernel Configuration for config 64BIT bool "64-bit kernel" if ARCH = "x86" default ARCH = "x86_64" - help + ---help--- Say yes to build a 64-bit kernel - formerly known as x86_64 Say no to build a 32-bit kernel - formerly known as i386 @@ -34,8 +34,9 @@ config X86 select HAVE_FUNCTION_TRACER select HAVE_FUNCTION_GRAPH_TRACER select HAVE_FUNCTION_TRACE_MCOUNT_TEST - select HAVE_KVM if ((X86_32 && !X86_VOYAGER && !X86_VISWS && !X86_NUMAQ) || X86_64) - select HAVE_ARCH_KGDB if !X86_VOYAGER + select HAVE_FTRACE_NMI_ENTER if DYNAMIC_FTRACE + select HAVE_KVM + select HAVE_ARCH_KGDB select HAVE_ARCH_TRACEHOOK select HAVE_GENERIC_DMA_COHERENT if X86_32 select HAVE_EFFICIENT_UNALIGNED_ACCESS @@ -108,10 +109,18 @@ config ARCH_MAY_HAVE_PC_FDC def_bool y config RWSEM_GENERIC_SPINLOCK - def_bool !X86_XADD + bool + depends on !X86_XADD || PREEMPT_RT + default y + +config ASM_SEMAPHORES + bool + default y config RWSEM_XCHGADD_ALGORITHM - def_bool X86_XADD + bool + depends on X86_XADD && !RWSEM_GENERIC_SPINLOCK && !PREEMPT_RT + default y config ARCH_HAS_CPU_IDLE_WAIT def_bool y @@ -133,18 +142,16 @@ config ARCH_HAS_CACHE_LINE_SIZE def_bool y config HAVE_SETUP_PER_CPU_AREA - def_bool X86_64_SMP || (X86_SMP && !X86_VOYAGER) + def_bool y config HAVE_CPUMASK_OF_CPU_MAP def_bool X86_64_SMP config ARCH_HIBERNATION_POSSIBLE def_bool y - depends on !SMP || !X86_VOYAGER config ARCH_SUSPEND_POSSIBLE def_bool y - depends on !X86_VOYAGER config ZONE_DMA32 bool @@ -174,11 +181,6 @@ config GENERIC_PENDING_IRQ depends on GENERIC_HARDIRQS && SMP default y -config X86_SMP - bool - depends on SMP && ((X86_32 && !X86_VOYAGER) || X86_64) - default y - config USE_GENERIC_SMP_HELPERS def_bool y depends on SMP @@ -194,19 +196,17 @@ config X86_64_SMP config X86_HT bool depends on SMP - depends on (X86_32 && !X86_VOYAGER) || X86_64 - default y - -config X86_BIOS_REBOOT - bool - depends on !X86_VOYAGER default y config X86_TRAMPOLINE bool - depends on X86_SMP || (X86_VOYAGER && SMP) || (64BIT && ACPI_SLEEP) + depends on SMP || (64BIT && ACPI_SLEEP) default y +config X86_32_LAZY_GS + def_bool y + depends on X86_32 && !CC_STACKPROTECTOR + config KTIME_SCALAR def_bool X86_32 source "init/Kconfig" @@ -244,14 +244,24 @@ config SMP If you don't know what to do here, say N. -config X86_HAS_BOOT_CPU_ID - def_bool y - depends on X86_VOYAGER +config X86_X2APIC + bool "Support x2apic" + depends on X86_LOCAL_APIC && X86_64 + ---help--- + This enables x2apic support on CPUs that have this feature. + + This allows 32-bit apic IDs (so it can support very large systems), + and accesses the local apic via MSRs not via mmio. + + ( On certain CPU models you may need to enable INTR_REMAP too, + to get functional x2apic mode. ) + + If you don't know what to do here, say N. config SPARSE_IRQ bool "Support sparse irq numbering" depends on PCI_MSI || HT_IRQ - help + ---help--- This enables support for sparse irqs. This is useful for distro kernels that want to define a high CONFIG_NR_CPUS value but still want to have low kernel memory footprint on smaller machines. @@ -265,114 +275,140 @@ config NUMA_MIGRATE_IRQ_DESC bool "Move irq desc when changing irq smp_affinity" depends on SPARSE_IRQ && NUMA default n - help + ---help--- This enables moving irq_desc to cpu/node that irq will use handled. If you don't know what to do here, say N. -config X86_FIND_SMP_CONFIG - def_bool y - depends on X86_MPPARSE || X86_VOYAGER - config X86_MPPARSE bool "Enable MPS table" if ACPI default y depends on X86_LOCAL_APIC - help + ---help--- For old smp systems that do not have proper acpi support. Newer systems (esp with 64bit cpus) with acpi support, MADT and DSDT will override it -choice - prompt "Subarchitecture Type" - default X86_PC +config X86_BIGSMP + bool "Support for big SMP systems with more than 8 CPUs" + depends on X86_32 && SMP + ---help--- + This option is needed for the systems that have more than 8 CPUs -config X86_PC - bool "PC-compatible" - help - Choose this option if your computer is a standard PC or compatible. +if X86_32 +config X86_EXTENDED_PLATFORM + bool "Support for extended (non-PC) x86 platforms" + default y + ---help--- + If you disable this option then the kernel will only support + standard PC platforms. (which covers the vast majority of + systems out there.) + + If you enable this option then you'll be able to select support + for the following (non-PC) 32 bit x86 platforms: + AMD Elan + NUMAQ (IBM/Sequent) + RDC R-321x SoC + SGI 320/540 (Visual Workstation) + Summit/EXA (IBM x440) + Unisys ES7000 IA32 series + + If you have one of these systems, or if you want to build a + generic distribution kernel, say Y here - otherwise say N. +endif + +if X86_64 +config X86_EXTENDED_PLATFORM + bool "Support for extended (non-PC) x86 platforms" + default y + ---help--- + If you disable this option then the kernel will only support + standard PC platforms. (which covers the vast majority of + systems out there.) + + If you enable this option then you'll be able to select support + for the following (non-PC) 64 bit x86 platforms: + ScaleMP vSMP + SGI Ultraviolet + + If you have one of these systems, or if you want to build a + generic distribution kernel, say Y here - otherwise say N. +endif +# This is an alphabetically sorted list of 64 bit extended platforms +# Please maintain the alphabetic order if and when there are additions + +config X86_VSMP + bool "ScaleMP vSMP" + select PARAVIRT + depends on X86_64 && PCI + depends on X86_EXTENDED_PLATFORM + ---help--- + Support for ScaleMP vSMP systems. Say 'Y' here if this kernel is + supposed to run on these EM64T-based machines. Only choose this option + if you have one of these machines. + +config X86_UV + bool "SGI Ultraviolet" + depends on X86_64 + depends on X86_EXTENDED_PLATFORM + select X86_X2APIC + ---help--- + This option is needed in order to support SGI Ultraviolet systems. + If you don't have one of these, you should say N here. + +# Following is an alphabetically sorted list of 32 bit extended platforms +# Please maintain the alphabetic order if and when there are additions config X86_ELAN bool "AMD Elan" depends on X86_32 - help + depends on X86_EXTENDED_PLATFORM + ---help--- Select this for an AMD Elan processor. Do not use this option for K6/Athlon/Opteron processors! If unsure, choose "PC-compatible" instead. -config X86_VOYAGER - bool "Voyager (NCR)" - depends on X86_32 && (SMP || BROKEN) && !PCI - help - Voyager is an MCA-based 32-way capable SMP architecture proprietary - to NCR Corp. Machine classes 345x/35xx/4100/51xx are Voyager-based. - - *** WARNING *** - - If you do not specifically know you have a Voyager based machine, - say N here, otherwise the kernel you build will not be bootable. - -config X86_GENERICARCH - bool "Generic architecture" +config X86_RDC321X + bool "RDC R-321x SoC" depends on X86_32 - help - This option compiles in the NUMAQ, Summit, bigsmp, ES7000, default + depends on X86_EXTENDED_PLATFORM + select M486 + select X86_REBOOTFIXUPS + ---help--- + This option is needed for RDC R-321x system-on-chip, also known + as R-8610-(G). + If you don't have one of these chips, you should say N here. + +config X86_32_NON_STANDARD + bool "Support non-standard 32-bit SMP architectures" + depends on X86_32 && SMP + depends on X86_EXTENDED_PLATFORM + ---help--- + This option compiles in the NUMAQ, Summit, bigsmp, ES7000, default subarchitectures. It is intended for a generic binary kernel. if you select them all, kernel will probe it one by one. and will fallback to default. -if X86_GENERICARCH +# Alphabetically sorted list of Non standard 32 bit platforms config X86_NUMAQ bool "NUMAQ (IBM/Sequent)" - depends on SMP && X86_32 && PCI && X86_MPPARSE + depends on X86_32_NON_STANDARD select NUMA - help + select X86_MPPARSE + ---help--- This option is used for getting Linux to run on a NUMAQ (IBM/Sequent) NUMA multiquad box. This changes the way that processors are bootstrapped, and uses Clustered Logical APIC addressing mode instead of Flat Logical. You will need a new lynxer.elf file to flash your firmware with - send email to . -config X86_SUMMIT - bool "Summit/EXA (IBM x440)" - depends on X86_32 && SMP - help - This option is needed for IBM systems that use the Summit/EXA chipset. - In particular, it is needed for the x440. - -config X86_ES7000 - bool "Support for Unisys ES7000 IA32 series" - depends on X86_32 && SMP - help - Support for Unisys ES7000 systems. Say 'Y' here if this kernel is - supposed to run on an IA32-based Unisys ES7000 system. - -config X86_BIGSMP - bool "Support for big SMP systems with more than 8 CPUs" - depends on X86_32 && SMP - help - This option is needed for the systems that have more than 8 CPUs - and if the system is not of any sub-arch type above. - -endif - -config X86_VSMP - bool "Support for ScaleMP vSMP" - select PARAVIRT - depends on X86_64 && PCI - help - Support for ScaleMP vSMP systems. Say 'Y' here if this kernel is - supposed to run on these EM64T-based machines. Only choose this option - if you have one of these machines. - -endchoice - config X86_VISWS bool "SGI 320/540 (Visual Workstation)" - depends on X86_32 && PCI && !X86_VOYAGER && X86_MPPARSE && PCI_GODIRECT - help + depends on X86_32 && PCI && X86_MPPARSE && PCI_GODIRECT + depends on X86_32_NON_STANDARD + ---help--- The SGI Visual Workstation series is an IA32-based workstation based on SGI systems chips with some legacy PC hardware attached. @@ -381,21 +417,25 @@ config X86_VISWS A kernel compiled for the Visual Workstation will run on general PCs as well. See for details. -config X86_RDC321X - bool "RDC R-321x SoC" - depends on X86_32 - select M486 - select X86_REBOOTFIXUPS - help - This option is needed for RDC R-321x system-on-chip, also known - as R-8610-(G). - If you don't have one of these chips, you should say N here. +config X86_SUMMIT + bool "Summit/EXA (IBM x440)" + depends on X86_32_NON_STANDARD + ---help--- + This option is needed for IBM systems that use the Summit/EXA chipset. + In particular, it is needed for the x440. + +config X86_ES7000 + bool "Unisys ES7000 IA32 series" + depends on X86_32_NON_STANDARD && X86_BIGSMP + ---help--- + Support for Unisys ES7000 systems. Say 'Y' here if this kernel is + supposed to run on an IA32-based Unisys ES7000 system. config SCHED_OMIT_FRAME_POINTER def_bool y prompt "Single-depth WCHAN output" depends on X86 - help + ---help--- Calculate simpler /proc//wchan values. If this option is disabled then wchan values will recurse back to the caller function. This provides more accurate wchan values, @@ -405,7 +445,7 @@ config SCHED_OMIT_FRAME_POINTER menuconfig PARAVIRT_GUEST bool "Paravirtualized guest support" - help + ---help--- Say Y here to get to see options related to running Linux under various hypervisors. This option alone does not add any kernel code. @@ -419,8 +459,7 @@ config VMI bool "VMI Guest support" select PARAVIRT depends on X86_32 - depends on !X86_VOYAGER - help + ---help--- VMI provides a paravirtualized interface to the VMware ESX server (it could be used by other hypervisors in theory too, but is not at the moment), by linking the kernel to a GPL-ed ROM module @@ -430,8 +469,7 @@ config KVM_CLOCK bool "KVM paravirtualized clock" select PARAVIRT select PARAVIRT_CLOCK - depends on !X86_VOYAGER - help + ---help--- Turning on this option will allow you to run a paravirtualized clock when running over the KVM hypervisor. Instead of relying on a PIT (or probably other) emulation by the underlying device model, the host @@ -441,17 +479,15 @@ config KVM_CLOCK config KVM_GUEST bool "KVM Guest support" select PARAVIRT - depends on !X86_VOYAGER - help - This option enables various optimizations for running under the KVM - hypervisor. + ---help--- + This option enables various optimizations for running under the KVM + hypervisor. source "arch/x86/lguest/Kconfig" config PARAVIRT bool "Enable paravirtualization code" - depends on !X86_VOYAGER - help + ---help--- This changes the kernel so it can modify itself when it is run under a hypervisor, potentially improving performance significantly over full virtualization. However, when run without a hypervisor @@ -464,51 +500,51 @@ config PARAVIRT_CLOCK endif config PARAVIRT_DEBUG - bool "paravirt-ops debugging" - depends on PARAVIRT && DEBUG_KERNEL - help - Enable to debug paravirt_ops internals. Specifically, BUG if - a paravirt_op is missing when it is called. + bool "paravirt-ops debugging" + depends on PARAVIRT && DEBUG_KERNEL + ---help--- + Enable to debug paravirt_ops internals. Specifically, BUG if + a paravirt_op is missing when it is called. config MEMTEST bool "Memtest" - help + ---help--- This option adds a kernel parameter 'memtest', which allows memtest to be set. - memtest=0, mean disabled; -- default - memtest=1, mean do 1 test pattern; - ... - memtest=4, mean do 4 test patterns. + memtest=0, mean disabled; -- default + memtest=1, mean do 1 test pattern; + ... + memtest=4, mean do 4 test patterns. If you are unsure how to answer this question, answer N. config X86_SUMMIT_NUMA def_bool y - depends on X86_32 && NUMA && X86_GENERICARCH + depends on X86_32 && NUMA && X86_32_NON_STANDARD config X86_CYCLONE_TIMER def_bool y - depends on X86_GENERICARCH + depends on X86_32_NON_STANDARD source "arch/x86/Kconfig.cpu" config HPET_TIMER def_bool X86_64 prompt "HPET Timer Support" if X86_32 - help - Use the IA-PC HPET (High Precision Event Timer) to manage - time in preference to the PIT and RTC, if a HPET is - present. - HPET is the next generation timer replacing legacy 8254s. - The HPET provides a stable time base on SMP - systems, unlike the TSC, but it is more expensive to access, - as it is off-chip. You can find the HPET spec at - . - - You can safely choose Y here. However, HPET will only be - activated if the platform and the BIOS support this feature. - Otherwise the 8254 will be used for timing services. + ---help--- + Use the IA-PC HPET (High Precision Event Timer) to manage + time in preference to the PIT and RTC, if a HPET is + present. + HPET is the next generation timer replacing legacy 8254s. + The HPET provides a stable time base on SMP + systems, unlike the TSC, but it is more expensive to access, + as it is off-chip. You can find the HPET spec at + . + + You can safely choose Y here. However, HPET will only be + activated if the platform and the BIOS support this feature. + Otherwise the 8254 will be used for timing services. - Choose N to continue using the legacy 8254 timer. + Choose N to continue using the legacy 8254 timer. config HPET_EMULATE_RTC def_bool y @@ -519,7 +555,7 @@ config HPET_EMULATE_RTC config DMI default y bool "Enable DMI scanning" if EMBEDDED - help + ---help--- Enabled scanning of DMI to identify machine quirks. Say Y here unless you have verified that your setup is not affected by entries in the DMI blacklist. Required by PNP @@ -531,7 +567,7 @@ config GART_IOMMU select SWIOTLB select AGP depends on X86_64 && PCI - help + ---help--- Support for full DMA access of devices with 32bit memory access only on systems with more than 3GB. This is usually needed for USB, sound, many IDE/SATA chipsets and some other devices. @@ -546,7 +582,7 @@ config CALGARY_IOMMU bool "IBM Calgary IOMMU support" select SWIOTLB depends on X86_64 && PCI && EXPERIMENTAL - help + ---help--- Support for hardware IOMMUs in IBM's xSeries x366 and x460 systems. Needed to run systems with more than 3GB of memory properly with 32-bit PCI devices that do not support DAC @@ -564,7 +600,7 @@ config CALGARY_IOMMU_ENABLED_BY_DEFAULT def_bool y prompt "Should Calgary be enabled by default?" depends on CALGARY_IOMMU - help + ---help--- Should Calgary be enabled by default? if you choose 'y', Calgary will be used (if it exists). If you choose 'n', Calgary will not be used even if it exists. If you choose 'n' and would like to use @@ -576,7 +612,7 @@ config AMD_IOMMU select SWIOTLB select PCI_MSI depends on X86_64 && PCI && ACPI - help + ---help--- With this option you can enable support for AMD IOMMU hardware in your system. An IOMMU is a hardware component which provides remapping of DMA memory accesses from devices. With an AMD IOMMU you @@ -591,7 +627,7 @@ config AMD_IOMMU_STATS bool "Export AMD IOMMU statistics to debugfs" depends on AMD_IOMMU select DEBUG_FS - help + ---help--- This option enables code in the AMD IOMMU driver to collect various statistics about whats happening in the driver and exports that information to userspace via debugfs. @@ -600,7 +636,7 @@ config AMD_IOMMU_STATS # need this always selected by IOMMU for the VIA workaround config SWIOTLB def_bool y if X86_64 - help + ---help--- Support for software bounce buffers used on x86-64 systems which don't have a hardware IOMMU (e.g. the current generation of Intel's x86-64 CPUs). Using this PCI devices which can only @@ -618,7 +654,7 @@ config MAXSMP depends on X86_64 && SMP && DEBUG_KERNEL && EXPERIMENTAL select CPUMASK_OFFSTACK default n - help + ---help--- Configure maximum number of CPUS and NUMA Nodes for this architecture. If unsure, say N. @@ -629,7 +665,7 @@ config NR_CPUS default "4096" if MAXSMP default "32" if SMP && (X86_NUMAQ || X86_SUMMIT || X86_BIGSMP || X86_ES7000) default "8" if SMP - help + ---help--- This allows you to specify the maximum number of CPUs which this kernel will support. The maximum supported value is 512 and the minimum value which makes sense is 2. @@ -640,7 +676,7 @@ config NR_CPUS config SCHED_SMT bool "SMT (Hyperthreading) scheduler support" depends on X86_HT - help + ---help--- SMT scheduler support improves the CPU scheduler's decision making when dealing with Intel Pentium 4 chips with HyperThreading at a cost of slightly increased overhead in some places. If unsure say @@ -650,7 +686,7 @@ config SCHED_MC def_bool y prompt "Multi-core scheduler support" depends on X86_HT - help + ---help--- Multi-core scheduler support improves the CPU scheduler's decision making when dealing with multi-core CPU chips at a cost of slightly increased overhead in some places. If unsure say N here. @@ -659,8 +695,8 @@ source "kernel/Kconfig.preempt" config X86_UP_APIC bool "Local APIC support on uniprocessors" - depends on X86_32 && !SMP && !(X86_VOYAGER || X86_GENERICARCH) - help + depends on X86_32 && !SMP && !X86_32_NON_STANDARD + ---help--- A local APIC (Advanced Programmable Interrupt Controller) is an integrated interrupt controller in the CPU. If you have a single-CPU system which has a processor with a local APIC, you can say Y here to @@ -673,7 +709,7 @@ config X86_UP_APIC config X86_UP_IOAPIC bool "IO-APIC support on uniprocessors" depends on X86_UP_APIC - help + ---help--- An IO-APIC (I/O Advanced Programmable Interrupt Controller) is an SMP-capable replacement for PC-style interrupt controllers. Most SMP systems and many recent uniprocessor systems have one. @@ -684,11 +720,12 @@ config X86_UP_IOAPIC config X86_LOCAL_APIC def_bool y - depends on X86_64 || (X86_32 && (X86_UP_APIC || (SMP && !X86_VOYAGER) || X86_GENERICARCH)) + depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC + select HAVE_PERF_COUNTERS if (!M386 && !M486) config X86_IO_APIC def_bool y - depends on X86_64 || (X86_32 && (X86_UP_IOAPIC || (SMP && !X86_VOYAGER) || X86_GENERICARCH)) + depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC config X86_VISWS_APIC def_bool y @@ -698,7 +735,7 @@ config X86_REROUTE_FOR_BROKEN_BOOT_IRQS bool "Reroute for broken boot IRQs" default n depends on X86_IO_APIC - help + ---help--- This option enables a workaround that fixes a source of spurious interrupts. This is recommended when threaded interrupt handling is used on systems where the generation of @@ -720,7 +757,6 @@ config X86_REROUTE_FOR_BROKEN_BOOT_IRQS config X86_MCE bool "Machine Check Exception" - depends on !X86_VOYAGER ---help--- Machine Check Exception support allows the processor to notify the kernel if it detects a problem (e.g. overheating, component failure). @@ -739,7 +775,7 @@ config X86_MCE_INTEL def_bool y prompt "Intel MCE features" depends on X86_64 && X86_MCE && X86_LOCAL_APIC - help + ---help--- Additional support for intel specific MCE features such as the thermal monitor. @@ -747,14 +783,14 @@ config X86_MCE_AMD def_bool y prompt "AMD MCE features" depends on X86_64 && X86_MCE && X86_LOCAL_APIC - help + ---help--- Additional support for AMD specific MCE features such as the DRAM Error Threshold. config X86_MCE_NONFATAL tristate "Check for non-fatal errors on AMD Athlon/Duron / Intel Pentium 4" depends on X86_32 && X86_MCE - help + ---help--- Enabling this feature starts a timer that triggers every 5 seconds which will look at the machine check registers to see if anything happened. Non-fatal problems automatically get corrected (but still logged). @@ -767,7 +803,7 @@ config X86_MCE_NONFATAL config X86_MCE_P4THERMAL bool "check for P4 thermal throttling interrupt." depends on X86_32 && X86_MCE && (X86_UP_APIC || SMP) - help + ---help--- Enabling this feature will cause a message to be printed when the P4 enters thermal throttling. @@ -775,11 +811,11 @@ config VM86 bool "Enable VM86 support" if EMBEDDED default y depends on X86_32 - help - This option is required by programs like DOSEMU to run 16-bit legacy + ---help--- + This option is required by programs like DOSEMU to run 16-bit legacy code on X86 processors. It also may be needed by software like - XFree86 to initialize some video cards via BIOS. Disabling this - option saves about 6k. + XFree86 to initialize some video cards via BIOS. Disabling this + option saves about 6k. config TOSHIBA tristate "Toshiba Laptop support" @@ -853,33 +889,33 @@ config MICROCODE module will be called microcode. config MICROCODE_INTEL - bool "Intel microcode patch loading support" - depends on MICROCODE - default MICROCODE - select FW_LOADER - --help--- - This options enables microcode patch loading support for Intel - processors. - - For latest news and information on obtaining all the required - Intel ingredients for this driver, check: - . + bool "Intel microcode patch loading support" + depends on MICROCODE + default MICROCODE + select FW_LOADER + ---help--- + This options enables microcode patch loading support for Intel + processors. + + For latest news and information on obtaining all the required + Intel ingredients for this driver, check: + . config MICROCODE_AMD - bool "AMD microcode patch loading support" - depends on MICROCODE - select FW_LOADER - --help--- - If you select this option, microcode patch loading support for AMD - processors will be enabled. + bool "AMD microcode patch loading support" + depends on MICROCODE + select FW_LOADER + ---help--- + If you select this option, microcode patch loading support for AMD + processors will be enabled. - config MICROCODE_OLD_INTERFACE +config MICROCODE_OLD_INTERFACE def_bool y depends on MICROCODE config X86_MSR tristate "/dev/cpu/*/msr - Model-specific register support" - help + ---help--- This device gives privileged processes access to the x86 Model-Specific Registers (MSRs). It is a character device with major 202 and minors 0 to 31 for /dev/cpu/0/msr to /dev/cpu/31/msr. @@ -888,7 +924,7 @@ config X86_MSR config X86_CPUID tristate "/dev/cpu/*/cpuid - CPU information support" - help + ---help--- This device gives processes access to the x86 CPUID instruction to be executed on a specific processor. It is a character device with major 203 and minors 0 to 31 for /dev/cpu/0/cpuid to @@ -940,7 +976,7 @@ config NOHIGHMEM config HIGHMEM4G bool "4GB" depends on !X86_NUMAQ - help + ---help--- Select this if you have a 32-bit processor and between 1 and 4 gigabytes of physical RAM. @@ -948,7 +984,7 @@ config HIGHMEM64G bool "64GB" depends on !M386 && !M486 select X86_PAE - help + ---help--- Select this if you have a 32-bit processor and more than 4 gigabytes of physical RAM. @@ -959,7 +995,7 @@ choice prompt "Memory split" if EMBEDDED default VMSPLIT_3G depends on X86_32 - help + ---help--- Select the desired split between kernel and user memory. If the address range available to the kernel is less than the @@ -1005,20 +1041,20 @@ config HIGHMEM config X86_PAE bool "PAE (Physical Address Extension) Support" depends on X86_32 && !HIGHMEM4G - help + ---help--- PAE is required for NX support, and furthermore enables larger swapspace support for non-overcommit purposes. It has the cost of more pagetable lookup overhead, and also consumes more pagetable space per process. config ARCH_PHYS_ADDR_T_64BIT - def_bool X86_64 || X86_PAE + def_bool X86_64 || X86_PAE config DIRECT_GBPAGES bool "Enable 1GB pages for kernel pagetables" if EMBEDDED default y depends on X86_64 - help + ---help--- Allow the kernel linear mapping to use 1GB pages on CPUs that support it. This can improve the kernel's performance a tiny bit by reducing TLB pressure. If in doubt, say "Y". @@ -1028,9 +1064,8 @@ config NUMA bool "Numa Memory Allocation and Scheduler Support" depends on SMP depends on X86_64 || (X86_32 && HIGHMEM64G && (X86_NUMAQ || X86_BIGSMP || X86_SUMMIT && ACPI) && EXPERIMENTAL) - default n if X86_PC default y if (X86_NUMAQ || X86_SUMMIT || X86_BIGSMP) - help + ---help--- Enable NUMA (Non Uniform Memory Access) support. The kernel will try to allocate memory used by a CPU on the @@ -1053,19 +1088,19 @@ config K8_NUMA def_bool y prompt "Old style AMD Opteron NUMA detection" depends on X86_64 && NUMA && PCI - help - Enable K8 NUMA node topology detection. You should say Y here if - you have a multi processor AMD K8 system. This uses an old - method to read the NUMA configuration directly from the builtin - Northbridge of Opteron. It is recommended to use X86_64_ACPI_NUMA - instead, which also takes priority if both are compiled in. + ---help--- + Enable K8 NUMA node topology detection. You should say Y here if + you have a multi processor AMD K8 system. This uses an old + method to read the NUMA configuration directly from the builtin + Northbridge of Opteron. It is recommended to use X86_64_ACPI_NUMA + instead, which also takes priority if both are compiled in. config X86_64_ACPI_NUMA def_bool y prompt "ACPI NUMA detection" depends on X86_64 && NUMA && ACPI && PCI select ACPI_NUMA - help + ---help--- Enable ACPI SRAT based node topology detection. # Some NUMA nodes have memory ranges that span @@ -1080,7 +1115,7 @@ config NODES_SPAN_OTHER_NODES config NUMA_EMU bool "NUMA emulation" depends on X86_64 && NUMA - help + ---help--- Enable NUMA emulation. A flat machine will be split into virtual nodes when booted with "numa=fake=N", where N is the number of nodes. This is only useful for debugging. @@ -1093,7 +1128,7 @@ config NODES_SHIFT default "4" if X86_NUMAQ default "3" depends on NEED_MULTIPLE_NODES - help + ---help--- Specify the maximum number of NUMA Nodes available on the target system. Increases memory reserved to accomodate various tables. @@ -1131,7 +1166,7 @@ config ARCH_SPARSEMEM_DEFAULT config ARCH_SPARSEMEM_ENABLE def_bool y - depends on X86_64 || NUMA || (EXPERIMENTAL && X86_PC) || X86_GENERICARCH + depends on X86_64 || NUMA || (EXPERIMENTAL && X86_32) || X86_32_NON_STANDARD select SPARSEMEM_STATIC if X86_32 select SPARSEMEM_VMEMMAP_ENABLE if X86_64 @@ -1143,66 +1178,71 @@ config ARCH_MEMORY_PROBE def_bool X86_64 depends on MEMORY_HOTPLUG +config ILLEGAL_POINTER_VALUE + hex + default 0 if X86_32 + default 0xdead000000000000 if X86_64 + source "mm/Kconfig" config HIGHPTE bool "Allocate 3rd-level pagetables from highmem" depends on X86_32 && (HIGHMEM4G || HIGHMEM64G) - help + ---help--- The VM uses one page table entry for each page of physical memory. For systems with a lot of RAM, this can be wasteful of precious low memory. Setting this option will put user-space page table entries in high memory. config X86_CHECK_BIOS_CORRUPTION - bool "Check for low memory corruption" - help - Periodically check for memory corruption in low memory, which - is suspected to be caused by BIOS. Even when enabled in the - configuration, it is disabled at runtime. Enable it by - setting "memory_corruption_check=1" on the kernel command - line. By default it scans the low 64k of memory every 60 - seconds; see the memory_corruption_check_size and - memory_corruption_check_period parameters in - Documentation/kernel-parameters.txt to adjust this. - - When enabled with the default parameters, this option has - almost no overhead, as it reserves a relatively small amount - of memory and scans it infrequently. It both detects corruption - and prevents it from affecting the running system. - - It is, however, intended as a diagnostic tool; if repeatable - BIOS-originated corruption always affects the same memory, - you can use memmap= to prevent the kernel from using that - memory. + bool "Check for low memory corruption" + ---help--- + Periodically check for memory corruption in low memory, which + is suspected to be caused by BIOS. Even when enabled in the + configuration, it is disabled at runtime. Enable it by + setting "memory_corruption_check=1" on the kernel command + line. By default it scans the low 64k of memory every 60 + seconds; see the memory_corruption_check_size and + memory_corruption_check_period parameters in + Documentation/kernel-parameters.txt to adjust this. + + When enabled with the default parameters, this option has + almost no overhead, as it reserves a relatively small amount + of memory and scans it infrequently. It both detects corruption + and prevents it from affecting the running system. + + It is, however, intended as a diagnostic tool; if repeatable + BIOS-originated corruption always affects the same memory, + you can use memmap= to prevent the kernel from using that + memory. config X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK - bool "Set the default setting of memory_corruption_check" + bool "Set the default setting of memory_corruption_check" depends on X86_CHECK_BIOS_CORRUPTION default y - help - Set whether the default state of memory_corruption_check is - on or off. + ---help--- + Set whether the default state of memory_corruption_check is + on or off. config X86_RESERVE_LOW_64K - bool "Reserve low 64K of RAM on AMI/Phoenix BIOSen" + bool "Reserve low 64K of RAM on AMI/Phoenix BIOSen" default y - help - Reserve the first 64K of physical RAM on BIOSes that are known - to potentially corrupt that memory range. A numbers of BIOSes are - known to utilize this area during suspend/resume, so it must not - be used by the kernel. - - Set this to N if you are absolutely sure that you trust the BIOS - to get all its memory reservations and usages right. - - If you have doubts about the BIOS (e.g. suspend/resume does not - work or there's kernel crashes after certain hardware hotplug - events) and it's not AMI or Phoenix, then you might want to enable - X86_CHECK_BIOS_CORRUPTION=y to allow the kernel to check typical - corruption patterns. + ---help--- + Reserve the first 64K of physical RAM on BIOSes that are known + to potentially corrupt that memory range. A numbers of BIOSes are + known to utilize this area during suspend/resume, so it must not + be used by the kernel. + + Set this to N if you are absolutely sure that you trust the BIOS + to get all its memory reservations and usages right. + + If you have doubts about the BIOS (e.g. suspend/resume does not + work or there's kernel crashes after certain hardware hotplug + events) and it's not AMI or Phoenix, then you might want to enable + X86_CHECK_BIOS_CORRUPTION=y to allow the kernel to check typical + corruption patterns. - Say Y if unsure. + Say Y if unsure. config MATH_EMULATION bool @@ -1268,7 +1308,7 @@ config MTRR_SANITIZER def_bool y prompt "MTRR cleanup support" depends on MTRR - help + ---help--- Convert MTRR layout from continuous to discrete, so X drivers can add writeback entries. @@ -1283,7 +1323,7 @@ config MTRR_SANITIZER_ENABLE_DEFAULT range 0 1 default "0" depends on MTRR_SANITIZER - help + ---help--- Enable mtrr cleanup default value config MTRR_SANITIZER_SPARE_REG_NR_DEFAULT @@ -1291,7 +1331,7 @@ config MTRR_SANITIZER_SPARE_REG_NR_DEFAU range 0 7 default "1" depends on MTRR_SANITIZER - help + ---help--- mtrr cleanup spare entries default, it can be changed via mtrr_spare_reg_nr=N on the kernel command line. @@ -1299,7 +1339,7 @@ config X86_PAT bool prompt "x86 PAT support" depends on MTRR - help + ---help--- Use PAT attributes to setup page level cache control. PATs are the modern equivalents of MTRRs and are much more @@ -1314,20 +1354,20 @@ config EFI bool "EFI runtime service support" depends on ACPI ---help--- - This enables the kernel to use EFI runtime services that are - available (such as the EFI variable services). + This enables the kernel to use EFI runtime services that are + available (such as the EFI variable services). - This option is only useful on systems that have EFI firmware. - In addition, you should use the latest ELILO loader available - at in order to take advantage - of EFI runtime services. However, even with this option, the - resultant kernel should continue to boot on existing non-EFI - platforms. + This option is only useful on systems that have EFI firmware. + In addition, you should use the latest ELILO loader available + at in order to take advantage + of EFI runtime services. However, even with this option, the + resultant kernel should continue to boot on existing non-EFI + platforms. config SECCOMP def_bool y prompt "Enable seccomp to safely compute untrusted bytecode" - help + ---help--- This kernel feature is useful for number crunching applications that may need to compute untrusted bytecode during their execution. By using pipes or other transports made available to @@ -1340,13 +1380,16 @@ config SECCOMP If unsure, say Y. Only embedded should say N here. +config CC_STACKPROTECTOR_ALL + bool + config CC_STACKPROTECTOR bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)" - depends on X86_64 && EXPERIMENTAL && BROKEN - help - This option turns on the -fstack-protector GCC feature. This - feature puts, at the beginning of critical functions, a canary - value on the stack just before the return address, and validates + select CC_STACKPROTECTOR_ALL + ---help--- + This option turns on the -fstack-protector GCC feature. This + feature puts, at the beginning of functions, a canary value on + the stack just before the return address, and validates the value just before actually returning. Stack based buffer overflows (that need to overwrite this return address) now also overwrite the canary, which gets detected and the attack is then @@ -1354,22 +1397,14 @@ config CC_STACKPROTECTOR This feature requires gcc version 4.2 or above, or a distribution gcc with the feature backported. Older versions are automatically - detected and for those versions, this configuration option is ignored. - -config CC_STACKPROTECTOR_ALL - bool "Use stack-protector for all functions" - depends on CC_STACKPROTECTOR - help - Normally, GCC only inserts the canary value protection for - functions that use large-ish on-stack buffers. By enabling - this option, GCC will be asked to do this for ALL functions. + detected and for those versions, this configuration option is + ignored. (and a warning is printed during bootup) source kernel/Kconfig.hz config KEXEC bool "kexec system call" - depends on X86_BIOS_REBOOT - help + ---help--- kexec is a system call that implements the ability to shutdown your current kernel, and to start another kernel. It is like a reboot but it is independent of the system firmware. And like a reboot @@ -1386,7 +1421,7 @@ config KEXEC config CRASH_DUMP bool "kernel crash dumps" depends on X86_64 || (X86_32 && HIGHMEM) - help + ---help--- Generate crash dump after being started by kexec. This should be normally only set in special crash dump kernels which are loaded in the main kernel with kexec-tools into @@ -1401,7 +1436,7 @@ config KEXEC_JUMP bool "kexec jump (EXPERIMENTAL)" depends on EXPERIMENTAL depends on KEXEC && HIBERNATION && X86_32 - help + ---help--- Jump between original kernel and kexeced kernel and invoke code in physical address mode via KEXEC @@ -1410,7 +1445,7 @@ config PHYSICAL_START default "0x1000000" if X86_NUMAQ default "0x200000" if X86_64 default "0x100000" - help + ---help--- This gives the physical address where the kernel is loaded. If kernel is a not relocatable (CONFIG_RELOCATABLE=n) then @@ -1451,7 +1486,7 @@ config PHYSICAL_START config RELOCATABLE bool "Build a relocatable kernel (EXPERIMENTAL)" depends on EXPERIMENTAL - help + ---help--- This builds a kernel image that retains relocation information so it can be loaded someplace besides the default 1MB. The relocations tend to make the kernel binary about 10% larger, @@ -1471,7 +1506,7 @@ config PHYSICAL_ALIGN default "0x100000" if X86_32 default "0x200000" if X86_64 range 0x2000 0x400000 - help + ---help--- This value puts the alignment restrictions on physical address where kernel is loaded and run from. Kernel is compiled for an address which meets above alignment restriction. @@ -1492,7 +1527,7 @@ config PHYSICAL_ALIGN config HOTPLUG_CPU bool "Support for hot-pluggable CPUs" - depends on SMP && HOTPLUG && !X86_VOYAGER + depends on SMP && HOTPLUG ---help--- Say Y here to allow turning CPUs off and on. CPUs can be controlled through /sys/devices/system/cpu. @@ -1504,7 +1539,7 @@ config COMPAT_VDSO def_bool y prompt "Compat VDSO support" depends on X86_32 || IA32_EMULATION - help + ---help--- Map the 32-bit VDSO to the predictable old-style address too. ---help--- Say N here if you are running a sufficiently recent glibc @@ -1516,7 +1551,7 @@ config COMPAT_VDSO config CMDLINE_BOOL bool "Built-in kernel command line" default n - help + ---help--- Allow for specifying boot arguments to the kernel at build time. On some systems (e.g. embedded ones), it is necessary or convenient to provide some or all of the @@ -1534,7 +1569,7 @@ config CMDLINE string "Built-in kernel command string" depends on CMDLINE_BOOL default "" - help + ---help--- Enter arguments here that should be compiled into the kernel image and used at boot time. If the boot loader provides a command line at boot time, it is appended to this string to @@ -1551,7 +1586,7 @@ config CMDLINE_OVERRIDE bool "Built-in command line overrides boot loader arguments" default n depends on CMDLINE_BOOL - help + ---help--- Set this option to 'Y' to have the kernel ignore the boot loader command line, and use ONLY the built-in command line. @@ -1572,8 +1607,11 @@ config HAVE_ARCH_EARLY_PFN_TO_NID def_bool X86_64 depends on NUMA +config HARDIRQS_SW_RESEND + bool + default y + menu "Power management and ACPI options" - depends on !X86_VOYAGER config ARCH_HIBERNATION_HEADER def_bool y @@ -1651,7 +1689,7 @@ if APM config APM_IGNORE_USER_SUSPEND bool "Ignore USER SUSPEND" - help + ---help--- This option will ignore USER SUSPEND requests. On machines with a compliant APM BIOS, you want to say N. However, on the NEC Versa M series notebooks, it is necessary to say Y because of a BIOS bug. @@ -1675,7 +1713,7 @@ config APM_DO_ENABLE config APM_CPU_IDLE bool "Make CPU Idle calls when idle" - help + ---help--- Enable calls to APM CPU Idle/CPU Busy inside the kernel's idle loop. On some machines, this can activate improved power savings, such as a slowed CPU clock rate, when the machine is idle. These idle calls @@ -1686,7 +1724,7 @@ config APM_CPU_IDLE config APM_DISPLAY_BLANK bool "Enable console blanking using APM" - help + ---help--- Enable console blanking using the APM. Some laptops can use this to turn off the LCD backlight when the screen blanker of the Linux virtual console blanks the screen. Note that this is only used by @@ -1699,7 +1737,7 @@ config APM_DISPLAY_BLANK config APM_ALLOW_INTS bool "Allow interrupts during APM BIOS calls" - help + ---help--- Normally we disable external interrupts while we are making calls to the APM BIOS as a measure to lessen the effects of a badly behaving BIOS implementation. The BIOS should reenable interrupts if it @@ -1724,7 +1762,7 @@ config PCI bool "PCI support" default y select ARCH_SUPPORTS_MSI if (X86_LOCAL_APIC && X86_IO_APIC) - help + ---help--- Find out whether you have a PCI motherboard. PCI is the name of a bus system, i.e. the way the CPU talks to the other stuff inside your box. Other bus systems are ISA, EISA, MicroChannel (MCA) or @@ -1795,7 +1833,7 @@ config PCI_MMCONFIG config DMAR bool "Support for DMA Remapping Devices (EXPERIMENTAL)" depends on X86_64 && PCI_MSI && ACPI && EXPERIMENTAL - help + ---help--- DMA remapping (DMAR) devices support enables independent address translations for Direct Memory Access (DMA) from devices. These DMA remapping devices are reported via ACPI tables @@ -1817,29 +1855,30 @@ config DMAR_GFX_WA def_bool y prompt "Support for Graphics workaround" depends on DMAR - help - Current Graphics drivers tend to use physical address - for DMA and avoid using DMA APIs. Setting this config - option permits the IOMMU driver to set a unity map for - all the OS-visible memory. Hence the driver can continue - to use physical addresses for DMA. + ---help--- + Current Graphics drivers tend to use physical address + for DMA and avoid using DMA APIs. Setting this config + option permits the IOMMU driver to set a unity map for + all the OS-visible memory. Hence the driver can continue + to use physical addresses for DMA. config DMAR_FLOPPY_WA def_bool y depends on DMAR - help - Floppy disk drivers are know to bypass DMA API calls - thereby failing to work when IOMMU is enabled. This - workaround will setup a 1:1 mapping for the first - 16M to make floppy (an ISA device) work. + ---help--- + Floppy disk drivers are know to bypass DMA API calls + thereby failing to work when IOMMU is enabled. This + workaround will setup a 1:1 mapping for the first + 16M to make floppy (an ISA device) work. config INTR_REMAP bool "Support for Interrupt Remapping (EXPERIMENTAL)" depends on X86_64 && X86_IO_APIC && PCI_MSI && ACPI && EXPERIMENTAL - help - Supports Interrupt remapping for IO-APIC and MSI devices. - To use x2apic mode in the CPU's which support x2APIC enhancements or - to support platforms with CPU's having > 8 bit APIC ID, say Y. + select X86_X2APIC + ---help--- + Supports Interrupt remapping for IO-APIC and MSI devices. + To use x2apic mode in the CPU's which support x2APIC enhancements or + to support platforms with CPU's having > 8 bit APIC ID, say Y. source "drivers/pci/pcie/Kconfig" @@ -1853,8 +1892,7 @@ if X86_32 config ISA bool "ISA support" - depends on !X86_VOYAGER - help + ---help--- Find out whether you have ISA slots on your motherboard. ISA is the name of a bus system, i.e. the way the CPU talks to the other stuff inside your box. Other bus systems are PCI, EISA, MicroChannel @@ -1880,9 +1918,8 @@ config EISA source "drivers/eisa/Kconfig" config MCA - bool "MCA support" if !X86_VOYAGER - default y if X86_VOYAGER - help + bool "MCA support" + ---help--- MicroChannel Architecture is found in some IBM PS/2 machines and laptops. It is a bus system similar to PCI or ISA. See (and especially the web page given @@ -1892,8 +1929,7 @@ source "drivers/mca/Kconfig" config SCx200 tristate "NatSemi SCx200 support" - depends on !X86_VOYAGER - help + ---help--- This provides basic support for National Semiconductor's (now AMD's) Geode processors. The driver probes for the PCI-IDs of several on-chip devices, so its a good dependency @@ -1905,7 +1941,7 @@ config SCx200HR_TIMER tristate "NatSemi SCx200 27MHz High-Resolution Timer Support" depends on SCx200 && GENERIC_TIME default y - help + ---help--- This driver provides a clocksource built upon the on-chip 27MHz high-resolution timer. Its also a workaround for NSC Geode SC-1100's buggy TSC, which loses time when the @@ -1916,7 +1952,7 @@ config GEODE_MFGPT_TIMER def_bool y prompt "Geode Multi-Function General Purpose Timer (MFGPT) events" depends on MGEODE_LX && GENERIC_TIME && GENERIC_CLOCKEVENTS - help + ---help--- This driver provides a clock event source based on the MFGPT timer(s) in the CS5535 and CS5536 companion chip for the geode. MFGPTs have a better resolution and max interval than the @@ -1925,7 +1961,7 @@ config GEODE_MFGPT_TIMER config OLPC bool "One Laptop Per Child support" default n - help + ---help--- Add support for detecting the unique features of the OLPC XO hardware. @@ -1950,16 +1986,16 @@ config IA32_EMULATION bool "IA32 Emulation" depends on X86_64 select COMPAT_BINFMT_ELF - help + ---help--- Include code to run 32-bit programs under a 64-bit kernel. You should likely turn this on, unless you're 100% sure that you don't have any 32-bit programs left. config IA32_AOUT - tristate "IA32 a.out support" - depends on IA32_EMULATION - help - Support old a.out binaries in the 32bit emulation. + tristate "IA32 a.out support" + depends on IA32_EMULATION + ---help--- + Support old a.out binaries in the 32bit emulation. config COMPAT def_bool y Index: linux-2.6-tip/arch/x86/Kconfig.cpu =================================================================== --- linux-2.6-tip.orig/arch/x86/Kconfig.cpu +++ linux-2.6-tip/arch/x86/Kconfig.cpu @@ -50,7 +50,7 @@ config M386 config M486 bool "486" depends on X86_32 - help + ---help--- Select this for a 486 series processor, either Intel or one of the compatible processors from AMD, Cyrix, IBM, or Intel. Includes DX, DX2, and DX4 variants; also SL/SLC/SLC2/SLC3/SX/SX2 and UMC U5D or @@ -59,7 +59,7 @@ config M486 config M586 bool "586/K5/5x86/6x86/6x86MX" depends on X86_32 - help + ---help--- Select this for an 586 or 686 series processor such as the AMD K5, the Cyrix 5x86, 6x86 and 6x86MX. This choice does not assume the RDTSC (Read Time Stamp Counter) instruction. @@ -67,21 +67,21 @@ config M586 config M586TSC bool "Pentium-Classic" depends on X86_32 - help + ---help--- Select this for a Pentium Classic processor with the RDTSC (Read Time Stamp Counter) instruction for benchmarking. config M586MMX bool "Pentium-MMX" depends on X86_32 - help + ---help--- Select this for a Pentium with the MMX graphics/multimedia extended instructions. config M686 bool "Pentium-Pro" depends on X86_32 - help + ---help--- Select this for Intel Pentium Pro chips. This enables the use of Pentium Pro extended instructions, and disables the init-time guard against the f00f bug found in earlier Pentiums. @@ -89,7 +89,7 @@ config M686 config MPENTIUMII bool "Pentium-II/Celeron(pre-Coppermine)" depends on X86_32 - help + ---help--- Select this for Intel chips based on the Pentium-II and pre-Coppermine Celeron core. This option enables an unaligned copy optimization, compiles the kernel with optimization flags @@ -99,7 +99,7 @@ config MPENTIUMII config MPENTIUMIII bool "Pentium-III/Celeron(Coppermine)/Pentium-III Xeon" depends on X86_32 - help + ---help--- Select this for Intel chips based on the Pentium-III and Celeron-Coppermine core. This option enables use of some extended prefetch instructions in addition to the Pentium II @@ -108,14 +108,14 @@ config MPENTIUMIII config MPENTIUMM bool "Pentium M" depends on X86_32 - help + ---help--- Select this for Intel Pentium M (not Pentium-4 M) notebook chips. config MPENTIUM4 bool "Pentium-4/Celeron(P4-based)/Pentium-4 M/older Xeon" depends on X86_32 - help + ---help--- Select this for Intel Pentium 4 chips. This includes the Pentium 4, Pentium D, P4-based Celeron and Xeon, and Pentium-4 M (not Pentium M) chips. This option enables compile @@ -151,7 +151,7 @@ config MPENTIUM4 config MK6 bool "K6/K6-II/K6-III" depends on X86_32 - help + ---help--- Select this for an AMD K6-family processor. Enables use of some extended instructions, and passes appropriate optimization flags to GCC. @@ -159,14 +159,14 @@ config MK6 config MK7 bool "Athlon/Duron/K7" depends on X86_32 - help + ---help--- Select this for an AMD Athlon K7-family processor. Enables use of some extended instructions, and passes appropriate optimization flags to GCC. config MK8 bool "Opteron/Athlon64/Hammer/K8" - help + ---help--- Select this for an AMD Opteron or Athlon64 Hammer-family processor. Enables use of some extended instructions, and passes appropriate optimization flags to GCC. @@ -174,7 +174,7 @@ config MK8 config MCRUSOE bool "Crusoe" depends on X86_32 - help + ---help--- Select this for a Transmeta Crusoe processor. Treats the processor like a 586 with TSC, and sets some GCC optimization flags (like a Pentium Pro with no alignment requirements). @@ -182,13 +182,13 @@ config MCRUSOE config MEFFICEON bool "Efficeon" depends on X86_32 - help + ---help--- Select this for a Transmeta Efficeon processor. config MWINCHIPC6 bool "Winchip-C6" depends on X86_32 - help + ---help--- Select this for an IDT Winchip C6 chip. Linux and GCC treat this chip as a 586TSC with some extended instructions and alignment requirements. @@ -196,7 +196,7 @@ config MWINCHIPC6 config MWINCHIP3D bool "Winchip-2/Winchip-2A/Winchip-3" depends on X86_32 - help + ---help--- Select this for an IDT Winchip-2, 2A or 3. Linux and GCC treat this chip as a 586TSC with some extended instructions and alignment requirements. Also enable out of order memory @@ -206,19 +206,19 @@ config MWINCHIP3D config MGEODEGX1 bool "GeodeGX1" depends on X86_32 - help + ---help--- Select this for a Geode GX1 (Cyrix MediaGX) chip. config MGEODE_LX bool "Geode GX/LX" depends on X86_32 - help + ---help--- Select this for AMD Geode GX and LX processors. config MCYRIXIII bool "CyrixIII/VIA-C3" depends on X86_32 - help + ---help--- Select this for a Cyrix III or C3 chip. Presently Linux and GCC treat this chip as a generic 586. Whilst the CPU is 686 class, it lacks the cmov extension which gcc assumes is present when @@ -230,7 +230,7 @@ config MCYRIXIII config MVIAC3_2 bool "VIA C3-2 (Nehemiah)" depends on X86_32 - help + ---help--- Select this for a VIA C3 "Nehemiah". Selecting this enables usage of SSE and tells gcc to treat the CPU as a 686. Note, this kernel will not boot on older (pre model 9) C3s. @@ -238,14 +238,14 @@ config MVIAC3_2 config MVIAC7 bool "VIA C7" depends on X86_32 - help + ---help--- Select this for a VIA C7. Selecting this uses the correct cache shift and tells gcc to treat the CPU as a 686. config MPSC bool "Intel P4 / older Netburst based Xeon" depends on X86_64 - help + ---help--- Optimize for Intel Pentium 4, Pentium D and older Nocona/Dempsey Xeon CPUs with Intel 64bit which is compatible with x86-64. Note that the latest Xeons (Xeon 51xx and 53xx) are not based on the @@ -255,7 +255,7 @@ config MPSC config MCORE2 bool "Core 2/newer Xeon" - help + ---help--- Select this for Intel Core 2 and newer Core 2 Xeons (Xeon 51xx and 53xx) CPUs. You can distinguish newer from older Xeons by the CPU @@ -265,7 +265,7 @@ config MCORE2 config GENERIC_CPU bool "Generic-x86-64" depends on X86_64 - help + ---help--- Generic x86-64 CPU. Run equally well on all x86-64 CPUs. @@ -274,7 +274,7 @@ endchoice config X86_GENERIC bool "Generic x86 support" depends on X86_32 - help + ---help--- Instead of just including optimizations for the selected x86 variant (e.g. PII, Crusoe or Athlon), include some more generic optimizations as well. This will make the kernel @@ -294,25 +294,23 @@ config X86_CPU # Define implied options from the CPU selection here config X86_L1_CACHE_BYTES int - default "128" if GENERIC_CPU || MPSC - default "64" if MK8 || MCORE2 - depends on X86_64 + default "128" if MPSC + default "64" if GENERIC_CPU || MK8 || MCORE2 || X86_32 config X86_INTERNODE_CACHE_BYTES int default "4096" if X86_VSMP default X86_L1_CACHE_BYTES if !X86_VSMP - depends on X86_64 config X86_CMPXCHG def_bool X86_64 || (X86_32 && !M386) config X86_L1_CACHE_SHIFT int - default "7" if MPENTIUM4 || X86_GENERIC || GENERIC_CPU || MPSC + default "7" if MPENTIUM4 || MPSC default "4" if X86_ELAN || M486 || M386 || MGEODEGX1 default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX - default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MVIAC7 + default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MVIAC7 || X86_GENERIC || GENERIC_CPU config X86_XADD def_bool y @@ -321,7 +319,7 @@ config X86_XADD config X86_PPRO_FENCE bool "PentiumPro memory ordering errata workaround" depends on M686 || M586MMX || M586TSC || M586 || M486 || M386 || MGEODEGX1 - help + ---help--- Old PentiumPro multiprocessor systems had errata that could cause memory operations to violate the x86 ordering standard in rare cases. Enabling this option will attempt to work around some (but not all) @@ -414,14 +412,14 @@ config X86_DEBUGCTLMSR menuconfig PROCESSOR_SELECT bool "Supported processor vendors" if EMBEDDED - help + ---help--- This lets you choose what x86 vendor support code your kernel will include. config CPU_SUP_INTEL default y bool "Support Intel processors" if PROCESSOR_SELECT - help + ---help--- This enables detection, tunings and quirks for Intel processors You need this enabled if you want your kernel to run on an @@ -435,7 +433,7 @@ config CPU_SUP_CYRIX_32 default y bool "Support Cyrix processors" if PROCESSOR_SELECT depends on !64BIT - help + ---help--- This enables detection, tunings and quirks for Cyrix processors You need this enabled if you want your kernel to run on a @@ -448,7 +446,7 @@ config CPU_SUP_CYRIX_32 config CPU_SUP_AMD default y bool "Support AMD processors" if PROCESSOR_SELECT - help + ---help--- This enables detection, tunings and quirks for AMD processors You need this enabled if you want your kernel to run on an @@ -462,7 +460,7 @@ config CPU_SUP_CENTAUR_32 default y bool "Support Centaur processors" if PROCESSOR_SELECT depends on !64BIT - help + ---help--- This enables detection, tunings and quirks for Centaur processors You need this enabled if you want your kernel to run on a @@ -476,7 +474,7 @@ config CPU_SUP_CENTAUR_64 default y bool "Support Centaur processors" if PROCESSOR_SELECT depends on 64BIT - help + ---help--- This enables detection, tunings and quirks for Centaur processors You need this enabled if you want your kernel to run on a @@ -490,7 +488,7 @@ config CPU_SUP_TRANSMETA_32 default y bool "Support Transmeta processors" if PROCESSOR_SELECT depends on !64BIT - help + ---help--- This enables detection, tunings and quirks for Transmeta processors You need this enabled if you want your kernel to run on a @@ -504,7 +502,7 @@ config CPU_SUP_UMC_32 default y bool "Support UMC processors" if PROCESSOR_SELECT depends on !64BIT - help + ---help--- This enables detection, tunings and quirks for UMC processors You need this enabled if you want your kernel to run on a @@ -523,7 +521,7 @@ config X86_PTRACE_BTS bool "Branch Trace Store" default y depends on X86_DEBUGCTLMSR - help + ---help--- This adds a ptrace interface to the hardware's branch trace store. Debuggers may use it to collect an execution trace of the debugged Index: linux-2.6-tip/arch/x86/Kconfig.debug =================================================================== --- linux-2.6-tip.orig/arch/x86/Kconfig.debug +++ linux-2.6-tip/arch/x86/Kconfig.debug @@ -7,7 +7,7 @@ source "lib/Kconfig.debug" config STRICT_DEVMEM bool "Filter access to /dev/mem" - help + ---help--- If this option is disabled, you allow userspace (root) access to all of memory, including kernel and userspace memory. Accidental access to this is obviously disastrous, but specific access can @@ -25,7 +25,7 @@ config STRICT_DEVMEM config X86_VERBOSE_BOOTUP bool "Enable verbose x86 bootup info messages" default y - help + ---help--- Enables the informational output from the decompression stage (e.g. bzImage) of the boot. If you disable this you will still see errors. Disable this if you want silent bootup. @@ -33,7 +33,7 @@ config X86_VERBOSE_BOOTUP config EARLY_PRINTK bool "Early printk" if EMBEDDED default y - help + ---help--- Write kernel log output directly into the VGA buffer or to a serial port. @@ -47,7 +47,7 @@ config EARLY_PRINTK_DBGP bool "Early printk via EHCI debug port" default n depends on EARLY_PRINTK && PCI - help + ---help--- Write kernel log output directly into the EHCI debug port. This is useful for kernel debugging when your machine crashes very @@ -59,14 +59,14 @@ config EARLY_PRINTK_DBGP config DEBUG_STACKOVERFLOW bool "Check for stack overflows" depends on DEBUG_KERNEL - help + ---help--- This option will cause messages to be printed if free stack space drops below a certain limit. config DEBUG_STACK_USAGE bool "Stack utilization instrumentation" depends on DEBUG_KERNEL - help + ---help--- Enables the display of the minimum amount of free stack which each task has ever had available in the sysrq-T and sysrq-P debug output. @@ -75,7 +75,7 @@ config DEBUG_STACK_USAGE config DEBUG_PAGEALLOC bool "Debug page memory allocations" depends on DEBUG_KERNEL - help + ---help--- Unmap pages from the kernel linear mapping after free_pages(). This results in a large slowdown, but helps to find certain types of memory corruptions. @@ -83,9 +83,9 @@ config DEBUG_PAGEALLOC config DEBUG_PER_CPU_MAPS bool "Debug access to per_cpu maps" depends on DEBUG_KERNEL - depends on X86_SMP + depends on SMP default n - help + ---help--- Say Y to verify that the per_cpu map being accessed has been setup. Adds a fair amount of code to kernel memory and decreases performance. @@ -96,7 +96,7 @@ config X86_PTDUMP bool "Export kernel pagetable layout to userspace via debugfs" depends on DEBUG_KERNEL select DEBUG_FS - help + ---help--- Say Y here if you want to show the kernel pagetable layout in a debugfs file. This information is only useful for kernel developers who are working in architecture specific areas of the kernel. @@ -108,7 +108,7 @@ config DEBUG_RODATA bool "Write protect kernel read-only data structures" default y depends on DEBUG_KERNEL - help + ---help--- Mark the kernel read-only data as write-protected in the pagetables, in order to catch accidental (and incorrect) writes to such const data. This is recommended so that we can catch kernel bugs sooner. @@ -117,7 +117,8 @@ config DEBUG_RODATA config DEBUG_RODATA_TEST bool "Testcase for the DEBUG_RODATA feature" depends on DEBUG_RODATA - help + default y + ---help--- This option enables a testcase for the DEBUG_RODATA feature as well as for the change_page_attr() infrastructure. If in doubt, say "N" @@ -125,7 +126,7 @@ config DEBUG_RODATA_TEST config DEBUG_NX_TEST tristate "Testcase for the NX non-executable stack feature" depends on DEBUG_KERNEL && m - help + ---help--- This option enables a testcase for the CPU NX capability and the software setup of this feature. If in doubt, say "N" @@ -133,7 +134,8 @@ config DEBUG_NX_TEST config 4KSTACKS bool "Use 4Kb for kernel stacks instead of 8Kb" depends on X86_32 - help + default y + ---help--- If you say Y here the kernel will use a 4Kb stacksize for the kernel stack attached to each process/thread. This facilitates running more threads on a system and also reduces the pressure @@ -144,7 +146,7 @@ config DOUBLEFAULT default y bool "Enable doublefault exception handler" if EMBEDDED depends on X86_32 - help + ---help--- This option allows trapping of rare doublefault exceptions that would otherwise cause a system to silently reboot. Disabling this option saves about 4k and might cause you much additional grey @@ -154,7 +156,7 @@ config IOMMU_DEBUG bool "Enable IOMMU debugging" depends on GART_IOMMU && DEBUG_KERNEL depends on X86_64 - help + ---help--- Force the IOMMU to on even when you have less than 4GB of memory and add debugging code. On overflow always panic. And allow to enable IOMMU leak tracing. Can be disabled at boot @@ -170,7 +172,7 @@ config IOMMU_LEAK bool "IOMMU leak tracing" depends on DEBUG_KERNEL depends on IOMMU_DEBUG - help + ---help--- Add a simple leak tracer to the IOMMU code. This is useful when you are debugging a buggy device driver that leaks IOMMU mappings. @@ -203,25 +205,25 @@ choice config IO_DELAY_0X80 bool "port 0x80 based port-IO delay [recommended]" - help + ---help--- This is the traditional Linux IO delay used for in/out_p. It is the most tested hence safest selection here. config IO_DELAY_0XED bool "port 0xed based port-IO delay" - help + ---help--- Use port 0xed as the IO delay. This frees up port 0x80 which is often used as a hardware-debug port. config IO_DELAY_UDELAY bool "udelay based port-IO delay" - help + ---help--- Use udelay(2) as the IO delay method. This provides the delay while not having any side-effect on the IO port space. config IO_DELAY_NONE bool "no port-IO delay" - help + ---help--- No port-IO delay. Will break on old boxes that require port-IO delay for certain operations. Should work on most new machines. @@ -251,22 +253,111 @@ config DEFAULT_IO_DELAY_TYPE default IO_DELAY_TYPE_NONE endif +menuconfig KMEMCHECK + bool "kmemcheck: trap use of uninitialized memory" + depends on X86 + depends on !X86_USE_3DNOW + depends on (SLUB && !SLUB_DEBUG_ON) || (SLAB && !DEBUG_SLAB) + depends on !CC_OPTIMIZE_FOR_SIZE + depends on !DEBUG_PAGEALLOC + depends on !FUNCTION_TRACER + select FRAME_POINTER + select STACKTRACE + default n + help + This option enables tracing of dynamically allocated kernel memory + to see if memory is used before it has been given an initial value. + Be aware that this requires half of your memory for bookkeeping and + will insert extra code at *every* read and write to tracked memory + thus slow down the kernel code (but user code is unaffected). + + The kernel may be started with kmemcheck=0 or kmemcheck=1 to disable + or enable kmemcheck at boot-time. If the kernel is started with + kmemcheck=0, the large memory and CPU overhead is not incurred. + +choice + prompt "kmemcheck: default mode at boot" + depends on KMEMCHECK + default KMEMCHECK_ONESHOT_BY_DEFAULT + help + This option controls the default behaviour of kmemcheck when the + kernel boots and no kmemcheck= parameter is given. + +config KMEMCHECK_DISABLED_BY_DEFAULT + bool "disabled" + depends on KMEMCHECK + +config KMEMCHECK_ENABLED_BY_DEFAULT + bool "enabled" + depends on KMEMCHECK + +config KMEMCHECK_ONESHOT_BY_DEFAULT + bool "one-shot" + depends on KMEMCHECK + help + In one-shot mode, only the first error detected is reported before + kmemcheck is disabled. + +endchoice + +config KMEMCHECK_QUEUE_SIZE + int "kmemcheck: error queue size" + depends on KMEMCHECK + default 64 + help + Select the maximum number of errors to store in the queue. Since + errors can occur virtually anywhere and in any context, we need a + temporary storage area which is guarantueed not to generate any + other faults. The queue will be emptied as soon as a tasklet may + be scheduled. If the queue is full, new error reports will be + lost. + +config KMEMCHECK_SHADOW_COPY_SHIFT + int "kmemcheck: shadow copy size (5 => 32 bytes, 6 => 64 bytes)" + depends on KMEMCHECK + range 2 8 + default 5 + help + Select the number of shadow bytes to save along with each entry of + the queue. These bytes indicate what parts of an allocation are + initialized, uninitialized, etc. and will be displayed when an + error is detected to help the debugging of a particular problem. + +config KMEMCHECK_PARTIAL_OK + bool "kmemcheck: allow partially uninitialized memory" + depends on KMEMCHECK + default y + help + This option works around certain GCC optimizations that produce + 32-bit reads from 16-bit variables where the upper 16 bits are + thrown away afterwards. This may of course also hide some real + bugs. + +config KMEMCHECK_BITOPS_OK + bool "kmemcheck: allow bit-field manipulation" + depends on KMEMCHECK + default n + help + This option silences warnings that would be generated for bit-field + accesses where not all the bits are initialized at the same time. + This may also hide some real bugs. + config DEBUG_BOOT_PARAMS bool "Debug boot parameters" depends on DEBUG_KERNEL depends on DEBUG_FS - help + ---help--- This option will cause struct boot_params to be exported via debugfs. config CPA_DEBUG bool "CPA self-test code" depends on DEBUG_KERNEL - help + ---help--- Do change_page_attr() self-tests every 30 seconds. config OPTIMIZE_INLINING bool "Allow gcc to uninline functions marked 'inline'" - help + ---help--- This option determines if the kernel forces gcc to inline the functions developers have marked 'inline'. Doing so takes away freedom from gcc to do what it thinks is best, which is desirable for the gcc 3.x series of @@ -279,4 +370,3 @@ config OPTIMIZE_INLINING If unsure, say N. endmenu - Index: linux-2.6-tip/arch/x86/Makefile =================================================================== --- linux-2.6-tip.orig/arch/x86/Makefile +++ linux-2.6-tip/arch/x86/Makefile @@ -70,14 +70,22 @@ else # this works around some issues with generating unwind tables in older gccs # newer gccs do it by default KBUILD_CFLAGS += -maccumulate-outgoing-args +endif - stackp := $(CONFIG_SHELL) $(srctree)/scripts/gcc-x86_64-has-stack-protector.sh - stackp-$(CONFIG_CC_STACKPROTECTOR) := $(shell $(stackp) \ - "$(CC)" -fstack-protector ) - stackp-$(CONFIG_CC_STACKPROTECTOR_ALL) += $(shell $(stackp) \ - "$(CC)" -fstack-protector-all ) +ifdef CONFIG_CC_STACKPROTECTOR + cc_has_sp := $(srctree)/scripts/gcc-x86_$(BITS)-has-stack-protector.sh + ifeq ($(shell $(CONFIG_SHELL) $(cc_has_sp) $(CC)),y) + stackp-y := -fstack-protector + stackp-$(CONFIG_CC_STACKPROTECTOR_ALL) += -fstack-protector-all + KBUILD_CFLAGS += $(stackp-y) + else + $(warning stack protector enabled but no compiler support) + endif +endif - KBUILD_CFLAGS += $(stackp-y) +# Don't unroll struct assignments with kmemcheck enabled +ifeq ($(CONFIG_KMEMCHECK),y) + KBUILD_CFLAGS += $(call cc-option,-fno-builtin-memcpy) endif # Stackpointer is addressed different for 32 bit and 64 bit x86 @@ -102,29 +110,6 @@ KBUILD_CFLAGS += -fno-asynchronous-unwin # prevent gcc from generating any FP code by mistake KBUILD_CFLAGS += $(call cc-option,-mno-sse -mno-mmx -mno-sse2 -mno-3dnow,) -### -# Sub architecture support -# fcore-y is linked before mcore-y files. - -# Default subarch .c files -mcore-y := arch/x86/mach-default/ - -# Voyager subarch support -mflags-$(CONFIG_X86_VOYAGER) := -Iarch/x86/include/asm/mach-voyager -mcore-$(CONFIG_X86_VOYAGER) := arch/x86/mach-voyager/ - -# generic subarchitecture -mflags-$(CONFIG_X86_GENERICARCH):= -Iarch/x86/include/asm/mach-generic -fcore-$(CONFIG_X86_GENERICARCH) += arch/x86/mach-generic/ -mcore-$(CONFIG_X86_GENERICARCH) := arch/x86/mach-default/ - -# default subarch .h files -mflags-y += -Iarch/x86/include/asm/mach-default - -# 64 bit does not support subarch support - clear sub arch variables -fcore-$(CONFIG_X86_64) := -mcore-$(CONFIG_X86_64) := - KBUILD_CFLAGS += $(mflags-y) KBUILD_AFLAGS += $(mflags-y) @@ -150,9 +135,6 @@ core-$(CONFIG_LGUEST_GUEST) += arch/x86/ core-y += arch/x86/kernel/ core-y += arch/x86/mm/ -# Remaining sub architecture files -core-y += $(mcore-y) - core-y += arch/x86/crypto/ core-y += arch/x86/vdso/ core-$(CONFIG_IA32_EMULATION) += arch/x86/ia32/ Index: linux-2.6-tip/arch/x86/boot/Makefile =================================================================== --- linux-2.6-tip.orig/arch/x86/boot/Makefile +++ linux-2.6-tip/arch/x86/boot/Makefile @@ -32,7 +32,6 @@ setup-y += a20.o cmdline.o copy.o cpu.o setup-y += header.o main.o mca.o memory.o pm.o pmjump.o setup-y += printf.o string.o tty.o video.o video-mode.o version.o setup-$(CONFIG_X86_APM_BOOT) += apm.o -setup-$(CONFIG_X86_VOYAGER) += voyager.o # The link order of the video-*.o modules can matter. In particular, # video-vga.o *must* be listed first, followed by video-vesa.o. Index: linux-2.6-tip/arch/x86/boot/a20.c =================================================================== --- linux-2.6-tip.orig/arch/x86/boot/a20.c +++ linux-2.6-tip/arch/x86/boot/a20.c @@ -2,6 +2,7 @@ * * Copyright (C) 1991, 1992 Linus Torvalds * Copyright 2007-2008 rPath, Inc. - All Rights Reserved + * Copyright 2009 Intel Corporation * * This file is part of the Linux kernel, and is made available under * the terms of the GNU General Public License version 2. @@ -15,16 +16,23 @@ #include "boot.h" #define MAX_8042_LOOPS 100000 +#define MAX_8042_FF 32 static int empty_8042(void) { u8 status; int loops = MAX_8042_LOOPS; + int ffs = MAX_8042_FF; while (loops--) { io_delay(); status = inb(0x64); + if (status == 0xff) { + /* FF is a plausible, but very unlikely status */ + if (!--ffs) + return -1; /* Assume no KBC present */ + } if (status & 1) { /* Read and discard input data */ io_delay(); @@ -118,44 +126,37 @@ static void enable_a20_fast(void) int enable_a20(void) { -#if defined(CONFIG_X86_ELAN) - /* Elan croaks if we try to touch the KBC */ - enable_a20_fast(); - while (!a20_test_long()) - ; - return 0; -#elif defined(CONFIG_X86_VOYAGER) - /* On Voyager, a20_test() is unsafe? */ - enable_a20_kbc(); - return 0; -#else int loops = A20_ENABLE_LOOPS; - while (loops--) { - /* First, check to see if A20 is already enabled - (legacy free, etc.) */ - if (a20_test_short()) - return 0; - - /* Next, try the BIOS (INT 0x15, AX=0x2401) */ - enable_a20_bios(); - if (a20_test_short()) - return 0; - - /* Try enabling A20 through the keyboard controller */ - empty_8042(); - if (a20_test_short()) - return 0; /* BIOS worked, but with delayed reaction */ + int kbc_err; - enable_a20_kbc(); - if (a20_test_long()) - return 0; - - /* Finally, try enabling the "fast A20 gate" */ - enable_a20_fast(); - if (a20_test_long()) - return 0; - } - - return -1; -#endif + while (loops--) { + /* First, check to see if A20 is already enabled + (legacy free, etc.) */ + if (a20_test_short()) + return 0; + + /* Next, try the BIOS (INT 0x15, AX=0x2401) */ + enable_a20_bios(); + if (a20_test_short()) + return 0; + + /* Try enabling A20 through the keyboard controller */ + kbc_err = empty_8042(); + + if (a20_test_short()) + return 0; /* BIOS worked, but with delayed reaction */ + + if (!kbc_err) { + enable_a20_kbc(); + if (a20_test_long()) + return 0; + } + + /* Finally, try enabling the "fast A20 gate" */ + enable_a20_fast(); + if (a20_test_long()) + return 0; + } + + return -1; } Index: linux-2.6-tip/arch/x86/boot/boot.h =================================================================== --- linux-2.6-tip.orig/arch/x86/boot/boot.h +++ linux-2.6-tip/arch/x86/boot/boot.h @@ -302,9 +302,6 @@ void probe_cards(int unsafe); /* video-vesa.c */ void vesa_store_edid(void); -/* voyager.c */ -int query_voyager(void); - #endif /* __ASSEMBLY__ */ #endif /* BOOT_BOOT_H */ Index: linux-2.6-tip/arch/x86/boot/compressed/head_32.S =================================================================== --- linux-2.6-tip.orig/arch/x86/boot/compressed/head_32.S +++ linux-2.6-tip/arch/x86/boot/compressed/head_32.S @@ -25,14 +25,12 @@ #include #include -#include +#include #include #include .section ".text.head","ax",@progbits - .globl startup_32 - -startup_32: +ENTRY(startup_32) cld /* test KEEP_SEGMENTS flag to see if the bootloader is asking * us to not reload segments */ @@ -113,6 +111,8 @@ startup_32: */ leal relocated(%ebx), %eax jmp *%eax +ENDPROC(startup_32) + .section ".text" relocated: Index: linux-2.6-tip/arch/x86/boot/compressed/head_64.S =================================================================== --- linux-2.6-tip.orig/arch/x86/boot/compressed/head_64.S +++ linux-2.6-tip/arch/x86/boot/compressed/head_64.S @@ -26,8 +26,8 @@ #include #include -#include -#include +#include +#include #include #include #include @@ -35,9 +35,7 @@ .section ".text.head" .code32 - .globl startup_32 - -startup_32: +ENTRY(startup_32) cld /* test KEEP_SEGMENTS flag to see if the bootloader is asking * us to not reload segments */ @@ -176,6 +174,7 @@ startup_32: /* Jump from 32bit compatibility mode into 64bit mode. */ lret +ENDPROC(startup_32) no_longmode: /* This isn't an x86-64 CPU so hang */ @@ -295,7 +294,6 @@ relocated: call decompress_kernel popq %rsi - /* * Jump to the decompressed kernel. */ Index: linux-2.6-tip/arch/x86/boot/copy.S =================================================================== --- linux-2.6-tip.orig/arch/x86/boot/copy.S +++ linux-2.6-tip/arch/x86/boot/copy.S @@ -8,6 +8,8 @@ * * ----------------------------------------------------------------------- */ +#include + /* * Memory copy routines */ @@ -15,9 +17,7 @@ .code16gcc .text - .globl memcpy - .type memcpy, @function -memcpy: +GLOBAL(memcpy) pushw %si pushw %di movw %ax, %di @@ -31,11 +31,9 @@ memcpy: popw %di popw %si ret - .size memcpy, .-memcpy +ENDPROC(memcpy) - .globl memset - .type memset, @function -memset: +GLOBAL(memset) pushw %di movw %ax, %di movzbl %dl, %eax @@ -48,52 +46,42 @@ memset: rep; stosb popw %di ret - .size memset, .-memset +ENDPROC(memset) - .globl copy_from_fs - .type copy_from_fs, @function -copy_from_fs: +GLOBAL(copy_from_fs) pushw %ds pushw %fs popw %ds call memcpy popw %ds ret - .size copy_from_fs, .-copy_from_fs +ENDPROC(copy_from_fs) - .globl copy_to_fs - .type copy_to_fs, @function -copy_to_fs: +GLOBAL(copy_to_fs) pushw %es pushw %fs popw %es call memcpy popw %es ret - .size copy_to_fs, .-copy_to_fs +ENDPROC(copy_to_fs) #if 0 /* Not currently used, but can be enabled as needed */ - - .globl copy_from_gs - .type copy_from_gs, @function -copy_from_gs: +GLOBAL(copy_from_gs) pushw %ds pushw %gs popw %ds call memcpy popw %ds ret - .size copy_from_gs, .-copy_from_gs - .globl copy_to_gs +ENDPROC(copy_from_gs) - .type copy_to_gs, @function -copy_to_gs: +GLOBAL(copy_to_gs) pushw %es pushw %gs popw %es call memcpy popw %es ret - .size copy_to_gs, .-copy_to_gs - +ENDPROC(copy_to_gs) #endif Index: linux-2.6-tip/arch/x86/boot/header.S =================================================================== --- linux-2.6-tip.orig/arch/x86/boot/header.S +++ linux-2.6-tip/arch/x86/boot/header.S @@ -19,7 +19,7 @@ #include #include #include -#include +#include #include #include "boot.h" #include "offsets.h" Index: linux-2.6-tip/arch/x86/boot/main.c =================================================================== --- linux-2.6-tip.orig/arch/x86/boot/main.c +++ linux-2.6-tip/arch/x86/boot/main.c @@ -149,11 +149,6 @@ void main(void) /* Query MCA information */ query_mca(); - /* Voyager */ -#ifdef CONFIG_X86_VOYAGER - query_voyager(); -#endif - /* Query Intel SpeedStep (IST) information */ query_ist(); Index: linux-2.6-tip/arch/x86/boot/pmjump.S =================================================================== --- linux-2.6-tip.orig/arch/x86/boot/pmjump.S +++ linux-2.6-tip/arch/x86/boot/pmjump.S @@ -15,18 +15,15 @@ #include #include #include +#include .text - - .globl protected_mode_jump - .type protected_mode_jump, @function - .code16 /* * void protected_mode_jump(u32 entrypoint, u32 bootparams); */ -protected_mode_jump: +GLOBAL(protected_mode_jump) movl %edx, %esi # Pointer to boot_params table xorl %ebx, %ebx @@ -47,12 +44,10 @@ protected_mode_jump: .byte 0x66, 0xea # ljmpl opcode 2: .long in_pm32 # offset .word __BOOT_CS # segment - - .size protected_mode_jump, .-protected_mode_jump +ENDPROC(protected_mode_jump) .code32 - .type in_pm32, @function -in_pm32: +GLOBAL(in_pm32) # Set up data segments for flat 32-bit mode movl %ecx, %ds movl %ecx, %es @@ -78,5 +73,4 @@ in_pm32: lldt %cx jmpl *%eax # Jump to the 32-bit entrypoint - - .size in_pm32, .-in_pm32 +ENDPROC(in_pm32) Index: linux-2.6-tip/arch/x86/boot/video-mode.c =================================================================== --- linux-2.6-tip.orig/arch/x86/boot/video-mode.c +++ linux-2.6-tip/arch/x86/boot/video-mode.c @@ -147,7 +147,7 @@ static void vga_recalc_vertical(void) int set_mode(u16 mode) { int rv; - u16 real_mode; + u16 uninitialized_var(real_mode); /* Very special mode numbers... */ if (mode == VIDEO_CURRENT_MODE) Index: linux-2.6-tip/arch/x86/boot/voyager.c =================================================================== --- linux-2.6-tip.orig/arch/x86/boot/voyager.c +++ /dev/null @@ -1,40 +0,0 @@ -/* -*- linux-c -*- ------------------------------------------------------- * - * - * Copyright (C) 1991, 1992 Linus Torvalds - * Copyright 2007 rPath, Inc. - All Rights Reserved - * - * This file is part of the Linux kernel, and is made available under - * the terms of the GNU General Public License version 2. - * - * ----------------------------------------------------------------------- */ - -/* - * Get the Voyager config information - */ - -#include "boot.h" - -int query_voyager(void) -{ - u8 err; - u16 es, di; - /* Abuse the apm_bios_info area for this */ - u8 *data_ptr = (u8 *)&boot_params.apm_bios_info; - - data_ptr[0] = 0xff; /* Flag on config not found(?) */ - - asm("pushw %%es ; " - "int $0x15 ; " - "setc %0 ; " - "movw %%es, %1 ; " - "popw %%es" - : "=q" (err), "=r" (es), "=D" (di) - : "a" (0xffc0)); - - if (err) - return -1; /* Not Voyager */ - - set_fs(es); - copy_from_fs(data_ptr, di, 7); /* Table is 7 bytes apparently */ - return 0; -} Index: linux-2.6-tip/arch/x86/configs/i386_defconfig =================================================================== --- linux-2.6-tip.orig/arch/x86/configs/i386_defconfig +++ linux-2.6-tip/arch/x86/configs/i386_defconfig @@ -1,14 +1,13 @@ # # Automatically generated make config: don't edit -# Linux kernel version: 2.6.27-rc5 -# Wed Sep 3 17:23:09 2008 +# Linux kernel version: 2.6.29-rc4 +# Thu Feb 12 12:57:57 2009 # # CONFIG_64BIT is not set CONFIG_X86_32=y # CONFIG_X86_64 is not set CONFIG_X86=y CONFIG_ARCH_DEFCONFIG="arch/x86/configs/i386_defconfig" -# CONFIG_GENERIC_LOCKBREAK is not set CONFIG_GENERIC_TIME=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_CLOCKSOURCE_WATCHDOG=y @@ -24,16 +23,14 @@ CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y -# CONFIG_GENERIC_GPIO is not set CONFIG_ARCH_MAY_HAVE_PC_FDC=y # CONFIG_RWSEM_GENERIC_SPINLOCK is not set CONFIG_RWSEM_XCHGADD_ALGORITHM=y -# CONFIG_ARCH_HAS_ILOG2_U32 is not set -# CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y CONFIG_GENERIC_CALIBRATE_DELAY=y # CONFIG_GENERIC_TIME_VSYSCALL is not set CONFIG_ARCH_HAS_CPU_RELAX=y +CONFIG_ARCH_HAS_DEFAULT_IDLE=y CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y # CONFIG_HAVE_CPUMASK_OF_CPU_MAP is not set @@ -42,12 +39,12 @@ CONFIG_ARCH_SUSPEND_POSSIBLE=y # CONFIG_ZONE_DMA32 is not set CONFIG_ARCH_POPULATES_NODE_MAP=y # CONFIG_AUDIT_ARCH is not set -CONFIG_ARCH_SUPPORTS_AOUT=y CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_X86_SMP=y +CONFIG_USE_GENERIC_SMP_HELPERS=y CONFIG_X86_32_SMP=y CONFIG_X86_HT=y CONFIG_X86_BIOS_REBOOT=y @@ -76,30 +73,44 @@ CONFIG_TASK_IO_ACCOUNTING=y CONFIG_AUDIT=y CONFIG_AUDITSYSCALL=y CONFIG_AUDIT_TREE=y + +# +# RCU Subsystem +# +# CONFIG_CLASSIC_RCU is not set +CONFIG_TREE_RCU=y +# CONFIG_PREEMPT_RCU is not set +# CONFIG_RCU_TRACE is not set +CONFIG_RCU_FANOUT=32 +# CONFIG_RCU_FANOUT_EXACT is not set +# CONFIG_TREE_RCU_TRACE is not set +# CONFIG_PREEMPT_RCU_TRACE is not set # CONFIG_IKCONFIG is not set CONFIG_LOG_BUF_SHIFT=18 -CONFIG_CGROUPS=y -# CONFIG_CGROUP_DEBUG is not set -CONFIG_CGROUP_NS=y -# CONFIG_CGROUP_DEVICE is not set -CONFIG_CPUSETS=y CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y CONFIG_GROUP_SCHED=y CONFIG_FAIR_GROUP_SCHED=y # CONFIG_RT_GROUP_SCHED is not set # CONFIG_USER_SCHED is not set CONFIG_CGROUP_SCHED=y +CONFIG_CGROUPS=y +# CONFIG_CGROUP_DEBUG is not set +CONFIG_CGROUP_NS=y +CONFIG_CGROUP_FREEZER=y +# CONFIG_CGROUP_DEVICE is not set +CONFIG_CPUSETS=y +CONFIG_PROC_PID_CPUSET=y CONFIG_CGROUP_CPUACCT=y CONFIG_RESOURCE_COUNTERS=y # CONFIG_CGROUP_MEM_RES_CTLR is not set # CONFIG_SYSFS_DEPRECATED_V2 is not set -CONFIG_PROC_PID_CPUSET=y CONFIG_RELAY=y CONFIG_NAMESPACES=y CONFIG_UTS_NS=y CONFIG_IPC_NS=y CONFIG_USER_NS=y CONFIG_PID_NS=y +CONFIG_NET_NS=y CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE="" CONFIG_CC_OPTIMIZE_FOR_SIZE=y @@ -124,12 +135,15 @@ CONFIG_SIGNALFD=y CONFIG_TIMERFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y +CONFIG_AIO=y CONFIG_VM_EVENT_COUNTERS=y +CONFIG_PCI_QUIRKS=y CONFIG_SLUB_DEBUG=y # CONFIG_SLAB is not set CONFIG_SLUB=y # CONFIG_SLOB is not set CONFIG_PROFILING=y +CONFIG_TRACEPOINTS=y CONFIG_MARKERS=y # CONFIG_OPROFILE is not set CONFIG_HAVE_OPROFILE=y @@ -139,15 +153,10 @@ CONFIG_KRETPROBES=y CONFIG_HAVE_IOREMAP_PROT=y CONFIG_HAVE_KPROBES=y CONFIG_HAVE_KRETPROBES=y -# CONFIG_HAVE_ARCH_TRACEHOOK is not set -# CONFIG_HAVE_DMA_ATTRS is not set -CONFIG_USE_GENERIC_SMP_HELPERS=y -# CONFIG_HAVE_CLK is not set -CONFIG_PROC_PAGE_MONITOR=y +CONFIG_HAVE_ARCH_TRACEHOOK=y CONFIG_HAVE_GENERIC_DMA_COHERENT=y CONFIG_SLABINFO=y CONFIG_RT_MUTEXES=y -# CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 CONFIG_MODULES=y # CONFIG_MODULE_FORCE_LOAD is not set @@ -155,12 +164,10 @@ CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y # CONFIG_MODVERSIONS is not set # CONFIG_MODULE_SRCVERSION_ALL is not set -CONFIG_KMOD=y CONFIG_STOP_MACHINE=y CONFIG_BLOCK=y # CONFIG_LBD is not set CONFIG_BLK_DEV_IO_TRACE=y -# CONFIG_LSF is not set CONFIG_BLK_DEV_BSG=y # CONFIG_BLK_DEV_INTEGRITY is not set @@ -176,7 +183,7 @@ CONFIG_IOSCHED_CFQ=y CONFIG_DEFAULT_CFQ=y # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="cfq" -CONFIG_CLASSIC_RCU=y +CONFIG_FREEZER=y # # Processor type and features @@ -186,15 +193,14 @@ CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y CONFIG_GENERIC_CLOCKEVENTS_BUILD=y CONFIG_SMP=y +CONFIG_SPARSE_IRQ=y CONFIG_X86_FIND_SMP_CONFIG=y CONFIG_X86_MPPARSE=y -CONFIG_X86_PC=y # CONFIG_X86_ELAN is not set -# CONFIG_X86_VOYAGER is not set # CONFIG_X86_GENERICARCH is not set # CONFIG_X86_VSMP is not set # CONFIG_X86_RDC321X is not set -CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y +CONFIG_SCHED_OMIT_FRAME_POINTER=y # CONFIG_PARAVIRT_GUEST is not set # CONFIG_MEMTEST is not set # CONFIG_M386 is not set @@ -238,10 +244,19 @@ CONFIG_X86_TSC=y CONFIG_X86_CMOV=y CONFIG_X86_MINIMUM_CPU_FAMILY=4 CONFIG_X86_DEBUGCTLMSR=y +CONFIG_CPU_SUP_INTEL=y +CONFIG_CPU_SUP_CYRIX_32=y +CONFIG_CPU_SUP_AMD=y +CONFIG_CPU_SUP_CENTAUR_32=y +CONFIG_CPU_SUP_TRANSMETA_32=y +CONFIG_CPU_SUP_UMC_32=y +CONFIG_X86_DS=y +CONFIG_X86_PTRACE_BTS=y CONFIG_HPET_TIMER=y CONFIG_HPET_EMULATE_RTC=y CONFIG_DMI=y # CONFIG_IOMMU_HELPER is not set +# CONFIG_IOMMU_API is not set CONFIG_NR_CPUS=64 CONFIG_SCHED_SMT=y CONFIG_SCHED_MC=y @@ -250,12 +265,15 @@ CONFIG_PREEMPT_VOLUNTARY=y # CONFIG_PREEMPT is not set CONFIG_X86_LOCAL_APIC=y CONFIG_X86_IO_APIC=y +CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y # CONFIG_X86_MCE is not set CONFIG_VM86=y # CONFIG_TOSHIBA is not set # CONFIG_I8K is not set CONFIG_X86_REBOOTFIXUPS=y CONFIG_MICROCODE=y +CONFIG_MICROCODE_INTEL=y +CONFIG_MICROCODE_AMD=y CONFIG_MICROCODE_OLD_INTERFACE=y CONFIG_X86_MSR=y CONFIG_X86_CPUID=y @@ -264,6 +282,7 @@ CONFIG_HIGHMEM4G=y # CONFIG_HIGHMEM64G is not set CONFIG_PAGE_OFFSET=0xC0000000 CONFIG_HIGHMEM=y +# CONFIG_ARCH_PHYS_ADDR_T_64BIT is not set CONFIG_ARCH_FLATMEM_ENABLE=y CONFIG_ARCH_SPARSEMEM_ENABLE=y CONFIG_ARCH_SELECT_MEMORY_MODEL=y @@ -274,14 +293,17 @@ CONFIG_FLATMEM_MANUAL=y CONFIG_FLATMEM=y CONFIG_FLAT_NODE_MEM_MAP=y CONFIG_SPARSEMEM_STATIC=y -# CONFIG_SPARSEMEM_VMEMMAP_ENABLE is not set CONFIG_PAGEFLAGS_EXTENDED=y CONFIG_SPLIT_PTLOCK_CPUS=4 -CONFIG_RESOURCES_64BIT=y +# CONFIG_PHYS_ADDR_T_64BIT is not set CONFIG_ZONE_DMA_FLAG=1 CONFIG_BOUNCE=y CONFIG_VIRT_TO_BUS=y +CONFIG_UNEVICTABLE_LRU=y CONFIG_HIGHPTE=y +CONFIG_X86_CHECK_BIOS_CORRUPTION=y +CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y +CONFIG_X86_RESERVE_LOW_64K=y # CONFIG_MATH_EMULATION is not set CONFIG_MTRR=y # CONFIG_MTRR_SANITIZER is not set @@ -302,10 +324,11 @@ CONFIG_PHYSICAL_START=0x1000000 CONFIG_PHYSICAL_ALIGN=0x200000 CONFIG_HOTPLUG_CPU=y # CONFIG_COMPAT_VDSO is not set +# CONFIG_CMDLINE_BOOL is not set CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y # -# Power management options +# Power management and ACPI options # CONFIG_PM=y CONFIG_PM_DEBUG=y @@ -331,19 +354,13 @@ CONFIG_ACPI_BATTERY=y CONFIG_ACPI_BUTTON=y CONFIG_ACPI_FAN=y CONFIG_ACPI_DOCK=y -# CONFIG_ACPI_BAY is not set CONFIG_ACPI_PROCESSOR=y CONFIG_ACPI_HOTPLUG_CPU=y CONFIG_ACPI_THERMAL=y -# CONFIG_ACPI_WMI is not set -# CONFIG_ACPI_ASUS is not set -# CONFIG_ACPI_TOSHIBA is not set # CONFIG_ACPI_CUSTOM_DSDT is not set CONFIG_ACPI_BLACKLIST_YEAR=0 # CONFIG_ACPI_DEBUG is not set -CONFIG_ACPI_EC=y # CONFIG_ACPI_PCI_SLOT is not set -CONFIG_ACPI_POWER=y CONFIG_ACPI_SYSTEM=y CONFIG_X86_PM_TIMER=y CONFIG_ACPI_CONTAINER=y @@ -388,7 +405,6 @@ CONFIG_X86_ACPI_CPUFREQ=y # # shared options # -# CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set # CONFIG_X86_SPEEDSTEP_LIB is not set CONFIG_CPU_IDLE=y CONFIG_CPU_IDLE_GOV_LADDER=y @@ -415,6 +431,7 @@ CONFIG_ARCH_SUPPORTS_MSI=y CONFIG_PCI_MSI=y # CONFIG_PCI_LEGACY is not set # CONFIG_PCI_DEBUG is not set +# CONFIG_PCI_STUB is not set CONFIG_HT_IRQ=y CONFIG_ISA_DMA_API=y # CONFIG_ISA is not set @@ -452,13 +469,17 @@ CONFIG_HOTPLUG_PCI=y # Executable file formats / Emulations # CONFIG_BINFMT_ELF=y +CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y +CONFIG_HAVE_AOUT=y # CONFIG_BINFMT_AOUT is not set CONFIG_BINFMT_MISC=y +CONFIG_HAVE_ATOMIC_IOMAP=y CONFIG_NET=y # # Networking options # +CONFIG_COMPAT_NET_DEV_OPS=y CONFIG_PACKET=y CONFIG_PACKET_MMAP=y CONFIG_UNIX=y @@ -519,7 +540,6 @@ CONFIG_DEFAULT_CUBIC=y # CONFIG_DEFAULT_RENO is not set CONFIG_DEFAULT_TCP_CONG="cubic" CONFIG_TCP_MD5SIG=y -# CONFIG_IP_VS is not set CONFIG_IPV6=y # CONFIG_IPV6_PRIVACY is not set # CONFIG_IPV6_ROUTER_PREF is not set @@ -557,19 +577,21 @@ CONFIG_NF_CONNTRACK_IRC=y CONFIG_NF_CONNTRACK_SIP=y CONFIG_NF_CT_NETLINK=y CONFIG_NETFILTER_XTABLES=y +CONFIG_NETFILTER_XT_TARGET_CONNSECMARK=y CONFIG_NETFILTER_XT_TARGET_MARK=y CONFIG_NETFILTER_XT_TARGET_NFLOG=y CONFIG_NETFILTER_XT_TARGET_SECMARK=y -CONFIG_NETFILTER_XT_TARGET_CONNSECMARK=y CONFIG_NETFILTER_XT_TARGET_TCPMSS=y CONFIG_NETFILTER_XT_MATCH_CONNTRACK=y CONFIG_NETFILTER_XT_MATCH_MARK=y CONFIG_NETFILTER_XT_MATCH_POLICY=y CONFIG_NETFILTER_XT_MATCH_STATE=y +# CONFIG_IP_VS is not set # # IP: Netfilter Configuration # +CONFIG_NF_DEFRAG_IPV4=y CONFIG_NF_CONNTRACK_IPV4=y CONFIG_NF_CONNTRACK_PROC_COMPAT=y CONFIG_IP_NF_IPTABLES=y @@ -595,8 +617,8 @@ CONFIG_IP_NF_MANGLE=y CONFIG_NF_CONNTRACK_IPV6=y CONFIG_IP6_NF_IPTABLES=y CONFIG_IP6_NF_MATCH_IPV6HEADER=y -CONFIG_IP6_NF_FILTER=y CONFIG_IP6_NF_TARGET_LOG=y +CONFIG_IP6_NF_FILTER=y CONFIG_IP6_NF_TARGET_REJECT=y CONFIG_IP6_NF_MANGLE=y # CONFIG_IP_DCCP is not set @@ -604,6 +626,7 @@ CONFIG_IP6_NF_MANGLE=y # CONFIG_TIPC is not set # CONFIG_ATM is not set # CONFIG_BRIDGE is not set +# CONFIG_NET_DSA is not set # CONFIG_VLAN_8021Q is not set # CONFIG_DECNET is not set CONFIG_LLC=y @@ -623,6 +646,7 @@ CONFIG_NET_SCHED=y # CONFIG_NET_SCH_HTB is not set # CONFIG_NET_SCH_HFSC is not set # CONFIG_NET_SCH_PRIO is not set +# CONFIG_NET_SCH_MULTIQ is not set # CONFIG_NET_SCH_RED is not set # CONFIG_NET_SCH_SFQ is not set # CONFIG_NET_SCH_TEQL is not set @@ -630,6 +654,7 @@ CONFIG_NET_SCHED=y # CONFIG_NET_SCH_GRED is not set # CONFIG_NET_SCH_DSMARK is not set # CONFIG_NET_SCH_NETEM is not set +# CONFIG_NET_SCH_DRR is not set # CONFIG_NET_SCH_INGRESS is not set # @@ -644,6 +669,7 @@ CONFIG_NET_CLS=y # CONFIG_NET_CLS_RSVP is not set # CONFIG_NET_CLS_RSVP6 is not set # CONFIG_NET_CLS_FLOW is not set +# CONFIG_NET_CLS_CGROUP is not set CONFIG_NET_EMATCH=y CONFIG_NET_EMATCH_STACK=32 # CONFIG_NET_EMATCH_CMP is not set @@ -659,7 +685,9 @@ CONFIG_NET_CLS_ACT=y # CONFIG_NET_ACT_NAT is not set # CONFIG_NET_ACT_PEDIT is not set # CONFIG_NET_ACT_SIMP is not set +# CONFIG_NET_ACT_SKBEDIT is not set CONFIG_NET_SCH_FIFO=y +# CONFIG_DCB is not set # # Network testing @@ -676,29 +704,33 @@ CONFIG_HAMRADIO=y # CONFIG_IRDA is not set # CONFIG_BT is not set # CONFIG_AF_RXRPC is not set +# CONFIG_PHONET is not set CONFIG_FIB_RULES=y - -# -# Wireless -# +CONFIG_WIRELESS=y CONFIG_CFG80211=y +# CONFIG_CFG80211_REG_DEBUG is not set CONFIG_NL80211=y +CONFIG_WIRELESS_OLD_REGULATORY=y CONFIG_WIRELESS_EXT=y CONFIG_WIRELESS_EXT_SYSFS=y +# CONFIG_LIB80211 is not set CONFIG_MAC80211=y # # Rate control algorithm selection # -CONFIG_MAC80211_RC_PID=y -CONFIG_MAC80211_RC_DEFAULT_PID=y -CONFIG_MAC80211_RC_DEFAULT="pid" +CONFIG_MAC80211_RC_MINSTREL=y +# CONFIG_MAC80211_RC_DEFAULT_PID is not set +CONFIG_MAC80211_RC_DEFAULT_MINSTREL=y +CONFIG_MAC80211_RC_DEFAULT="minstrel" # CONFIG_MAC80211_MESH is not set CONFIG_MAC80211_LEDS=y # CONFIG_MAC80211_DEBUGFS is not set # CONFIG_MAC80211_DEBUG_MENU is not set -# CONFIG_IEEE80211 is not set -# CONFIG_RFKILL is not set +# CONFIG_WIMAX is not set +CONFIG_RFKILL=y +# CONFIG_RFKILL_INPUT is not set +CONFIG_RFKILL_LEDS=y # CONFIG_NET_9P is not set # @@ -722,7 +754,7 @@ CONFIG_PROC_EVENTS=y # CONFIG_MTD is not set # CONFIG_PARPORT is not set CONFIG_PNP=y -# CONFIG_PNP_DEBUG is not set +CONFIG_PNP_DEBUG_MESSAGES=y # # Protocols @@ -750,20 +782,19 @@ CONFIG_BLK_DEV_RAM_SIZE=16384 CONFIG_MISC_DEVICES=y # CONFIG_IBM_ASM is not set # CONFIG_PHANTOM is not set -# CONFIG_EEPROM_93CX6 is not set # CONFIG_SGI_IOC4 is not set # CONFIG_TIFM_CORE is not set -# CONFIG_ACER_WMI is not set -# CONFIG_ASUS_LAPTOP is not set -# CONFIG_FUJITSU_LAPTOP is not set -# CONFIG_TC1100_WMI is not set -# CONFIG_MSI_LAPTOP is not set -# CONFIG_COMPAL_LAPTOP is not set -# CONFIG_SONY_LAPTOP is not set -# CONFIG_THINKPAD_ACPI is not set -# CONFIG_INTEL_MENLOW is not set +# CONFIG_ICS932S401 is not set # CONFIG_ENCLOSURE_SERVICES is not set # CONFIG_HP_ILO is not set +# CONFIG_C2PORT is not set + +# +# EEPROM support +# +# CONFIG_EEPROM_AT24 is not set +# CONFIG_EEPROM_LEGACY is not set +# CONFIG_EEPROM_93CX6 is not set CONFIG_HAVE_IDE=y # CONFIG_IDE is not set @@ -802,7 +833,7 @@ CONFIG_SCSI_WAIT_SCAN=m # CONFIG_SCSI_SPI_ATTRS=y # CONFIG_SCSI_FC_ATTRS is not set -CONFIG_SCSI_ISCSI_ATTRS=y +# CONFIG_SCSI_ISCSI_ATTRS is not set # CONFIG_SCSI_SAS_ATTRS is not set # CONFIG_SCSI_SAS_LIBSAS is not set # CONFIG_SCSI_SRP_ATTRS is not set @@ -875,6 +906,7 @@ CONFIG_PATA_OLDPIIX=y CONFIG_PATA_SCH=y CONFIG_MD=y CONFIG_BLK_DEV_MD=y +CONFIG_MD_AUTODETECT=y # CONFIG_MD_LINEAR is not set # CONFIG_MD_RAID0 is not set # CONFIG_MD_RAID1 is not set @@ -930,6 +962,9 @@ CONFIG_PHYLIB=y # CONFIG_BROADCOM_PHY is not set # CONFIG_ICPLUS_PHY is not set # CONFIG_REALTEK_PHY is not set +# CONFIG_NATIONAL_PHY is not set +# CONFIG_STE10XP is not set +# CONFIG_LSI_ET1011C_PHY is not set # CONFIG_FIXED_PHY is not set # CONFIG_MDIO_BITBANG is not set CONFIG_NET_ETHERNET=y @@ -953,6 +988,9 @@ CONFIG_NET_TULIP=y # CONFIG_IBM_NEW_EMAC_RGMII is not set # CONFIG_IBM_NEW_EMAC_TAH is not set # CONFIG_IBM_NEW_EMAC_EMAC4 is not set +# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set +# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set +# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set CONFIG_NET_PCI=y # CONFIG_PCNET32 is not set # CONFIG_AMD8111_ETH is not set @@ -960,7 +998,6 @@ CONFIG_NET_PCI=y # CONFIG_B44 is not set CONFIG_FORCEDETH=y # CONFIG_FORCEDETH_NAPI is not set -# CONFIG_EEPRO100 is not set CONFIG_E100=y # CONFIG_FEALNX is not set # CONFIG_NATSEMI is not set @@ -974,15 +1011,16 @@ CONFIG_8139TOO=y # CONFIG_R6040 is not set # CONFIG_SIS900 is not set # CONFIG_EPIC100 is not set +# CONFIG_SMSC9420 is not set # CONFIG_SUNDANCE is not set # CONFIG_TLAN is not set # CONFIG_VIA_RHINE is not set # CONFIG_SC92031 is not set +# CONFIG_ATL2 is not set CONFIG_NETDEV_1000=y # CONFIG_ACENIC is not set # CONFIG_DL2K is not set CONFIG_E1000=y -# CONFIG_E1000_DISABLE_PACKET_SPLIT is not set CONFIG_E1000E=y # CONFIG_IP1000 is not set # CONFIG_IGB is not set @@ -1000,18 +1038,23 @@ CONFIG_BNX2=y # CONFIG_QLA3XXX is not set # CONFIG_ATL1 is not set # CONFIG_ATL1E is not set +# CONFIG_JME is not set CONFIG_NETDEV_10000=y # CONFIG_CHELSIO_T1 is not set +CONFIG_CHELSIO_T3_DEPENDS=y # CONFIG_CHELSIO_T3 is not set +# CONFIG_ENIC is not set # CONFIG_IXGBE is not set # CONFIG_IXGB is not set # CONFIG_S2IO is not set # CONFIG_MYRI10GE is not set # CONFIG_NETXEN_NIC is not set # CONFIG_NIU is not set +# CONFIG_MLX4_EN is not set # CONFIG_MLX4_CORE is not set # CONFIG_TEHUTI is not set # CONFIG_BNX2X is not set +# CONFIG_QLGE is not set # CONFIG_SFC is not set CONFIG_TR=y # CONFIG_IBMOL is not set @@ -1025,9 +1068,8 @@ CONFIG_TR=y # CONFIG_WLAN_PRE80211 is not set CONFIG_WLAN_80211=y # CONFIG_PCMCIA_RAYCS is not set -# CONFIG_IPW2100 is not set -# CONFIG_IPW2200 is not set # CONFIG_LIBERTAS is not set +# CONFIG_LIBERTAS_THINFIRM is not set # CONFIG_AIRO is not set # CONFIG_HERMES is not set # CONFIG_ATMEL is not set @@ -1044,6 +1086,8 @@ CONFIG_WLAN_80211=y CONFIG_ATH5K=y # CONFIG_ATH5K_DEBUG is not set # CONFIG_ATH9K is not set +# CONFIG_IPW2100 is not set +# CONFIG_IPW2200 is not set # CONFIG_IWLCORE is not set # CONFIG_IWLWIFI_LEDS is not set # CONFIG_IWLAGN is not set @@ -1055,6 +1099,10 @@ CONFIG_ATH5K=y # CONFIG_RT2X00 is not set # +# Enable WiMAX (Networking options) to see the WiMAX drivers +# + +# # USB Network Adapters # # CONFIG_USB_CATC is not set @@ -1062,6 +1110,7 @@ CONFIG_ATH5K=y # CONFIG_USB_PEGASUS is not set # CONFIG_USB_RTL8150 is not set # CONFIG_USB_USBNET is not set +# CONFIG_USB_HSO is not set CONFIG_NET_PCMCIA=y # CONFIG_PCMCIA_3C589 is not set # CONFIG_PCMCIA_3C574 is not set @@ -1123,6 +1172,7 @@ CONFIG_MOUSE_PS2_LOGIPS2PP=y CONFIG_MOUSE_PS2_SYNAPTICS=y CONFIG_MOUSE_PS2_LIFEBOOK=y CONFIG_MOUSE_PS2_TRACKPOINT=y +# CONFIG_MOUSE_PS2_ELANTECH is not set # CONFIG_MOUSE_PS2_TOUCHKIT is not set # CONFIG_MOUSE_SERIAL is not set # CONFIG_MOUSE_APPLETOUCH is not set @@ -1160,15 +1210,16 @@ CONFIG_INPUT_TOUCHSCREEN=y # CONFIG_TOUCHSCREEN_FUJITSU is not set # CONFIG_TOUCHSCREEN_GUNZE is not set # CONFIG_TOUCHSCREEN_ELO is not set +# CONFIG_TOUCHSCREEN_WACOM_W8001 is not set # CONFIG_TOUCHSCREEN_MTOUCH is not set # CONFIG_TOUCHSCREEN_INEXIO is not set # CONFIG_TOUCHSCREEN_MK712 is not set # CONFIG_TOUCHSCREEN_PENMOUNT is not set # CONFIG_TOUCHSCREEN_TOUCHRIGHT is not set # CONFIG_TOUCHSCREEN_TOUCHWIN is not set -# CONFIG_TOUCHSCREEN_UCB1400 is not set # CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set # CONFIG_TOUCHSCREEN_TOUCHIT213 is not set +# CONFIG_TOUCHSCREEN_TSC2007 is not set CONFIG_INPUT_MISC=y # CONFIG_INPUT_PCSPKR is not set # CONFIG_INPUT_APANEL is not set @@ -1179,6 +1230,7 @@ CONFIG_INPUT_MISC=y # CONFIG_INPUT_KEYSPAN_REMOTE is not set # CONFIG_INPUT_POWERMATE is not set # CONFIG_INPUT_YEALINK is not set +# CONFIG_INPUT_CM109 is not set # CONFIG_INPUT_UINPUT is not set # @@ -1245,6 +1297,7 @@ CONFIG_SERIAL_CORE=y CONFIG_SERIAL_CORE_CONSOLE=y # CONFIG_SERIAL_JSM is not set CONFIG_UNIX98_PTYS=y +# CONFIG_DEVPTS_MULTIPLE_INSTANCES is not set # CONFIG_LEGACY_PTYS is not set # CONFIG_IPMI_HANDLER is not set CONFIG_HW_RANDOM=y @@ -1279,6 +1332,7 @@ CONFIG_I2C=y CONFIG_I2C_BOARDINFO=y # CONFIG_I2C_CHARDEV is not set CONFIG_I2C_HELPER_AUTO=y +CONFIG_I2C_ALGOBIT=y # # I2C Hardware Bus support @@ -1331,8 +1385,6 @@ CONFIG_I2C_I801=y # Miscellaneous I2C Chip support # # CONFIG_DS1682 is not set -# CONFIG_EEPROM_AT24 is not set -# CONFIG_EEPROM_LEGACY is not set # CONFIG_SENSORS_PCF8574 is not set # CONFIG_PCF8575 is not set # CONFIG_SENSORS_PCA9539 is not set @@ -1351,8 +1403,78 @@ CONFIG_POWER_SUPPLY=y # CONFIG_POWER_SUPPLY_DEBUG is not set # CONFIG_PDA_POWER is not set # CONFIG_BATTERY_DS2760 is not set -# CONFIG_HWMON is not set +# CONFIG_BATTERY_BQ27x00 is not set +CONFIG_HWMON=y +# CONFIG_HWMON_VID is not set +# CONFIG_SENSORS_ABITUGURU is not set +# CONFIG_SENSORS_ABITUGURU3 is not set +# CONFIG_SENSORS_AD7414 is not set +# CONFIG_SENSORS_AD7418 is not set +# CONFIG_SENSORS_ADM1021 is not set +# CONFIG_SENSORS_ADM1025 is not set +# CONFIG_SENSORS_ADM1026 is not set +# CONFIG_SENSORS_ADM1029 is not set +# CONFIG_SENSORS_ADM1031 is not set +# CONFIG_SENSORS_ADM9240 is not set +# CONFIG_SENSORS_ADT7462 is not set +# CONFIG_SENSORS_ADT7470 is not set +# CONFIG_SENSORS_ADT7473 is not set +# CONFIG_SENSORS_ADT7475 is not set +# CONFIG_SENSORS_K8TEMP is not set +# CONFIG_SENSORS_ASB100 is not set +# CONFIG_SENSORS_ATXP1 is not set +# CONFIG_SENSORS_DS1621 is not set +# CONFIG_SENSORS_I5K_AMB is not set +# CONFIG_SENSORS_F71805F is not set +# CONFIG_SENSORS_F71882FG is not set +# CONFIG_SENSORS_F75375S is not set +# CONFIG_SENSORS_FSCHER is not set +# CONFIG_SENSORS_FSCPOS is not set +# CONFIG_SENSORS_FSCHMD is not set +# CONFIG_SENSORS_GL518SM is not set +# CONFIG_SENSORS_GL520SM is not set +# CONFIG_SENSORS_CORETEMP is not set +# CONFIG_SENSORS_IT87 is not set +# CONFIG_SENSORS_LM63 is not set +# CONFIG_SENSORS_LM75 is not set +# CONFIG_SENSORS_LM77 is not set +# CONFIG_SENSORS_LM78 is not set +# CONFIG_SENSORS_LM80 is not set +# CONFIG_SENSORS_LM83 is not set +# CONFIG_SENSORS_LM85 is not set +# CONFIG_SENSORS_LM87 is not set +# CONFIG_SENSORS_LM90 is not set +# CONFIG_SENSORS_LM92 is not set +# CONFIG_SENSORS_LM93 is not set +# CONFIG_SENSORS_LTC4245 is not set +# CONFIG_SENSORS_MAX1619 is not set +# CONFIG_SENSORS_MAX6650 is not set +# CONFIG_SENSORS_PC87360 is not set +# CONFIG_SENSORS_PC87427 is not set +# CONFIG_SENSORS_SIS5595 is not set +# CONFIG_SENSORS_DME1737 is not set +# CONFIG_SENSORS_SMSC47M1 is not set +# CONFIG_SENSORS_SMSC47M192 is not set +# CONFIG_SENSORS_SMSC47B397 is not set +# CONFIG_SENSORS_ADS7828 is not set +# CONFIG_SENSORS_THMC50 is not set +# CONFIG_SENSORS_VIA686A is not set +# CONFIG_SENSORS_VT1211 is not set +# CONFIG_SENSORS_VT8231 is not set +# CONFIG_SENSORS_W83781D is not set +# CONFIG_SENSORS_W83791D is not set +# CONFIG_SENSORS_W83792D is not set +# CONFIG_SENSORS_W83793 is not set +# CONFIG_SENSORS_W83L785TS is not set +# CONFIG_SENSORS_W83L786NG is not set +# CONFIG_SENSORS_W83627HF is not set +# CONFIG_SENSORS_W83627EHF is not set +# CONFIG_SENSORS_HDAPS is not set +# CONFIG_SENSORS_LIS3LV02D is not set +# CONFIG_SENSORS_APPLESMC is not set +# CONFIG_HWMON_DEBUG_CHIP is not set CONFIG_THERMAL=y +# CONFIG_THERMAL_HWMON is not set CONFIG_WATCHDOG=y # CONFIG_WATCHDOG_NOWAYOUT is not set @@ -1372,6 +1494,7 @@ CONFIG_WATCHDOG=y # CONFIG_I6300ESB_WDT is not set # CONFIG_ITCO_WDT is not set # CONFIG_IT8712F_WDT is not set +# CONFIG_IT87_WDT is not set # CONFIG_HP_WATCHDOG is not set # CONFIG_SC1200_WDT is not set # CONFIG_PC87413_WDT is not set @@ -1379,9 +1502,11 @@ CONFIG_WATCHDOG=y # CONFIG_SBC8360_WDT is not set # CONFIG_SBC7240_WDT is not set # CONFIG_CPU5_WDT is not set +# CONFIG_SMSC_SCH311X_WDT is not set # CONFIG_SMSC37B787_WDT is not set # CONFIG_W83627HF_WDT is not set # CONFIG_W83697HF_WDT is not set +# CONFIG_W83697UG_WDT is not set # CONFIG_W83877F_WDT is not set # CONFIG_W83977F_WDT is not set # CONFIG_MACHZ_WDT is not set @@ -1397,11 +1522,11 @@ CONFIG_WATCHDOG=y # USB-based Watchdog Cards # # CONFIG_USBPCWATCHDOG is not set +CONFIG_SSB_POSSIBLE=y # # Sonics Silicon Backplane # -CONFIG_SSB_POSSIBLE=y # CONFIG_SSB is not set # @@ -1410,7 +1535,13 @@ CONFIG_SSB_POSSIBLE=y # CONFIG_MFD_CORE is not set # CONFIG_MFD_SM501 is not set # CONFIG_HTC_PASIC3 is not set +# CONFIG_TWL4030_CORE is not set # CONFIG_MFD_TMIO is not set +# CONFIG_PMIC_DA903X is not set +# CONFIG_MFD_WM8400 is not set +# CONFIG_MFD_WM8350_I2C is not set +# CONFIG_MFD_PCF50633 is not set +# CONFIG_REGULATOR is not set # # Multimedia devices @@ -1450,6 +1581,7 @@ CONFIG_DRM=y # CONFIG_DRM_I810 is not set # CONFIG_DRM_I830 is not set CONFIG_DRM_I915=y +# CONFIG_DRM_I915_KMS is not set # CONFIG_DRM_MGA is not set # CONFIG_DRM_SIS is not set # CONFIG_DRM_VIA is not set @@ -1459,6 +1591,7 @@ CONFIG_DRM_I915=y CONFIG_FB=y # CONFIG_FIRMWARE_EDID is not set # CONFIG_FB_DDC is not set +# CONFIG_FB_BOOT_VESA_SUPPORT is not set CONFIG_FB_CFB_FILLRECT=y CONFIG_FB_CFB_COPYAREA=y CONFIG_FB_CFB_IMAGEBLIT=y @@ -1487,7 +1620,6 @@ CONFIG_FB_TILEBLITTING=y # CONFIG_FB_UVESA is not set # CONFIG_FB_VESA is not set CONFIG_FB_EFI=y -# CONFIG_FB_IMAC is not set # CONFIG_FB_N411 is not set # CONFIG_FB_HGA is not set # CONFIG_FB_S1D13XXX is not set @@ -1503,6 +1635,7 @@ CONFIG_FB_EFI=y # CONFIG_FB_S3 is not set # CONFIG_FB_SAVAGE is not set # CONFIG_FB_SIS is not set +# CONFIG_FB_VIA is not set # CONFIG_FB_NEOMAGIC is not set # CONFIG_FB_KYRO is not set # CONFIG_FB_3DFX is not set @@ -1515,12 +1648,15 @@ CONFIG_FB_EFI=y # CONFIG_FB_CARMINE is not set # CONFIG_FB_GEODE is not set # CONFIG_FB_VIRTUAL is not set +# CONFIG_FB_METRONOME is not set +# CONFIG_FB_MB862XX is not set CONFIG_BACKLIGHT_LCD_SUPPORT=y # CONFIG_LCD_CLASS_DEVICE is not set CONFIG_BACKLIGHT_CLASS_DEVICE=y -# CONFIG_BACKLIGHT_CORGI is not set +CONFIG_BACKLIGHT_GENERIC=y # CONFIG_BACKLIGHT_PROGEAR is not set # CONFIG_BACKLIGHT_MBP_NVIDIA is not set +# CONFIG_BACKLIGHT_SAHARA is not set # # Display device support @@ -1540,10 +1676,12 @@ CONFIG_LOGO=y # CONFIG_LOGO_LINUX_VGA16 is not set CONFIG_LOGO_LINUX_CLUT224=y CONFIG_SOUND=y +CONFIG_SOUND_OSS_CORE=y CONFIG_SND=y CONFIG_SND_TIMER=y CONFIG_SND_PCM=y CONFIG_SND_HWDEP=y +CONFIG_SND_JACK=y CONFIG_SND_SEQUENCER=y CONFIG_SND_SEQ_DUMMY=y CONFIG_SND_OSSEMUL=y @@ -1551,6 +1689,8 @@ CONFIG_SND_MIXER_OSS=y CONFIG_SND_PCM_OSS=y CONFIG_SND_PCM_OSS_PLUGINS=y CONFIG_SND_SEQUENCER_OSS=y +CONFIG_SND_HRTIMER=y +CONFIG_SND_SEQ_HRTIMER_DEFAULT=y CONFIG_SND_DYNAMIC_MINORS=y CONFIG_SND_SUPPORT_OLD_API=y CONFIG_SND_VERBOSE_PROCFS=y @@ -1605,11 +1745,16 @@ CONFIG_SND_PCI=y # CONFIG_SND_FM801 is not set CONFIG_SND_HDA_INTEL=y CONFIG_SND_HDA_HWDEP=y +# CONFIG_SND_HDA_RECONFIG is not set +# CONFIG_SND_HDA_INPUT_BEEP is not set CONFIG_SND_HDA_CODEC_REALTEK=y CONFIG_SND_HDA_CODEC_ANALOG=y CONFIG_SND_HDA_CODEC_SIGMATEL=y CONFIG_SND_HDA_CODEC_VIA=y CONFIG_SND_HDA_CODEC_ATIHDMI=y +CONFIG_SND_HDA_CODEC_NVHDMI=y +CONFIG_SND_HDA_CODEC_INTELHDMI=y +CONFIG_SND_HDA_ELD=y CONFIG_SND_HDA_CODEC_CONEXANT=y CONFIG_SND_HDA_CODEC_CMEDIA=y CONFIG_SND_HDA_CODEC_SI3054=y @@ -1643,6 +1788,7 @@ CONFIG_SND_USB=y # CONFIG_SND_USB_AUDIO is not set # CONFIG_SND_USB_USX2Y is not set # CONFIG_SND_USB_CAIAQ is not set +# CONFIG_SND_USB_US122L is not set CONFIG_SND_PCMCIA=y # CONFIG_SND_VXPOCKET is not set # CONFIG_SND_PDAUDIOCF is not set @@ -1657,15 +1803,37 @@ CONFIG_HIDRAW=y # USB Input Devices # CONFIG_USB_HID=y -CONFIG_USB_HIDINPUT_POWERBOOK=y -CONFIG_HID_FF=y CONFIG_HID_PID=y +CONFIG_USB_HIDDEV=y + +# +# Special HID drivers +# +CONFIG_HID_COMPAT=y +CONFIG_HID_A4TECH=y +CONFIG_HID_APPLE=y +CONFIG_HID_BELKIN=y +CONFIG_HID_CHERRY=y +CONFIG_HID_CHICONY=y +CONFIG_HID_CYPRESS=y +CONFIG_HID_EZKEY=y +CONFIG_HID_GYRATION=y +CONFIG_HID_LOGITECH=y CONFIG_LOGITECH_FF=y # CONFIG_LOGIRUMBLEPAD2_FF is not set +CONFIG_HID_MICROSOFT=y +CONFIG_HID_MONTEREY=y +CONFIG_HID_NTRIG=y +CONFIG_HID_PANTHERLORD=y CONFIG_PANTHERLORD_FF=y +CONFIG_HID_PETALYNX=y +CONFIG_HID_SAMSUNG=y +CONFIG_HID_SONY=y +CONFIG_HID_SUNPLUS=y +# CONFIG_GREENASIA_FF is not set +CONFIG_HID_TOPSEED=y CONFIG_THRUSTMASTER_FF=y CONFIG_ZEROPLUS_FF=y -CONFIG_USB_HIDDEV=y CONFIG_USB_SUPPORT=y CONFIG_USB_ARCH_HAS_HCD=y CONFIG_USB_ARCH_HAS_OHCI=y @@ -1683,6 +1851,8 @@ CONFIG_USB_DEVICEFS=y CONFIG_USB_SUSPEND=y # CONFIG_USB_OTG is not set CONFIG_USB_MON=y +# CONFIG_USB_WUSB is not set +# CONFIG_USB_WUSB_CBAF is not set # # USB Host Controller Drivers @@ -1691,6 +1861,7 @@ CONFIG_USB_MON=y CONFIG_USB_EHCI_HCD=y # CONFIG_USB_EHCI_ROOT_HUB_TT is not set # CONFIG_USB_EHCI_TT_NEWSCHED is not set +# CONFIG_USB_OXU210HP_HCD is not set # CONFIG_USB_ISP116X_HCD is not set # CONFIG_USB_ISP1760_HCD is not set CONFIG_USB_OHCI_HCD=y @@ -1700,6 +1871,8 @@ CONFIG_USB_OHCI_LITTLE_ENDIAN=y CONFIG_USB_UHCI_HCD=y # CONFIG_USB_SL811_HCD is not set # CONFIG_USB_R8A66597_HCD is not set +# CONFIG_USB_WHCI_HCD is not set +# CONFIG_USB_HWA_HCD is not set # # USB Device Class drivers @@ -1707,20 +1880,20 @@ CONFIG_USB_UHCI_HCD=y # CONFIG_USB_ACM is not set CONFIG_USB_PRINTER=y # CONFIG_USB_WDM is not set +# CONFIG_USB_TMC is not set # -# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support' +# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may also be needed; # # -# may also be needed; see USB_STORAGE Help for more information +# see USB_STORAGE Help for more information # CONFIG_USB_STORAGE=y # CONFIG_USB_STORAGE_DEBUG is not set # CONFIG_USB_STORAGE_DATAFAB is not set # CONFIG_USB_STORAGE_FREECOM is not set # CONFIG_USB_STORAGE_ISD200 is not set -# CONFIG_USB_STORAGE_DPCM is not set # CONFIG_USB_STORAGE_USBAT is not set # CONFIG_USB_STORAGE_SDDR09 is not set # CONFIG_USB_STORAGE_SDDR55 is not set @@ -1728,7 +1901,6 @@ CONFIG_USB_STORAGE=y # CONFIG_USB_STORAGE_ALAUDA is not set # CONFIG_USB_STORAGE_ONETOUCH is not set # CONFIG_USB_STORAGE_KARMA is not set -# CONFIG_USB_STORAGE_SIERRA is not set # CONFIG_USB_STORAGE_CYPRESS_ATACB is not set CONFIG_USB_LIBUSUAL=y @@ -1749,6 +1921,7 @@ CONFIG_USB_LIBUSUAL=y # CONFIG_USB_EMI62 is not set # CONFIG_USB_EMI26 is not set # CONFIG_USB_ADUTUX is not set +# CONFIG_USB_SEVSEG is not set # CONFIG_USB_RIO500 is not set # CONFIG_USB_LEGOTOWER is not set # CONFIG_USB_LCD is not set @@ -1766,7 +1939,13 @@ CONFIG_USB_LIBUSUAL=y # CONFIG_USB_IOWARRIOR is not set # CONFIG_USB_TEST is not set # CONFIG_USB_ISIGHTFW is not set +# CONFIG_USB_VST is not set # CONFIG_USB_GADGET is not set + +# +# OTG and related infrastructure +# +# CONFIG_UWB is not set # CONFIG_MMC is not set # CONFIG_MEMSTICK is not set CONFIG_NEW_LEDS=y @@ -1775,6 +1954,7 @@ CONFIG_LEDS_CLASS=y # # LED drivers # +# CONFIG_LEDS_ALIX2 is not set # CONFIG_LEDS_PCA9532 is not set # CONFIG_LEDS_CLEVO_MAIL is not set # CONFIG_LEDS_PCA955X is not set @@ -1785,6 +1965,7 @@ CONFIG_LEDS_CLASS=y CONFIG_LEDS_TRIGGERS=y # CONFIG_LEDS_TRIGGER_TIMER is not set # CONFIG_LEDS_TRIGGER_HEARTBEAT is not set +# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set # CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set # CONFIG_ACCESSIBILITY is not set # CONFIG_INFINIBAND is not set @@ -1824,6 +2005,7 @@ CONFIG_RTC_INTF_DEV=y # CONFIG_RTC_DRV_M41T80 is not set # CONFIG_RTC_DRV_S35390A is not set # CONFIG_RTC_DRV_FM3130 is not set +# CONFIG_RTC_DRV_RX8581 is not set # # SPI RTC drivers @@ -1833,12 +2015,15 @@ CONFIG_RTC_INTF_DEV=y # Platform RTC drivers # CONFIG_RTC_DRV_CMOS=y +# CONFIG_RTC_DRV_DS1286 is not set # CONFIG_RTC_DRV_DS1511 is not set # CONFIG_RTC_DRV_DS1553 is not set # CONFIG_RTC_DRV_DS1742 is not set # CONFIG_RTC_DRV_STK17TA8 is not set # CONFIG_RTC_DRV_M48T86 is not set +# CONFIG_RTC_DRV_M48T35 is not set # CONFIG_RTC_DRV_M48T59 is not set +# CONFIG_RTC_DRV_BQ4802 is not set # CONFIG_RTC_DRV_V3020 is not set # @@ -1851,6 +2036,22 @@ CONFIG_DMADEVICES=y # # CONFIG_INTEL_IOATDMA is not set # CONFIG_UIO is not set +# CONFIG_STAGING is not set +CONFIG_X86_PLATFORM_DEVICES=y +# CONFIG_ACER_WMI is not set +# CONFIG_ASUS_LAPTOP is not set +# CONFIG_FUJITSU_LAPTOP is not set +# CONFIG_TC1100_WMI is not set +# CONFIG_MSI_LAPTOP is not set +# CONFIG_PANASONIC_LAPTOP is not set +# CONFIG_COMPAL_LAPTOP is not set +# CONFIG_SONY_LAPTOP is not set +# CONFIG_THINKPAD_ACPI is not set +# CONFIG_INTEL_MENLOW is not set +CONFIG_EEEPC_LAPTOP=y +# CONFIG_ACPI_WMI is not set +# CONFIG_ACPI_ASUS is not set +# CONFIG_ACPI_TOSHIBA is not set # # Firmware Drivers @@ -1861,8 +2062,7 @@ CONFIG_EFI_VARS=y # CONFIG_DELL_RBU is not set # CONFIG_DCDBAS is not set CONFIG_DMIID=y -CONFIG_ISCSI_IBFT_FIND=y -CONFIG_ISCSI_IBFT=y +# CONFIG_ISCSI_IBFT_FIND is not set # # File systems @@ -1872,21 +2072,24 @@ CONFIG_EXT3_FS=y CONFIG_EXT3_FS_XATTR=y CONFIG_EXT3_FS_POSIX_ACL=y CONFIG_EXT3_FS_SECURITY=y -# CONFIG_EXT4DEV_FS is not set +# CONFIG_EXT4_FS is not set CONFIG_JBD=y # CONFIG_JBD_DEBUG is not set CONFIG_FS_MBCACHE=y # CONFIG_REISERFS_FS is not set # CONFIG_JFS_FS is not set CONFIG_FS_POSIX_ACL=y +CONFIG_FILE_LOCKING=y # CONFIG_XFS_FS is not set # CONFIG_OCFS2_FS is not set +# CONFIG_BTRFS_FS is not set CONFIG_DNOTIFY=y CONFIG_INOTIFY=y CONFIG_INOTIFY_USER=y CONFIG_QUOTA=y CONFIG_QUOTA_NETLINK_INTERFACE=y # CONFIG_PRINT_QUOTA_WARNING is not set +CONFIG_QUOTA_TREE=y # CONFIG_QFMT_V1 is not set CONFIG_QFMT_V2=y CONFIG_QUOTACTL=y @@ -1920,16 +2123,14 @@ CONFIG_PROC_FS=y CONFIG_PROC_KCORE=y CONFIG_PROC_VMCORE=y CONFIG_PROC_SYSCTL=y +CONFIG_PROC_PAGE_MONITOR=y CONFIG_SYSFS=y CONFIG_TMPFS=y CONFIG_TMPFS_POSIX_ACL=y CONFIG_HUGETLBFS=y CONFIG_HUGETLB_PAGE=y # CONFIG_CONFIGFS_FS is not set - -# -# Miscellaneous filesystems -# +CONFIG_MISC_FILESYSTEMS=y # CONFIG_ADFS_FS is not set # CONFIG_AFFS_FS is not set # CONFIG_ECRYPT_FS is not set @@ -1939,6 +2140,7 @@ CONFIG_HUGETLB_PAGE=y # CONFIG_BFS_FS is not set # CONFIG_EFS_FS is not set # CONFIG_CRAMFS is not set +# CONFIG_SQUASHFS is not set # CONFIG_VXFS_FS is not set # CONFIG_MINIX_FS is not set # CONFIG_OMFS_FS is not set @@ -1960,6 +2162,7 @@ CONFIG_NFS_ACL_SUPPORT=y CONFIG_NFS_COMMON=y CONFIG_SUNRPC=y CONFIG_SUNRPC_GSS=y +# CONFIG_SUNRPC_REGISTER_V4 is not set CONFIG_RPCSEC_GSS_KRB5=y # CONFIG_RPCSEC_GSS_SPKM3 is not set # CONFIG_SMB_FS is not set @@ -2036,7 +2239,7 @@ CONFIG_NLS_UTF8=y # CONFIG_TRACE_IRQFLAGS_SUPPORT=y CONFIG_PRINTK_TIME=y -CONFIG_ENABLE_WARN_DEPRECATED=y +# CONFIG_ENABLE_WARN_DEPRECATED is not set CONFIG_ENABLE_MUST_CHECK=y CONFIG_FRAME_WARN=2048 CONFIG_MAGIC_SYSRQ=y @@ -2066,33 +2269,54 @@ CONFIG_TIMER_STATS=y CONFIG_DEBUG_BUGVERBOSE=y # CONFIG_DEBUG_INFO is not set # CONFIG_DEBUG_VM is not set +# CONFIG_DEBUG_VIRTUAL is not set # CONFIG_DEBUG_WRITECOUNT is not set CONFIG_DEBUG_MEMORY_INIT=y # CONFIG_DEBUG_LIST is not set # CONFIG_DEBUG_SG is not set +# CONFIG_DEBUG_NOTIFIERS is not set +CONFIG_ARCH_WANT_FRAME_POINTERS=y CONFIG_FRAME_POINTER=y # CONFIG_BOOT_PRINTK_DELAY is not set # CONFIG_RCU_TORTURE_TEST is not set +# CONFIG_RCU_CPU_STALL_DETECTOR is not set # CONFIG_KPROBES_SANITY_TEST is not set # CONFIG_BACKTRACE_SELF_TEST is not set +# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set # CONFIG_LKDTM is not set # CONFIG_FAULT_INJECTION is not set # CONFIG_LATENCYTOP is not set CONFIG_SYSCTL_SYSCALL_CHECK=y -CONFIG_HAVE_FTRACE=y +CONFIG_USER_STACKTRACE_SUPPORT=y +CONFIG_HAVE_FUNCTION_TRACER=y +CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y +CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y CONFIG_HAVE_DYNAMIC_FTRACE=y -# CONFIG_FTRACE is not set +CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y +CONFIG_HAVE_HW_BRANCH_TRACER=y + +# +# Tracers +# +# CONFIG_FUNCTION_TRACER is not set # CONFIG_IRQSOFF_TRACER is not set # CONFIG_SYSPROF_TRACER is not set # CONFIG_SCHED_TRACER is not set # CONFIG_CONTEXT_SWITCH_TRACER is not set +# CONFIG_BOOT_TRACER is not set +# CONFIG_TRACE_BRANCH_PROFILING is not set +# CONFIG_POWER_TRACER is not set +# CONFIG_STACK_TRACER is not set +# CONFIG_HW_BRANCH_TRACER is not set CONFIG_PROVIDE_OHCI1394_DMA_INIT=y +# CONFIG_DYNAMIC_PRINTK_DEBUG is not set # CONFIG_SAMPLES is not set CONFIG_HAVE_ARCH_KGDB=y # CONFIG_KGDB is not set # CONFIG_STRICT_DEVMEM is not set CONFIG_X86_VERBOSE_BOOTUP=y CONFIG_EARLY_PRINTK=y +CONFIG_EARLY_PRINTK_DBGP=y CONFIG_DEBUG_STACKOVERFLOW=y CONFIG_DEBUG_STACK_USAGE=y # CONFIG_DEBUG_PAGEALLOC is not set @@ -2123,8 +2347,10 @@ CONFIG_OPTIMIZE_INLINING=y CONFIG_KEYS=y CONFIG_KEYS_DEBUG_PROC_KEYS=y CONFIG_SECURITY=y +# CONFIG_SECURITYFS is not set CONFIG_SECURITY_NETWORK=y # CONFIG_SECURITY_NETWORK_XFRM is not set +# CONFIG_SECURITY_PATH is not set CONFIG_SECURITY_FILE_CAPABILITIES=y # CONFIG_SECURITY_ROOTPLUG is not set CONFIG_SECURITY_DEFAULT_MMAP_MIN_ADDR=65536 @@ -2135,7 +2361,6 @@ CONFIG_SECURITY_SELINUX_DISABLE=y CONFIG_SECURITY_SELINUX_DEVELOP=y CONFIG_SECURITY_SELINUX_AVC_STATS=y CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1 -# CONFIG_SECURITY_SELINUX_ENABLE_SECMARK_DEFAULT is not set # CONFIG_SECURITY_SELINUX_POLICYDB_VERSION_MAX is not set # CONFIG_SECURITY_SMACK is not set CONFIG_CRYPTO=y @@ -2143,11 +2368,18 @@ CONFIG_CRYPTO=y # # Crypto core or helper # +# CONFIG_CRYPTO_FIPS is not set CONFIG_CRYPTO_ALGAPI=y +CONFIG_CRYPTO_ALGAPI2=y CONFIG_CRYPTO_AEAD=y +CONFIG_CRYPTO_AEAD2=y CONFIG_CRYPTO_BLKCIPHER=y +CONFIG_CRYPTO_BLKCIPHER2=y CONFIG_CRYPTO_HASH=y +CONFIG_CRYPTO_HASH2=y +CONFIG_CRYPTO_RNG2=y CONFIG_CRYPTO_MANAGER=y +CONFIG_CRYPTO_MANAGER2=y # CONFIG_CRYPTO_GF128MUL is not set # CONFIG_CRYPTO_NULL is not set # CONFIG_CRYPTO_CRYPTD is not set @@ -2182,6 +2414,7 @@ CONFIG_CRYPTO_HMAC=y # Digest # # CONFIG_CRYPTO_CRC32C is not set +# CONFIG_CRYPTO_CRC32C_INTEL is not set # CONFIG_CRYPTO_MD4 is not set CONFIG_CRYPTO_MD5=y # CONFIG_CRYPTO_MICHAEL_MIC is not set @@ -2222,6 +2455,11 @@ CONFIG_CRYPTO_DES=y # # CONFIG_CRYPTO_DEFLATE is not set # CONFIG_CRYPTO_LZO is not set + +# +# Random Number Generation +# +# CONFIG_CRYPTO_ANSI_CPRNG is not set CONFIG_CRYPTO_HW=y # CONFIG_CRYPTO_DEV_PADLOCK is not set # CONFIG_CRYPTO_DEV_GEODE is not set @@ -2239,6 +2477,7 @@ CONFIG_VIRTUALIZATION=y CONFIG_BITREVERSE=y CONFIG_GENERIC_FIND_FIRST_BIT=y CONFIG_GENERIC_FIND_NEXT_BIT=y +CONFIG_GENERIC_FIND_LAST_BIT=y # CONFIG_CRC_CCITT is not set # CONFIG_CRC16 is not set CONFIG_CRC_T10DIF=y Index: linux-2.6-tip/arch/x86/configs/x86_64_defconfig =================================================================== --- linux-2.6-tip.orig/arch/x86/configs/x86_64_defconfig +++ linux-2.6-tip/arch/x86/configs/x86_64_defconfig @@ -1,14 +1,13 @@ # # Automatically generated make config: don't edit -# Linux kernel version: 2.6.27-rc5 -# Wed Sep 3 17:13:39 2008 +# Linux kernel version: 2.6.29-rc4 +# Thu Feb 12 12:57:29 2009 # CONFIG_64BIT=y # CONFIG_X86_32 is not set CONFIG_X86_64=y CONFIG_X86=y CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig" -# CONFIG_GENERIC_LOCKBREAK is not set CONFIG_GENERIC_TIME=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_CLOCKSOURCE_WATCHDOG=y @@ -23,17 +22,16 @@ CONFIG_ZONE_DMA=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y +CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y CONFIG_GENERIC_HWEIGHT=y -# CONFIG_GENERIC_GPIO is not set CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_RWSEM_GENERIC_SPINLOCK=y # CONFIG_RWSEM_XCHGADD_ALGORITHM is not set -# CONFIG_ARCH_HAS_ILOG2_U32 is not set -# CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_ARCH_HAS_CPU_RELAX=y +CONFIG_ARCH_HAS_DEFAULT_IDLE=y CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y @@ -42,12 +40,12 @@ CONFIG_ARCH_SUSPEND_POSSIBLE=y CONFIG_ZONE_DMA32=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_AUDIT_ARCH=y -CONFIG_ARCH_SUPPORTS_AOUT=y CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_X86_SMP=y +CONFIG_USE_GENERIC_SMP_HELPERS=y CONFIG_X86_64_SMP=y CONFIG_X86_HT=y CONFIG_X86_BIOS_REBOOT=y @@ -76,30 +74,44 @@ CONFIG_TASK_IO_ACCOUNTING=y CONFIG_AUDIT=y CONFIG_AUDITSYSCALL=y CONFIG_AUDIT_TREE=y + +# +# RCU Subsystem +# +# CONFIG_CLASSIC_RCU is not set +CONFIG_TREE_RCU=y +# CONFIG_PREEMPT_RCU is not set +# CONFIG_RCU_TRACE is not set +CONFIG_RCU_FANOUT=64 +# CONFIG_RCU_FANOUT_EXACT is not set +# CONFIG_TREE_RCU_TRACE is not set +# CONFIG_PREEMPT_RCU_TRACE is not set # CONFIG_IKCONFIG is not set CONFIG_LOG_BUF_SHIFT=18 -CONFIG_CGROUPS=y -# CONFIG_CGROUP_DEBUG is not set -CONFIG_CGROUP_NS=y -# CONFIG_CGROUP_DEVICE is not set -CONFIG_CPUSETS=y CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y CONFIG_GROUP_SCHED=y CONFIG_FAIR_GROUP_SCHED=y # CONFIG_RT_GROUP_SCHED is not set # CONFIG_USER_SCHED is not set CONFIG_CGROUP_SCHED=y +CONFIG_CGROUPS=y +# CONFIG_CGROUP_DEBUG is not set +CONFIG_CGROUP_NS=y +CONFIG_CGROUP_FREEZER=y +# CONFIG_CGROUP_DEVICE is not set +CONFIG_CPUSETS=y +CONFIG_PROC_PID_CPUSET=y CONFIG_CGROUP_CPUACCT=y CONFIG_RESOURCE_COUNTERS=y # CONFIG_CGROUP_MEM_RES_CTLR is not set # CONFIG_SYSFS_DEPRECATED_V2 is not set -CONFIG_PROC_PID_CPUSET=y CONFIG_RELAY=y CONFIG_NAMESPACES=y CONFIG_UTS_NS=y CONFIG_IPC_NS=y CONFIG_USER_NS=y CONFIG_PID_NS=y +CONFIG_NET_NS=y CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE="" CONFIG_CC_OPTIMIZE_FOR_SIZE=y @@ -124,12 +136,15 @@ CONFIG_SIGNALFD=y CONFIG_TIMERFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y +CONFIG_AIO=y CONFIG_VM_EVENT_COUNTERS=y +CONFIG_PCI_QUIRKS=y CONFIG_SLUB_DEBUG=y # CONFIG_SLAB is not set CONFIG_SLUB=y # CONFIG_SLOB is not set CONFIG_PROFILING=y +CONFIG_TRACEPOINTS=y CONFIG_MARKERS=y # CONFIG_OPROFILE is not set CONFIG_HAVE_OPROFILE=y @@ -139,15 +154,10 @@ CONFIG_KRETPROBES=y CONFIG_HAVE_IOREMAP_PROT=y CONFIG_HAVE_KPROBES=y CONFIG_HAVE_KRETPROBES=y -# CONFIG_HAVE_ARCH_TRACEHOOK is not set -# CONFIG_HAVE_DMA_ATTRS is not set -CONFIG_USE_GENERIC_SMP_HELPERS=y -# CONFIG_HAVE_CLK is not set -CONFIG_PROC_PAGE_MONITOR=y +CONFIG_HAVE_ARCH_TRACEHOOK=y # CONFIG_HAVE_GENERIC_DMA_COHERENT is not set CONFIG_SLABINFO=y CONFIG_RT_MUTEXES=y -# CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 CONFIG_MODULES=y # CONFIG_MODULE_FORCE_LOAD is not set @@ -155,7 +165,6 @@ CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y # CONFIG_MODVERSIONS is not set # CONFIG_MODULE_SRCVERSION_ALL is not set -CONFIG_KMOD=y CONFIG_STOP_MACHINE=y CONFIG_BLOCK=y CONFIG_BLK_DEV_IO_TRACE=y @@ -175,7 +184,7 @@ CONFIG_IOSCHED_CFQ=y CONFIG_DEFAULT_CFQ=y # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="cfq" -CONFIG_CLASSIC_RCU=y +CONFIG_FREEZER=y # # Processor type and features @@ -185,13 +194,14 @@ CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y CONFIG_GENERIC_CLOCKEVENTS_BUILD=y CONFIG_SMP=y +CONFIG_SPARSE_IRQ=y +# CONFIG_NUMA_MIGRATE_IRQ_DESC is not set CONFIG_X86_FIND_SMP_CONFIG=y CONFIG_X86_MPPARSE=y -CONFIG_X86_PC=y # CONFIG_X86_ELAN is not set -# CONFIG_X86_VOYAGER is not set # CONFIG_X86_GENERICARCH is not set # CONFIG_X86_VSMP is not set +CONFIG_SCHED_OMIT_FRAME_POINTER=y # CONFIG_PARAVIRT_GUEST is not set # CONFIG_MEMTEST is not set # CONFIG_M386 is not set @@ -230,6 +240,11 @@ CONFIG_X86_CMPXCHG64=y CONFIG_X86_CMOV=y CONFIG_X86_MINIMUM_CPU_FAMILY=64 CONFIG_X86_DEBUGCTLMSR=y +CONFIG_CPU_SUP_INTEL=y +CONFIG_CPU_SUP_AMD=y +CONFIG_CPU_SUP_CENTAUR_64=y +CONFIG_X86_DS=y +CONFIG_X86_PTRACE_BTS=y CONFIG_HPET_TIMER=y CONFIG_HPET_EMULATE_RTC=y CONFIG_DMI=y @@ -237,8 +252,11 @@ CONFIG_GART_IOMMU=y CONFIG_CALGARY_IOMMU=y CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y CONFIG_AMD_IOMMU=y +CONFIG_AMD_IOMMU_STATS=y CONFIG_SWIOTLB=y CONFIG_IOMMU_HELPER=y +CONFIG_IOMMU_API=y +# CONFIG_MAXSMP is not set CONFIG_NR_CPUS=64 CONFIG_SCHED_SMT=y CONFIG_SCHED_MC=y @@ -247,12 +265,17 @@ CONFIG_PREEMPT_VOLUNTARY=y # CONFIG_PREEMPT is not set CONFIG_X86_LOCAL_APIC=y CONFIG_X86_IO_APIC=y +CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y # CONFIG_X86_MCE is not set # CONFIG_I8K is not set CONFIG_MICROCODE=y +CONFIG_MICROCODE_INTEL=y +CONFIG_MICROCODE_AMD=y CONFIG_MICROCODE_OLD_INTERFACE=y CONFIG_X86_MSR=y CONFIG_X86_CPUID=y +CONFIG_ARCH_PHYS_ADDR_T_64BIT=y +CONFIG_DIRECT_GBPAGES=y CONFIG_NUMA=y CONFIG_K8_NUMA=y CONFIG_X86_64_ACPI_NUMA=y @@ -269,7 +292,6 @@ CONFIG_SPARSEMEM_MANUAL=y CONFIG_SPARSEMEM=y CONFIG_NEED_MULTIPLE_NODES=y CONFIG_HAVE_MEMORY_PRESENT=y -# CONFIG_SPARSEMEM_STATIC is not set CONFIG_SPARSEMEM_EXTREME=y CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y CONFIG_SPARSEMEM_VMEMMAP=y @@ -280,10 +302,14 @@ CONFIG_SPARSEMEM_VMEMMAP=y CONFIG_PAGEFLAGS_EXTENDED=y CONFIG_SPLIT_PTLOCK_CPUS=4 CONFIG_MIGRATION=y -CONFIG_RESOURCES_64BIT=y +CONFIG_PHYS_ADDR_T_64BIT=y CONFIG_ZONE_DMA_FLAG=1 CONFIG_BOUNCE=y CONFIG_VIRT_TO_BUS=y +CONFIG_UNEVICTABLE_LRU=y +CONFIG_X86_CHECK_BIOS_CORRUPTION=y +CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y +CONFIG_X86_RESERVE_LOW_64K=y CONFIG_MTRR=y # CONFIG_MTRR_SANITIZER is not set CONFIG_X86_PAT=y @@ -302,11 +328,12 @@ CONFIG_PHYSICAL_START=0x1000000 CONFIG_PHYSICAL_ALIGN=0x200000 CONFIG_HOTPLUG_CPU=y # CONFIG_COMPAT_VDSO is not set +# CONFIG_CMDLINE_BOOL is not set CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y # -# Power management options +# Power management and ACPI options # CONFIG_ARCH_HIBERNATION_HEADER=y CONFIG_PM=y @@ -333,20 +360,14 @@ CONFIG_ACPI_BATTERY=y CONFIG_ACPI_BUTTON=y CONFIG_ACPI_FAN=y CONFIG_ACPI_DOCK=y -# CONFIG_ACPI_BAY is not set CONFIG_ACPI_PROCESSOR=y CONFIG_ACPI_HOTPLUG_CPU=y CONFIG_ACPI_THERMAL=y CONFIG_ACPI_NUMA=y -# CONFIG_ACPI_WMI is not set -# CONFIG_ACPI_ASUS is not set -# CONFIG_ACPI_TOSHIBA is not set # CONFIG_ACPI_CUSTOM_DSDT is not set CONFIG_ACPI_BLACKLIST_YEAR=0 # CONFIG_ACPI_DEBUG is not set -CONFIG_ACPI_EC=y # CONFIG_ACPI_PCI_SLOT is not set -CONFIG_ACPI_POWER=y CONFIG_ACPI_SYSTEM=y CONFIG_X86_PM_TIMER=y CONFIG_ACPI_CONTAINER=y @@ -381,13 +402,17 @@ CONFIG_X86_ACPI_CPUFREQ=y # # shared options # -# CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set # CONFIG_X86_SPEEDSTEP_LIB is not set CONFIG_CPU_IDLE=y CONFIG_CPU_IDLE_GOV_LADDER=y CONFIG_CPU_IDLE_GOV_MENU=y # +# Memory power savings +# +# CONFIG_I7300_IDLE is not set + +# # Bus options (PCI etc.) # CONFIG_PCI=y @@ -395,8 +420,10 @@ CONFIG_PCI_DIRECT=y CONFIG_PCI_MMCONFIG=y CONFIG_PCI_DOMAINS=y CONFIG_DMAR=y +# CONFIG_DMAR_DEFAULT_ON is not set CONFIG_DMAR_GFX_WA=y CONFIG_DMAR_FLOPPY_WA=y +# CONFIG_INTR_REMAP is not set CONFIG_PCIEPORTBUS=y # CONFIG_HOTPLUG_PCI_PCIE is not set CONFIG_PCIEAER=y @@ -405,6 +432,7 @@ CONFIG_ARCH_SUPPORTS_MSI=y CONFIG_PCI_MSI=y # CONFIG_PCI_LEGACY is not set # CONFIG_PCI_DEBUG is not set +# CONFIG_PCI_STUB is not set CONFIG_HT_IRQ=y CONFIG_ISA_DMA_API=y CONFIG_K8_NB=y @@ -438,6 +466,8 @@ CONFIG_HOTPLUG_PCI=y # CONFIG_BINFMT_ELF=y CONFIG_COMPAT_BINFMT_ELF=y +CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y +# CONFIG_HAVE_AOUT is not set CONFIG_BINFMT_MISC=y CONFIG_IA32_EMULATION=y # CONFIG_IA32_AOUT is not set @@ -449,6 +479,7 @@ CONFIG_NET=y # # Networking options # +CONFIG_COMPAT_NET_DEV_OPS=y CONFIG_PACKET=y CONFIG_PACKET_MMAP=y CONFIG_UNIX=y @@ -509,7 +540,6 @@ CONFIG_DEFAULT_CUBIC=y # CONFIG_DEFAULT_RENO is not set CONFIG_DEFAULT_TCP_CONG="cubic" CONFIG_TCP_MD5SIG=y -# CONFIG_IP_VS is not set CONFIG_IPV6=y # CONFIG_IPV6_PRIVACY is not set # CONFIG_IPV6_ROUTER_PREF is not set @@ -547,19 +577,21 @@ CONFIG_NF_CONNTRACK_IRC=y CONFIG_NF_CONNTRACK_SIP=y CONFIG_NF_CT_NETLINK=y CONFIG_NETFILTER_XTABLES=y +CONFIG_NETFILTER_XT_TARGET_CONNSECMARK=y CONFIG_NETFILTER_XT_TARGET_MARK=y CONFIG_NETFILTER_XT_TARGET_NFLOG=y CONFIG_NETFILTER_XT_TARGET_SECMARK=y -CONFIG_NETFILTER_XT_TARGET_CONNSECMARK=y CONFIG_NETFILTER_XT_TARGET_TCPMSS=y CONFIG_NETFILTER_XT_MATCH_CONNTRACK=y CONFIG_NETFILTER_XT_MATCH_MARK=y CONFIG_NETFILTER_XT_MATCH_POLICY=y CONFIG_NETFILTER_XT_MATCH_STATE=y +# CONFIG_IP_VS is not set # # IP: Netfilter Configuration # +CONFIG_NF_DEFRAG_IPV4=y CONFIG_NF_CONNTRACK_IPV4=y CONFIG_NF_CONNTRACK_PROC_COMPAT=y CONFIG_IP_NF_IPTABLES=y @@ -585,8 +617,8 @@ CONFIG_IP_NF_MANGLE=y CONFIG_NF_CONNTRACK_IPV6=y CONFIG_IP6_NF_IPTABLES=y CONFIG_IP6_NF_MATCH_IPV6HEADER=y -CONFIG_IP6_NF_FILTER=y CONFIG_IP6_NF_TARGET_LOG=y +CONFIG_IP6_NF_FILTER=y CONFIG_IP6_NF_TARGET_REJECT=y CONFIG_IP6_NF_MANGLE=y # CONFIG_IP_DCCP is not set @@ -594,6 +626,7 @@ CONFIG_IP6_NF_MANGLE=y # CONFIG_TIPC is not set # CONFIG_ATM is not set # CONFIG_BRIDGE is not set +# CONFIG_NET_DSA is not set # CONFIG_VLAN_8021Q is not set # CONFIG_DECNET is not set CONFIG_LLC=y @@ -613,6 +646,7 @@ CONFIG_NET_SCHED=y # CONFIG_NET_SCH_HTB is not set # CONFIG_NET_SCH_HFSC is not set # CONFIG_NET_SCH_PRIO is not set +# CONFIG_NET_SCH_MULTIQ is not set # CONFIG_NET_SCH_RED is not set # CONFIG_NET_SCH_SFQ is not set # CONFIG_NET_SCH_TEQL is not set @@ -620,6 +654,7 @@ CONFIG_NET_SCHED=y # CONFIG_NET_SCH_GRED is not set # CONFIG_NET_SCH_DSMARK is not set # CONFIG_NET_SCH_NETEM is not set +# CONFIG_NET_SCH_DRR is not set # CONFIG_NET_SCH_INGRESS is not set # @@ -634,6 +669,7 @@ CONFIG_NET_CLS=y # CONFIG_NET_CLS_RSVP is not set # CONFIG_NET_CLS_RSVP6 is not set # CONFIG_NET_CLS_FLOW is not set +# CONFIG_NET_CLS_CGROUP is not set CONFIG_NET_EMATCH=y CONFIG_NET_EMATCH_STACK=32 # CONFIG_NET_EMATCH_CMP is not set @@ -649,7 +685,9 @@ CONFIG_NET_CLS_ACT=y # CONFIG_NET_ACT_NAT is not set # CONFIG_NET_ACT_PEDIT is not set # CONFIG_NET_ACT_SIMP is not set +# CONFIG_NET_ACT_SKBEDIT is not set CONFIG_NET_SCH_FIFO=y +# CONFIG_DCB is not set # # Network testing @@ -666,29 +704,33 @@ CONFIG_HAMRADIO=y # CONFIG_IRDA is not set # CONFIG_BT is not set # CONFIG_AF_RXRPC is not set +# CONFIG_PHONET is not set CONFIG_FIB_RULES=y - -# -# Wireless -# +CONFIG_WIRELESS=y CONFIG_CFG80211=y +# CONFIG_CFG80211_REG_DEBUG is not set CONFIG_NL80211=y +CONFIG_WIRELESS_OLD_REGULATORY=y CONFIG_WIRELESS_EXT=y CONFIG_WIRELESS_EXT_SYSFS=y +# CONFIG_LIB80211 is not set CONFIG_MAC80211=y # # Rate control algorithm selection # -CONFIG_MAC80211_RC_PID=y -CONFIG_MAC80211_RC_DEFAULT_PID=y -CONFIG_MAC80211_RC_DEFAULT="pid" +CONFIG_MAC80211_RC_MINSTREL=y +# CONFIG_MAC80211_RC_DEFAULT_PID is not set +CONFIG_MAC80211_RC_DEFAULT_MINSTREL=y +CONFIG_MAC80211_RC_DEFAULT="minstrel" # CONFIG_MAC80211_MESH is not set CONFIG_MAC80211_LEDS=y # CONFIG_MAC80211_DEBUGFS is not set # CONFIG_MAC80211_DEBUG_MENU is not set -# CONFIG_IEEE80211 is not set -# CONFIG_RFKILL is not set +# CONFIG_WIMAX is not set +CONFIG_RFKILL=y +# CONFIG_RFKILL_INPUT is not set +CONFIG_RFKILL_LEDS=y # CONFIG_NET_9P is not set # @@ -712,7 +754,7 @@ CONFIG_PROC_EVENTS=y # CONFIG_MTD is not set # CONFIG_PARPORT is not set CONFIG_PNP=y -# CONFIG_PNP_DEBUG is not set +CONFIG_PNP_DEBUG_MESSAGES=y # # Protocols @@ -740,21 +782,21 @@ CONFIG_BLK_DEV_RAM_SIZE=16384 CONFIG_MISC_DEVICES=y # CONFIG_IBM_ASM is not set # CONFIG_PHANTOM is not set -# CONFIG_EEPROM_93CX6 is not set # CONFIG_SGI_IOC4 is not set # CONFIG_TIFM_CORE is not set -# CONFIG_ACER_WMI is not set -# CONFIG_ASUS_LAPTOP is not set -# CONFIG_FUJITSU_LAPTOP is not set -# CONFIG_MSI_LAPTOP is not set -# CONFIG_COMPAL_LAPTOP is not set -# CONFIG_SONY_LAPTOP is not set -# CONFIG_THINKPAD_ACPI is not set -# CONFIG_INTEL_MENLOW is not set +# CONFIG_ICS932S401 is not set # CONFIG_ENCLOSURE_SERVICES is not set # CONFIG_SGI_XP is not set # CONFIG_HP_ILO is not set # CONFIG_SGI_GRU is not set +# CONFIG_C2PORT is not set + +# +# EEPROM support +# +# CONFIG_EEPROM_AT24 is not set +# CONFIG_EEPROM_LEGACY is not set +# CONFIG_EEPROM_93CX6 is not set CONFIG_HAVE_IDE=y # CONFIG_IDE is not set @@ -793,7 +835,7 @@ CONFIG_SCSI_WAIT_SCAN=m # CONFIG_SCSI_SPI_ATTRS=y # CONFIG_SCSI_FC_ATTRS is not set -CONFIG_SCSI_ISCSI_ATTRS=y +# CONFIG_SCSI_ISCSI_ATTRS is not set # CONFIG_SCSI_SAS_ATTRS is not set # CONFIG_SCSI_SAS_LIBSAS is not set # CONFIG_SCSI_SRP_ATTRS is not set @@ -864,6 +906,7 @@ CONFIG_PATA_OLDPIIX=y CONFIG_PATA_SCH=y CONFIG_MD=y CONFIG_BLK_DEV_MD=y +CONFIG_MD_AUTODETECT=y # CONFIG_MD_LINEAR is not set # CONFIG_MD_RAID0 is not set # CONFIG_MD_RAID1 is not set @@ -919,6 +962,9 @@ CONFIG_PHYLIB=y # CONFIG_BROADCOM_PHY is not set # CONFIG_ICPLUS_PHY is not set # CONFIG_REALTEK_PHY is not set +# CONFIG_NATIONAL_PHY is not set +# CONFIG_STE10XP is not set +# CONFIG_LSI_ET1011C_PHY is not set # CONFIG_FIXED_PHY is not set # CONFIG_MDIO_BITBANG is not set CONFIG_NET_ETHERNET=y @@ -942,6 +988,9 @@ CONFIG_NET_TULIP=y # CONFIG_IBM_NEW_EMAC_RGMII is not set # CONFIG_IBM_NEW_EMAC_TAH is not set # CONFIG_IBM_NEW_EMAC_EMAC4 is not set +# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set +# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set +# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set CONFIG_NET_PCI=y # CONFIG_PCNET32 is not set # CONFIG_AMD8111_ETH is not set @@ -949,7 +998,6 @@ CONFIG_NET_PCI=y # CONFIG_B44 is not set CONFIG_FORCEDETH=y # CONFIG_FORCEDETH_NAPI is not set -# CONFIG_EEPRO100 is not set CONFIG_E100=y # CONFIG_FEALNX is not set # CONFIG_NATSEMI is not set @@ -963,15 +1011,16 @@ CONFIG_8139TOO_PIO=y # CONFIG_R6040 is not set # CONFIG_SIS900 is not set # CONFIG_EPIC100 is not set +# CONFIG_SMSC9420 is not set # CONFIG_SUNDANCE is not set # CONFIG_TLAN is not set # CONFIG_VIA_RHINE is not set # CONFIG_SC92031 is not set +# CONFIG_ATL2 is not set CONFIG_NETDEV_1000=y # CONFIG_ACENIC is not set # CONFIG_DL2K is not set CONFIG_E1000=y -# CONFIG_E1000_DISABLE_PACKET_SPLIT is not set # CONFIG_E1000E is not set # CONFIG_IP1000 is not set # CONFIG_IGB is not set @@ -989,18 +1038,23 @@ CONFIG_TIGON3=y # CONFIG_QLA3XXX is not set # CONFIG_ATL1 is not set # CONFIG_ATL1E is not set +# CONFIG_JME is not set CONFIG_NETDEV_10000=y # CONFIG_CHELSIO_T1 is not set +CONFIG_CHELSIO_T3_DEPENDS=y # CONFIG_CHELSIO_T3 is not set +# CONFIG_ENIC is not set # CONFIG_IXGBE is not set # CONFIG_IXGB is not set # CONFIG_S2IO is not set # CONFIG_MYRI10GE is not set # CONFIG_NETXEN_NIC is not set # CONFIG_NIU is not set +# CONFIG_MLX4_EN is not set # CONFIG_MLX4_CORE is not set # CONFIG_TEHUTI is not set # CONFIG_BNX2X is not set +# CONFIG_QLGE is not set # CONFIG_SFC is not set CONFIG_TR=y # CONFIG_IBMOL is not set @@ -1013,9 +1067,8 @@ CONFIG_TR=y # CONFIG_WLAN_PRE80211 is not set CONFIG_WLAN_80211=y # CONFIG_PCMCIA_RAYCS is not set -# CONFIG_IPW2100 is not set -# CONFIG_IPW2200 is not set # CONFIG_LIBERTAS is not set +# CONFIG_LIBERTAS_THINFIRM is not set # CONFIG_AIRO is not set # CONFIG_HERMES is not set # CONFIG_ATMEL is not set @@ -1032,6 +1085,8 @@ CONFIG_WLAN_80211=y CONFIG_ATH5K=y # CONFIG_ATH5K_DEBUG is not set # CONFIG_ATH9K is not set +# CONFIG_IPW2100 is not set +# CONFIG_IPW2200 is not set # CONFIG_IWLCORE is not set # CONFIG_IWLWIFI_LEDS is not set # CONFIG_IWLAGN is not set @@ -1043,6 +1098,10 @@ CONFIG_ATH5K=y # CONFIG_RT2X00 is not set # +# Enable WiMAX (Networking options) to see the WiMAX drivers +# + +# # USB Network Adapters # # CONFIG_USB_CATC is not set @@ -1050,6 +1109,7 @@ CONFIG_ATH5K=y # CONFIG_USB_PEGASUS is not set # CONFIG_USB_RTL8150 is not set # CONFIG_USB_USBNET is not set +# CONFIG_USB_HSO is not set CONFIG_NET_PCMCIA=y # CONFIG_PCMCIA_3C589 is not set # CONFIG_PCMCIA_3C574 is not set @@ -1059,6 +1119,7 @@ CONFIG_NET_PCMCIA=y # CONFIG_PCMCIA_SMC91C92 is not set # CONFIG_PCMCIA_XIRC2PS is not set # CONFIG_PCMCIA_AXNET is not set +# CONFIG_PCMCIA_IBMTR is not set # CONFIG_WAN is not set CONFIG_FDDI=y # CONFIG_DEFXX is not set @@ -1110,6 +1171,7 @@ CONFIG_MOUSE_PS2_LOGIPS2PP=y CONFIG_MOUSE_PS2_SYNAPTICS=y CONFIG_MOUSE_PS2_LIFEBOOK=y CONFIG_MOUSE_PS2_TRACKPOINT=y +# CONFIG_MOUSE_PS2_ELANTECH is not set # CONFIG_MOUSE_PS2_TOUCHKIT is not set # CONFIG_MOUSE_SERIAL is not set # CONFIG_MOUSE_APPLETOUCH is not set @@ -1147,15 +1209,16 @@ CONFIG_INPUT_TOUCHSCREEN=y # CONFIG_TOUCHSCREEN_FUJITSU is not set # CONFIG_TOUCHSCREEN_GUNZE is not set # CONFIG_TOUCHSCREEN_ELO is not set +# CONFIG_TOUCHSCREEN_WACOM_W8001 is not set # CONFIG_TOUCHSCREEN_MTOUCH is not set # CONFIG_TOUCHSCREEN_INEXIO is not set # CONFIG_TOUCHSCREEN_MK712 is not set # CONFIG_TOUCHSCREEN_PENMOUNT is not set # CONFIG_TOUCHSCREEN_TOUCHRIGHT is not set # CONFIG_TOUCHSCREEN_TOUCHWIN is not set -# CONFIG_TOUCHSCREEN_UCB1400 is not set # CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set # CONFIG_TOUCHSCREEN_TOUCHIT213 is not set +# CONFIG_TOUCHSCREEN_TSC2007 is not set CONFIG_INPUT_MISC=y # CONFIG_INPUT_PCSPKR is not set # CONFIG_INPUT_APANEL is not set @@ -1165,6 +1228,7 @@ CONFIG_INPUT_MISC=y # CONFIG_INPUT_KEYSPAN_REMOTE is not set # CONFIG_INPUT_POWERMATE is not set # CONFIG_INPUT_YEALINK is not set +# CONFIG_INPUT_CM109 is not set # CONFIG_INPUT_UINPUT is not set # @@ -1231,6 +1295,7 @@ CONFIG_SERIAL_CORE=y CONFIG_SERIAL_CORE_CONSOLE=y # CONFIG_SERIAL_JSM is not set CONFIG_UNIX98_PTYS=y +# CONFIG_DEVPTS_MULTIPLE_INSTANCES is not set # CONFIG_LEGACY_PTYS is not set # CONFIG_IPMI_HANDLER is not set CONFIG_HW_RANDOM=y @@ -1260,6 +1325,7 @@ CONFIG_I2C=y CONFIG_I2C_BOARDINFO=y # CONFIG_I2C_CHARDEV is not set CONFIG_I2C_HELPER_AUTO=y +CONFIG_I2C_ALGOBIT=y # # I2C Hardware Bus support @@ -1311,8 +1377,6 @@ CONFIG_I2C_I801=y # Miscellaneous I2C Chip support # # CONFIG_DS1682 is not set -# CONFIG_EEPROM_AT24 is not set -# CONFIG_EEPROM_LEGACY is not set # CONFIG_SENSORS_PCF8574 is not set # CONFIG_PCF8575 is not set # CONFIG_SENSORS_PCA9539 is not set @@ -1331,8 +1395,78 @@ CONFIG_POWER_SUPPLY=y # CONFIG_POWER_SUPPLY_DEBUG is not set # CONFIG_PDA_POWER is not set # CONFIG_BATTERY_DS2760 is not set -# CONFIG_HWMON is not set +# CONFIG_BATTERY_BQ27x00 is not set +CONFIG_HWMON=y +# CONFIG_HWMON_VID is not set +# CONFIG_SENSORS_ABITUGURU is not set +# CONFIG_SENSORS_ABITUGURU3 is not set +# CONFIG_SENSORS_AD7414 is not set +# CONFIG_SENSORS_AD7418 is not set +# CONFIG_SENSORS_ADM1021 is not set +# CONFIG_SENSORS_ADM1025 is not set +# CONFIG_SENSORS_ADM1026 is not set +# CONFIG_SENSORS_ADM1029 is not set +# CONFIG_SENSORS_ADM1031 is not set +# CONFIG_SENSORS_ADM9240 is not set +# CONFIG_SENSORS_ADT7462 is not set +# CONFIG_SENSORS_ADT7470 is not set +# CONFIG_SENSORS_ADT7473 is not set +# CONFIG_SENSORS_ADT7475 is not set +# CONFIG_SENSORS_K8TEMP is not set +# CONFIG_SENSORS_ASB100 is not set +# CONFIG_SENSORS_ATXP1 is not set +# CONFIG_SENSORS_DS1621 is not set +# CONFIG_SENSORS_I5K_AMB is not set +# CONFIG_SENSORS_F71805F is not set +# CONFIG_SENSORS_F71882FG is not set +# CONFIG_SENSORS_F75375S is not set +# CONFIG_SENSORS_FSCHER is not set +# CONFIG_SENSORS_FSCPOS is not set +# CONFIG_SENSORS_FSCHMD is not set +# CONFIG_SENSORS_GL518SM is not set +# CONFIG_SENSORS_GL520SM is not set +# CONFIG_SENSORS_CORETEMP is not set +# CONFIG_SENSORS_IT87 is not set +# CONFIG_SENSORS_LM63 is not set +# CONFIG_SENSORS_LM75 is not set +# CONFIG_SENSORS_LM77 is not set +# CONFIG_SENSORS_LM78 is not set +# CONFIG_SENSORS_LM80 is not set +# CONFIG_SENSORS_LM83 is not set +# CONFIG_SENSORS_LM85 is not set +# CONFIG_SENSORS_LM87 is not set +# CONFIG_SENSORS_LM90 is not set +# CONFIG_SENSORS_LM92 is not set +# CONFIG_SENSORS_LM93 is not set +# CONFIG_SENSORS_LTC4245 is not set +# CONFIG_SENSORS_MAX1619 is not set +# CONFIG_SENSORS_MAX6650 is not set +# CONFIG_SENSORS_PC87360 is not set +# CONFIG_SENSORS_PC87427 is not set +# CONFIG_SENSORS_SIS5595 is not set +# CONFIG_SENSORS_DME1737 is not set +# CONFIG_SENSORS_SMSC47M1 is not set +# CONFIG_SENSORS_SMSC47M192 is not set +# CONFIG_SENSORS_SMSC47B397 is not set +# CONFIG_SENSORS_ADS7828 is not set +# CONFIG_SENSORS_THMC50 is not set +# CONFIG_SENSORS_VIA686A is not set +# CONFIG_SENSORS_VT1211 is not set +# CONFIG_SENSORS_VT8231 is not set +# CONFIG_SENSORS_W83781D is not set +# CONFIG_SENSORS_W83791D is not set +# CONFIG_SENSORS_W83792D is not set +# CONFIG_SENSORS_W83793 is not set +# CONFIG_SENSORS_W83L785TS is not set +# CONFIG_SENSORS_W83L786NG is not set +# CONFIG_SENSORS_W83627HF is not set +# CONFIG_SENSORS_W83627EHF is not set +# CONFIG_SENSORS_HDAPS is not set +# CONFIG_SENSORS_LIS3LV02D is not set +# CONFIG_SENSORS_APPLESMC is not set +# CONFIG_HWMON_DEBUG_CHIP is not set CONFIG_THERMAL=y +# CONFIG_THERMAL_HWMON is not set CONFIG_WATCHDOG=y # CONFIG_WATCHDOG_NOWAYOUT is not set @@ -1352,15 +1486,18 @@ CONFIG_WATCHDOG=y # CONFIG_I6300ESB_WDT is not set # CONFIG_ITCO_WDT is not set # CONFIG_IT8712F_WDT is not set +# CONFIG_IT87_WDT is not set # CONFIG_HP_WATCHDOG is not set # CONFIG_SC1200_WDT is not set # CONFIG_PC87413_WDT is not set # CONFIG_60XX_WDT is not set # CONFIG_SBC8360_WDT is not set # CONFIG_CPU5_WDT is not set +# CONFIG_SMSC_SCH311X_WDT is not set # CONFIG_SMSC37B787_WDT is not set # CONFIG_W83627HF_WDT is not set # CONFIG_W83697HF_WDT is not set +# CONFIG_W83697UG_WDT is not set # CONFIG_W83877F_WDT is not set # CONFIG_W83977F_WDT is not set # CONFIG_MACHZ_WDT is not set @@ -1376,11 +1513,11 @@ CONFIG_WATCHDOG=y # USB-based Watchdog Cards # # CONFIG_USBPCWATCHDOG is not set +CONFIG_SSB_POSSIBLE=y # # Sonics Silicon Backplane # -CONFIG_SSB_POSSIBLE=y # CONFIG_SSB is not set # @@ -1389,7 +1526,13 @@ CONFIG_SSB_POSSIBLE=y # CONFIG_MFD_CORE is not set # CONFIG_MFD_SM501 is not set # CONFIG_HTC_PASIC3 is not set +# CONFIG_TWL4030_CORE is not set # CONFIG_MFD_TMIO is not set +# CONFIG_PMIC_DA903X is not set +# CONFIG_MFD_WM8400 is not set +# CONFIG_MFD_WM8350_I2C is not set +# CONFIG_MFD_PCF50633 is not set +# CONFIG_REGULATOR is not set # # Multimedia devices @@ -1423,6 +1566,7 @@ CONFIG_DRM=y # CONFIG_DRM_I810 is not set # CONFIG_DRM_I830 is not set CONFIG_DRM_I915=y +CONFIG_DRM_I915_KMS=y # CONFIG_DRM_MGA is not set # CONFIG_DRM_SIS is not set # CONFIG_DRM_VIA is not set @@ -1432,6 +1576,7 @@ CONFIG_DRM_I915=y CONFIG_FB=y # CONFIG_FIRMWARE_EDID is not set # CONFIG_FB_DDC is not set +# CONFIG_FB_BOOT_VESA_SUPPORT is not set CONFIG_FB_CFB_FILLRECT=y CONFIG_FB_CFB_COPYAREA=y CONFIG_FB_CFB_IMAGEBLIT=y @@ -1460,7 +1605,6 @@ CONFIG_FB_TILEBLITTING=y # CONFIG_FB_UVESA is not set # CONFIG_FB_VESA is not set CONFIG_FB_EFI=y -# CONFIG_FB_IMAC is not set # CONFIG_FB_N411 is not set # CONFIG_FB_HGA is not set # CONFIG_FB_S1D13XXX is not set @@ -1475,6 +1619,7 @@ CONFIG_FB_EFI=y # CONFIG_FB_S3 is not set # CONFIG_FB_SAVAGE is not set # CONFIG_FB_SIS is not set +# CONFIG_FB_VIA is not set # CONFIG_FB_NEOMAGIC is not set # CONFIG_FB_KYRO is not set # CONFIG_FB_3DFX is not set @@ -1486,12 +1631,15 @@ CONFIG_FB_EFI=y # CONFIG_FB_CARMINE is not set # CONFIG_FB_GEODE is not set # CONFIG_FB_VIRTUAL is not set +# CONFIG_FB_METRONOME is not set +# CONFIG_FB_MB862XX is not set CONFIG_BACKLIGHT_LCD_SUPPORT=y # CONFIG_LCD_CLASS_DEVICE is not set CONFIG_BACKLIGHT_CLASS_DEVICE=y -# CONFIG_BACKLIGHT_CORGI is not set +CONFIG_BACKLIGHT_GENERIC=y # CONFIG_BACKLIGHT_PROGEAR is not set # CONFIG_BACKLIGHT_MBP_NVIDIA is not set +# CONFIG_BACKLIGHT_SAHARA is not set # # Display device support @@ -1511,10 +1659,12 @@ CONFIG_LOGO=y # CONFIG_LOGO_LINUX_VGA16 is not set CONFIG_LOGO_LINUX_CLUT224=y CONFIG_SOUND=y +CONFIG_SOUND_OSS_CORE=y CONFIG_SND=y CONFIG_SND_TIMER=y CONFIG_SND_PCM=y CONFIG_SND_HWDEP=y +CONFIG_SND_JACK=y CONFIG_SND_SEQUENCER=y CONFIG_SND_SEQ_DUMMY=y CONFIG_SND_OSSEMUL=y @@ -1522,6 +1672,8 @@ CONFIG_SND_MIXER_OSS=y CONFIG_SND_PCM_OSS=y CONFIG_SND_PCM_OSS_PLUGINS=y CONFIG_SND_SEQUENCER_OSS=y +CONFIG_SND_HRTIMER=y +CONFIG_SND_SEQ_HRTIMER_DEFAULT=y CONFIG_SND_DYNAMIC_MINORS=y CONFIG_SND_SUPPORT_OLD_API=y CONFIG_SND_VERBOSE_PROCFS=y @@ -1575,11 +1727,16 @@ CONFIG_SND_PCI=y # CONFIG_SND_FM801 is not set CONFIG_SND_HDA_INTEL=y CONFIG_SND_HDA_HWDEP=y +# CONFIG_SND_HDA_RECONFIG is not set +# CONFIG_SND_HDA_INPUT_BEEP is not set CONFIG_SND_HDA_CODEC_REALTEK=y CONFIG_SND_HDA_CODEC_ANALOG=y CONFIG_SND_HDA_CODEC_SIGMATEL=y CONFIG_SND_HDA_CODEC_VIA=y CONFIG_SND_HDA_CODEC_ATIHDMI=y +CONFIG_SND_HDA_CODEC_NVHDMI=y +CONFIG_SND_HDA_CODEC_INTELHDMI=y +CONFIG_SND_HDA_ELD=y CONFIG_SND_HDA_CODEC_CONEXANT=y CONFIG_SND_HDA_CODEC_CMEDIA=y CONFIG_SND_HDA_CODEC_SI3054=y @@ -1612,6 +1769,7 @@ CONFIG_SND_USB=y # CONFIG_SND_USB_AUDIO is not set # CONFIG_SND_USB_USX2Y is not set # CONFIG_SND_USB_CAIAQ is not set +# CONFIG_SND_USB_US122L is not set CONFIG_SND_PCMCIA=y # CONFIG_SND_VXPOCKET is not set # CONFIG_SND_PDAUDIOCF is not set @@ -1626,15 +1784,37 @@ CONFIG_HIDRAW=y # USB Input Devices # CONFIG_USB_HID=y -CONFIG_USB_HIDINPUT_POWERBOOK=y -CONFIG_HID_FF=y CONFIG_HID_PID=y +CONFIG_USB_HIDDEV=y + +# +# Special HID drivers +# +CONFIG_HID_COMPAT=y +CONFIG_HID_A4TECH=y +CONFIG_HID_APPLE=y +CONFIG_HID_BELKIN=y +CONFIG_HID_CHERRY=y +CONFIG_HID_CHICONY=y +CONFIG_HID_CYPRESS=y +CONFIG_HID_EZKEY=y +CONFIG_HID_GYRATION=y +CONFIG_HID_LOGITECH=y CONFIG_LOGITECH_FF=y # CONFIG_LOGIRUMBLEPAD2_FF is not set +CONFIG_HID_MICROSOFT=y +CONFIG_HID_MONTEREY=y +CONFIG_HID_NTRIG=y +CONFIG_HID_PANTHERLORD=y CONFIG_PANTHERLORD_FF=y +CONFIG_HID_PETALYNX=y +CONFIG_HID_SAMSUNG=y +CONFIG_HID_SONY=y +CONFIG_HID_SUNPLUS=y +# CONFIG_GREENASIA_FF is not set +CONFIG_HID_TOPSEED=y CONFIG_THRUSTMASTER_FF=y CONFIG_ZEROPLUS_FF=y -CONFIG_USB_HIDDEV=y CONFIG_USB_SUPPORT=y CONFIG_USB_ARCH_HAS_HCD=y CONFIG_USB_ARCH_HAS_OHCI=y @@ -1652,6 +1832,8 @@ CONFIG_USB_DEVICEFS=y CONFIG_USB_SUSPEND=y # CONFIG_USB_OTG is not set CONFIG_USB_MON=y +# CONFIG_USB_WUSB is not set +# CONFIG_USB_WUSB_CBAF is not set # # USB Host Controller Drivers @@ -1660,6 +1842,7 @@ CONFIG_USB_MON=y CONFIG_USB_EHCI_HCD=y # CONFIG_USB_EHCI_ROOT_HUB_TT is not set # CONFIG_USB_EHCI_TT_NEWSCHED is not set +# CONFIG_USB_OXU210HP_HCD is not set # CONFIG_USB_ISP116X_HCD is not set # CONFIG_USB_ISP1760_HCD is not set CONFIG_USB_OHCI_HCD=y @@ -1669,6 +1852,8 @@ CONFIG_USB_OHCI_LITTLE_ENDIAN=y CONFIG_USB_UHCI_HCD=y # CONFIG_USB_SL811_HCD is not set # CONFIG_USB_R8A66597_HCD is not set +# CONFIG_USB_WHCI_HCD is not set +# CONFIG_USB_HWA_HCD is not set # # USB Device Class drivers @@ -1676,20 +1861,20 @@ CONFIG_USB_UHCI_HCD=y # CONFIG_USB_ACM is not set CONFIG_USB_PRINTER=y # CONFIG_USB_WDM is not set +# CONFIG_USB_TMC is not set # -# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support' +# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may also be needed; # # -# may also be needed; see USB_STORAGE Help for more information +# see USB_STORAGE Help for more information # CONFIG_USB_STORAGE=y # CONFIG_USB_STORAGE_DEBUG is not set # CONFIG_USB_STORAGE_DATAFAB is not set # CONFIG_USB_STORAGE_FREECOM is not set # CONFIG_USB_STORAGE_ISD200 is not set -# CONFIG_USB_STORAGE_DPCM is not set # CONFIG_USB_STORAGE_USBAT is not set # CONFIG_USB_STORAGE_SDDR09 is not set # CONFIG_USB_STORAGE_SDDR55 is not set @@ -1697,7 +1882,6 @@ CONFIG_USB_STORAGE=y # CONFIG_USB_STORAGE_ALAUDA is not set # CONFIG_USB_STORAGE_ONETOUCH is not set # CONFIG_USB_STORAGE_KARMA is not set -# CONFIG_USB_STORAGE_SIERRA is not set # CONFIG_USB_STORAGE_CYPRESS_ATACB is not set CONFIG_USB_LIBUSUAL=y @@ -1718,6 +1902,7 @@ CONFIG_USB_LIBUSUAL=y # CONFIG_USB_EMI62 is not set # CONFIG_USB_EMI26 is not set # CONFIG_USB_ADUTUX is not set +# CONFIG_USB_SEVSEG is not set # CONFIG_USB_RIO500 is not set # CONFIG_USB_LEGOTOWER is not set # CONFIG_USB_LCD is not set @@ -1735,7 +1920,13 @@ CONFIG_USB_LIBUSUAL=y # CONFIG_USB_IOWARRIOR is not set # CONFIG_USB_TEST is not set # CONFIG_USB_ISIGHTFW is not set +# CONFIG_USB_VST is not set # CONFIG_USB_GADGET is not set + +# +# OTG and related infrastructure +# +# CONFIG_UWB is not set # CONFIG_MMC is not set # CONFIG_MEMSTICK is not set CONFIG_NEW_LEDS=y @@ -1744,6 +1935,7 @@ CONFIG_LEDS_CLASS=y # # LED drivers # +# CONFIG_LEDS_ALIX2 is not set # CONFIG_LEDS_PCA9532 is not set # CONFIG_LEDS_CLEVO_MAIL is not set # CONFIG_LEDS_PCA955X is not set @@ -1754,6 +1946,7 @@ CONFIG_LEDS_CLASS=y CONFIG_LEDS_TRIGGERS=y # CONFIG_LEDS_TRIGGER_TIMER is not set # CONFIG_LEDS_TRIGGER_HEARTBEAT is not set +# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set # CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set # CONFIG_ACCESSIBILITY is not set # CONFIG_INFINIBAND is not set @@ -1793,6 +1986,7 @@ CONFIG_RTC_INTF_DEV=y # CONFIG_RTC_DRV_M41T80 is not set # CONFIG_RTC_DRV_S35390A is not set # CONFIG_RTC_DRV_FM3130 is not set +# CONFIG_RTC_DRV_RX8581 is not set # # SPI RTC drivers @@ -1802,12 +1996,15 @@ CONFIG_RTC_INTF_DEV=y # Platform RTC drivers # CONFIG_RTC_DRV_CMOS=y +# CONFIG_RTC_DRV_DS1286 is not set # CONFIG_RTC_DRV_DS1511 is not set # CONFIG_RTC_DRV_DS1553 is not set # CONFIG_RTC_DRV_DS1742 is not set # CONFIG_RTC_DRV_STK17TA8 is not set # CONFIG_RTC_DRV_M48T86 is not set +# CONFIG_RTC_DRV_M48T35 is not set # CONFIG_RTC_DRV_M48T59 is not set +# CONFIG_RTC_DRV_BQ4802 is not set # CONFIG_RTC_DRV_V3020 is not set # @@ -1820,6 +2017,21 @@ CONFIG_DMADEVICES=y # # CONFIG_INTEL_IOATDMA is not set # CONFIG_UIO is not set +# CONFIG_STAGING is not set +CONFIG_X86_PLATFORM_DEVICES=y +# CONFIG_ACER_WMI is not set +# CONFIG_ASUS_LAPTOP is not set +# CONFIG_FUJITSU_LAPTOP is not set +# CONFIG_MSI_LAPTOP is not set +# CONFIG_PANASONIC_LAPTOP is not set +# CONFIG_COMPAL_LAPTOP is not set +# CONFIG_SONY_LAPTOP is not set +# CONFIG_THINKPAD_ACPI is not set +# CONFIG_INTEL_MENLOW is not set +CONFIG_EEEPC_LAPTOP=y +# CONFIG_ACPI_WMI is not set +# CONFIG_ACPI_ASUS is not set +# CONFIG_ACPI_TOSHIBA is not set # # Firmware Drivers @@ -1830,8 +2042,7 @@ CONFIG_EFI_VARS=y # CONFIG_DELL_RBU is not set # CONFIG_DCDBAS is not set CONFIG_DMIID=y -CONFIG_ISCSI_IBFT_FIND=y -CONFIG_ISCSI_IBFT=y +# CONFIG_ISCSI_IBFT_FIND is not set # # File systems @@ -1841,22 +2052,25 @@ CONFIG_EXT3_FS=y CONFIG_EXT3_FS_XATTR=y CONFIG_EXT3_FS_POSIX_ACL=y CONFIG_EXT3_FS_SECURITY=y -# CONFIG_EXT4DEV_FS is not set +# CONFIG_EXT4_FS is not set CONFIG_JBD=y # CONFIG_JBD_DEBUG is not set CONFIG_FS_MBCACHE=y # CONFIG_REISERFS_FS is not set # CONFIG_JFS_FS is not set CONFIG_FS_POSIX_ACL=y +CONFIG_FILE_LOCKING=y # CONFIG_XFS_FS is not set # CONFIG_GFS2_FS is not set # CONFIG_OCFS2_FS is not set +# CONFIG_BTRFS_FS is not set CONFIG_DNOTIFY=y CONFIG_INOTIFY=y CONFIG_INOTIFY_USER=y CONFIG_QUOTA=y CONFIG_QUOTA_NETLINK_INTERFACE=y # CONFIG_PRINT_QUOTA_WARNING is not set +CONFIG_QUOTA_TREE=y # CONFIG_QFMT_V1 is not set CONFIG_QFMT_V2=y CONFIG_QUOTACTL=y @@ -1890,16 +2104,14 @@ CONFIG_PROC_FS=y CONFIG_PROC_KCORE=y CONFIG_PROC_VMCORE=y CONFIG_PROC_SYSCTL=y +CONFIG_PROC_PAGE_MONITOR=y CONFIG_SYSFS=y CONFIG_TMPFS=y CONFIG_TMPFS_POSIX_ACL=y CONFIG_HUGETLBFS=y CONFIG_HUGETLB_PAGE=y # CONFIG_CONFIGFS_FS is not set - -# -# Miscellaneous filesystems -# +CONFIG_MISC_FILESYSTEMS=y # CONFIG_ADFS_FS is not set # CONFIG_AFFS_FS is not set # CONFIG_ECRYPT_FS is not set @@ -1909,6 +2121,7 @@ CONFIG_HUGETLB_PAGE=y # CONFIG_BFS_FS is not set # CONFIG_EFS_FS is not set # CONFIG_CRAMFS is not set +# CONFIG_SQUASHFS is not set # CONFIG_VXFS_FS is not set # CONFIG_MINIX_FS is not set # CONFIG_OMFS_FS is not set @@ -1930,6 +2143,7 @@ CONFIG_NFS_ACL_SUPPORT=y CONFIG_NFS_COMMON=y CONFIG_SUNRPC=y CONFIG_SUNRPC_GSS=y +# CONFIG_SUNRPC_REGISTER_V4 is not set CONFIG_RPCSEC_GSS_KRB5=y # CONFIG_RPCSEC_GSS_SPKM3 is not set # CONFIG_SMB_FS is not set @@ -2006,7 +2220,7 @@ CONFIG_NLS_UTF8=y # CONFIG_TRACE_IRQFLAGS_SUPPORT=y CONFIG_PRINTK_TIME=y -CONFIG_ENABLE_WARN_DEPRECATED=y +# CONFIG_ENABLE_WARN_DEPRECATED is not set CONFIG_ENABLE_MUST_CHECK=y CONFIG_FRAME_WARN=2048 CONFIG_MAGIC_SYSRQ=y @@ -2035,40 +2249,60 @@ CONFIG_TIMER_STATS=y CONFIG_DEBUG_BUGVERBOSE=y # CONFIG_DEBUG_INFO is not set # CONFIG_DEBUG_VM is not set +# CONFIG_DEBUG_VIRTUAL is not set # CONFIG_DEBUG_WRITECOUNT is not set CONFIG_DEBUG_MEMORY_INIT=y # CONFIG_DEBUG_LIST is not set # CONFIG_DEBUG_SG is not set +# CONFIG_DEBUG_NOTIFIERS is not set +CONFIG_ARCH_WANT_FRAME_POINTERS=y CONFIG_FRAME_POINTER=y # CONFIG_BOOT_PRINTK_DELAY is not set # CONFIG_RCU_TORTURE_TEST is not set +# CONFIG_RCU_CPU_STALL_DETECTOR is not set # CONFIG_KPROBES_SANITY_TEST is not set # CONFIG_BACKTRACE_SELF_TEST is not set +# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set # CONFIG_LKDTM is not set # CONFIG_FAULT_INJECTION is not set # CONFIG_LATENCYTOP is not set CONFIG_SYSCTL_SYSCALL_CHECK=y -CONFIG_HAVE_FTRACE=y +CONFIG_USER_STACKTRACE_SUPPORT=y +CONFIG_HAVE_FUNCTION_TRACER=y +CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y +CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y CONFIG_HAVE_DYNAMIC_FTRACE=y -# CONFIG_FTRACE is not set +CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y +CONFIG_HAVE_HW_BRANCH_TRACER=y + +# +# Tracers +# +# CONFIG_FUNCTION_TRACER is not set # CONFIG_IRQSOFF_TRACER is not set # CONFIG_SYSPROF_TRACER is not set # CONFIG_SCHED_TRACER is not set # CONFIG_CONTEXT_SWITCH_TRACER is not set +# CONFIG_BOOT_TRACER is not set +# CONFIG_TRACE_BRANCH_PROFILING is not set +# CONFIG_POWER_TRACER is not set +# CONFIG_STACK_TRACER is not set +# CONFIG_HW_BRANCH_TRACER is not set CONFIG_PROVIDE_OHCI1394_DMA_INIT=y +# CONFIG_DYNAMIC_PRINTK_DEBUG is not set # CONFIG_SAMPLES is not set CONFIG_HAVE_ARCH_KGDB=y # CONFIG_KGDB is not set # CONFIG_STRICT_DEVMEM is not set CONFIG_X86_VERBOSE_BOOTUP=y CONFIG_EARLY_PRINTK=y +CONFIG_EARLY_PRINTK_DBGP=y CONFIG_DEBUG_STACKOVERFLOW=y CONFIG_DEBUG_STACK_USAGE=y # CONFIG_DEBUG_PAGEALLOC is not set # CONFIG_DEBUG_PER_CPU_MAPS is not set # CONFIG_X86_PTDUMP is not set CONFIG_DEBUG_RODATA=y -# CONFIG_DIRECT_GBPAGES is not set # CONFIG_DEBUG_RODATA_TEST is not set CONFIG_DEBUG_NX_TEST=m # CONFIG_IOMMU_DEBUG is not set @@ -2092,8 +2326,10 @@ CONFIG_OPTIMIZE_INLINING=y CONFIG_KEYS=y CONFIG_KEYS_DEBUG_PROC_KEYS=y CONFIG_SECURITY=y +# CONFIG_SECURITYFS is not set CONFIG_SECURITY_NETWORK=y # CONFIG_SECURITY_NETWORK_XFRM is not set +# CONFIG_SECURITY_PATH is not set CONFIG_SECURITY_FILE_CAPABILITIES=y # CONFIG_SECURITY_ROOTPLUG is not set CONFIG_SECURITY_DEFAULT_MMAP_MIN_ADDR=65536 @@ -2104,7 +2340,6 @@ CONFIG_SECURITY_SELINUX_DISABLE=y CONFIG_SECURITY_SELINUX_DEVELOP=y CONFIG_SECURITY_SELINUX_AVC_STATS=y CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1 -# CONFIG_SECURITY_SELINUX_ENABLE_SECMARK_DEFAULT is not set # CONFIG_SECURITY_SELINUX_POLICYDB_VERSION_MAX is not set # CONFIG_SECURITY_SMACK is not set CONFIG_CRYPTO=y @@ -2112,11 +2347,18 @@ CONFIG_CRYPTO=y # # Crypto core or helper # +# CONFIG_CRYPTO_FIPS is not set CONFIG_CRYPTO_ALGAPI=y +CONFIG_CRYPTO_ALGAPI2=y CONFIG_CRYPTO_AEAD=y +CONFIG_CRYPTO_AEAD2=y CONFIG_CRYPTO_BLKCIPHER=y +CONFIG_CRYPTO_BLKCIPHER2=y CONFIG_CRYPTO_HASH=y +CONFIG_CRYPTO_HASH2=y +CONFIG_CRYPTO_RNG2=y CONFIG_CRYPTO_MANAGER=y +CONFIG_CRYPTO_MANAGER2=y # CONFIG_CRYPTO_GF128MUL is not set # CONFIG_CRYPTO_NULL is not set # CONFIG_CRYPTO_CRYPTD is not set @@ -2151,6 +2393,7 @@ CONFIG_CRYPTO_HMAC=y # Digest # # CONFIG_CRYPTO_CRC32C is not set +# CONFIG_CRYPTO_CRC32C_INTEL is not set # CONFIG_CRYPTO_MD4 is not set CONFIG_CRYPTO_MD5=y # CONFIG_CRYPTO_MICHAEL_MIC is not set @@ -2191,6 +2434,11 @@ CONFIG_CRYPTO_DES=y # # CONFIG_CRYPTO_DEFLATE is not set # CONFIG_CRYPTO_LZO is not set + +# +# Random Number Generation +# +# CONFIG_CRYPTO_ANSI_CPRNG is not set CONFIG_CRYPTO_HW=y # CONFIG_CRYPTO_DEV_HIFN_795X is not set CONFIG_HAVE_KVM=y @@ -2205,6 +2453,7 @@ CONFIG_VIRTUALIZATION=y CONFIG_BITREVERSE=y CONFIG_GENERIC_FIND_FIRST_BIT=y CONFIG_GENERIC_FIND_NEXT_BIT=y +CONFIG_GENERIC_FIND_LAST_BIT=y # CONFIG_CRC_CCITT is not set # CONFIG_CRC16 is not set CONFIG_CRC_T10DIF=y Index: linux-2.6-tip/arch/x86/ia32/ia32_signal.c =================================================================== --- linux-2.6-tip.orig/arch/x86/ia32/ia32_signal.c +++ linux-2.6-tip/arch/x86/ia32/ia32_signal.c @@ -33,8 +33,6 @@ #include #include -#define DEBUG_SIG 0 - #define _BLOCKABLE (~(sigmask(SIGKILL) | sigmask(SIGSTOP))) #define FIX_EFLAGS (X86_EFLAGS_AC | X86_EFLAGS_OF | \ @@ -46,78 +44,83 @@ void signal_fault(struct pt_regs *regs, int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from) { - int err; + int err = 0; if (!access_ok(VERIFY_WRITE, to, sizeof(compat_siginfo_t))) return -EFAULT; - /* If you change siginfo_t structure, please make sure that - this code is fixed accordingly. - It should never copy any pad contained in the structure - to avoid security leaks, but must copy the generic - 3 ints plus the relevant union member. */ - err = __put_user(from->si_signo, &to->si_signo); - err |= __put_user(from->si_errno, &to->si_errno); - err |= __put_user((short)from->si_code, &to->si_code); - - if (from->si_code < 0) { - err |= __put_user(from->si_pid, &to->si_pid); - err |= __put_user(from->si_uid, &to->si_uid); - err |= __put_user(ptr_to_compat(from->si_ptr), &to->si_ptr); - } else { - /* - * First 32bits of unions are always present: - * si_pid === si_band === si_tid === si_addr(LS half) - */ - err |= __put_user(from->_sifields._pad[0], - &to->_sifields._pad[0]); - switch (from->si_code >> 16) { - case __SI_FAULT >> 16: - break; - case __SI_CHLD >> 16: - err |= __put_user(from->si_utime, &to->si_utime); - err |= __put_user(from->si_stime, &to->si_stime); - err |= __put_user(from->si_status, &to->si_status); - /* FALL THROUGH */ - default: - case __SI_KILL >> 16: - err |= __put_user(from->si_uid, &to->si_uid); - break; - case __SI_POLL >> 16: - err |= __put_user(from->si_fd, &to->si_fd); - break; - case __SI_TIMER >> 16: - err |= __put_user(from->si_overrun, &to->si_overrun); - err |= __put_user(ptr_to_compat(from->si_ptr), - &to->si_ptr); - break; - /* This is not generated by the kernel as of now. */ - case __SI_RT >> 16: - case __SI_MESGQ >> 16: - err |= __put_user(from->si_uid, &to->si_uid); - err |= __put_user(from->si_int, &to->si_int); - break; + put_user_try { + /* If you change siginfo_t structure, please make sure that + this code is fixed accordingly. + It should never copy any pad contained in the structure + to avoid security leaks, but must copy the generic + 3 ints plus the relevant union member. */ + put_user_ex(from->si_signo, &to->si_signo); + put_user_ex(from->si_errno, &to->si_errno); + put_user_ex((short)from->si_code, &to->si_code); + + if (from->si_code < 0) { + put_user_ex(from->si_pid, &to->si_pid); + put_user_ex(from->si_uid, &to->si_uid); + put_user_ex(ptr_to_compat(from->si_ptr), &to->si_ptr); + } else { + /* + * First 32bits of unions are always present: + * si_pid === si_band === si_tid === si_addr(LS half) + */ + put_user_ex(from->_sifields._pad[0], + &to->_sifields._pad[0]); + switch (from->si_code >> 16) { + case __SI_FAULT >> 16: + break; + case __SI_CHLD >> 16: + put_user_ex(from->si_utime, &to->si_utime); + put_user_ex(from->si_stime, &to->si_stime); + put_user_ex(from->si_status, &to->si_status); + /* FALL THROUGH */ + default: + case __SI_KILL >> 16: + put_user_ex(from->si_uid, &to->si_uid); + break; + case __SI_POLL >> 16: + put_user_ex(from->si_fd, &to->si_fd); + break; + case __SI_TIMER >> 16: + put_user_ex(from->si_overrun, &to->si_overrun); + put_user_ex(ptr_to_compat(from->si_ptr), + &to->si_ptr); + break; + /* This is not generated by the kernel as of now. */ + case __SI_RT >> 16: + case __SI_MESGQ >> 16: + put_user_ex(from->si_uid, &to->si_uid); + put_user_ex(from->si_int, &to->si_int); + break; + } } - } + } put_user_catch(err); + return err; } int copy_siginfo_from_user32(siginfo_t *to, compat_siginfo_t __user *from) { - int err; + int err = 0; u32 ptr32; if (!access_ok(VERIFY_READ, from, sizeof(compat_siginfo_t))) return -EFAULT; - err = __get_user(to->si_signo, &from->si_signo); - err |= __get_user(to->si_errno, &from->si_errno); - err |= __get_user(to->si_code, &from->si_code); - - err |= __get_user(to->si_pid, &from->si_pid); - err |= __get_user(to->si_uid, &from->si_uid); - err |= __get_user(ptr32, &from->si_ptr); - to->si_ptr = compat_ptr(ptr32); + get_user_try { + get_user_ex(to->si_signo, &from->si_signo); + get_user_ex(to->si_errno, &from->si_errno); + get_user_ex(to->si_code, &from->si_code); + + get_user_ex(to->si_pid, &from->si_pid); + get_user_ex(to->si_uid, &from->si_uid); + get_user_ex(ptr32, &from->si_ptr); + to->si_ptr = compat_ptr(ptr32); + } get_user_catch(err); return err; } @@ -142,17 +145,23 @@ asmlinkage long sys32_sigaltstack(const struct pt_regs *regs) { stack_t uss, uoss; - int ret; + int ret, err = 0; mm_segment_t seg; if (uss_ptr) { u32 ptr; memset(&uss, 0, sizeof(stack_t)); - if (!access_ok(VERIFY_READ, uss_ptr, sizeof(stack_ia32_t)) || - __get_user(ptr, &uss_ptr->ss_sp) || - __get_user(uss.ss_flags, &uss_ptr->ss_flags) || - __get_user(uss.ss_size, &uss_ptr->ss_size)) + if (!access_ok(VERIFY_READ, uss_ptr, sizeof(stack_ia32_t))) + return -EFAULT; + + get_user_try { + get_user_ex(ptr, &uss_ptr->ss_sp); + get_user_ex(uss.ss_flags, &uss_ptr->ss_flags); + get_user_ex(uss.ss_size, &uss_ptr->ss_size); + } get_user_catch(err); + + if (err) return -EFAULT; uss.ss_sp = compat_ptr(ptr); } @@ -161,10 +170,16 @@ asmlinkage long sys32_sigaltstack(const ret = do_sigaltstack(uss_ptr ? &uss : NULL, &uoss, regs->sp); set_fs(seg); if (ret >= 0 && uoss_ptr) { - if (!access_ok(VERIFY_WRITE, uoss_ptr, sizeof(stack_ia32_t)) || - __put_user(ptr_to_compat(uoss.ss_sp), &uoss_ptr->ss_sp) || - __put_user(uoss.ss_flags, &uoss_ptr->ss_flags) || - __put_user(uoss.ss_size, &uoss_ptr->ss_size)) + if (!access_ok(VERIFY_WRITE, uoss_ptr, sizeof(stack_ia32_t))) + return -EFAULT; + + put_user_try { + put_user_ex(ptr_to_compat(uoss.ss_sp), &uoss_ptr->ss_sp); + put_user_ex(uoss.ss_flags, &uoss_ptr->ss_flags); + put_user_ex(uoss.ss_size, &uoss_ptr->ss_size); + } put_user_catch(err); + + if (err) ret = -EFAULT; } return ret; @@ -173,75 +188,78 @@ asmlinkage long sys32_sigaltstack(const /* * Do a signal return; undo the signal stack. */ +#define loadsegment_gs(v) load_gs_index(v) +#define loadsegment_fs(v) loadsegment(fs, v) +#define loadsegment_ds(v) loadsegment(ds, v) +#define loadsegment_es(v) loadsegment(es, v) + +#define get_user_seg(seg) ({ unsigned int v; savesegment(seg, v); v; }) +#define set_user_seg(seg, v) loadsegment_##seg(v) + #define COPY(x) { \ - err |= __get_user(regs->x, &sc->x); \ + get_user_ex(regs->x, &sc->x); \ } -#define COPY_SEG_CPL3(seg) { \ - unsigned short tmp; \ - err |= __get_user(tmp, &sc->seg); \ - regs->seg = tmp | 3; \ -} +#define GET_SEG(seg) ({ \ + unsigned short tmp; \ + get_user_ex(tmp, &sc->seg); \ + tmp; \ +}) + +#define COPY_SEG_CPL3(seg) do { \ + regs->seg = GET_SEG(seg) | 3; \ +} while (0) #define RELOAD_SEG(seg) { \ - unsigned int cur, pre; \ - err |= __get_user(pre, &sc->seg); \ - savesegment(seg, cur); \ + unsigned int pre = GET_SEG(seg); \ + unsigned int cur = get_user_seg(seg); \ pre |= 3; \ if (pre != cur) \ - loadsegment(seg, pre); \ + set_user_seg(seg, pre); \ } static int ia32_restore_sigcontext(struct pt_regs *regs, struct sigcontext_ia32 __user *sc, unsigned int *pax) { - unsigned int tmpflags, gs, oldgs, err = 0; + unsigned int tmpflags, err = 0; void __user *buf; u32 tmp; /* Always make any pending restarted system calls return -EINTR */ current_thread_info()->restart_block.fn = do_no_restart_syscall; -#if DEBUG_SIG - printk(KERN_DEBUG "SIG restore_sigcontext: " - "sc=%p err(%x) eip(%x) cs(%x) flg(%x)\n", - sc, sc->err, sc->ip, sc->cs, sc->flags); -#endif - - /* - * Reload fs and gs if they have changed in the signal - * handler. This does not handle long fs/gs base changes in - * the handler, but does not clobber them at least in the - * normal case. - */ - err |= __get_user(gs, &sc->gs); - gs |= 3; - savesegment(gs, oldgs); - if (gs != oldgs) - load_gs_index(gs); - - RELOAD_SEG(fs); - RELOAD_SEG(ds); - RELOAD_SEG(es); - - COPY(di); COPY(si); COPY(bp); COPY(sp); COPY(bx); - COPY(dx); COPY(cx); COPY(ip); - /* Don't touch extended registers */ - - COPY_SEG_CPL3(cs); - COPY_SEG_CPL3(ss); - - err |= __get_user(tmpflags, &sc->flags); - regs->flags = (regs->flags & ~FIX_EFLAGS) | (tmpflags & FIX_EFLAGS); - /* disable syscall checks */ - regs->orig_ax = -1; - - err |= __get_user(tmp, &sc->fpstate); - buf = compat_ptr(tmp); - err |= restore_i387_xstate_ia32(buf); + get_user_try { + /* + * Reload fs and gs if they have changed in the signal + * handler. This does not handle long fs/gs base changes in + * the handler, but does not clobber them at least in the + * normal case. + */ + RELOAD_SEG(gs); + RELOAD_SEG(fs); + RELOAD_SEG(ds); + RELOAD_SEG(es); + + COPY(di); COPY(si); COPY(bp); COPY(sp); COPY(bx); + COPY(dx); COPY(cx); COPY(ip); + /* Don't touch extended registers */ + + COPY_SEG_CPL3(cs); + COPY_SEG_CPL3(ss); + + get_user_ex(tmpflags, &sc->flags); + regs->flags = (regs->flags & ~FIX_EFLAGS) | (tmpflags & FIX_EFLAGS); + /* disable syscall checks */ + regs->orig_ax = -1; + + get_user_ex(tmp, &sc->fpstate); + buf = compat_ptr(tmp); + err |= restore_i387_xstate_ia32(buf); + + get_user_ex(*pax, &sc->ax); + } get_user_catch(err); - err |= __get_user(*pax, &sc->ax); return err; } @@ -317,38 +335,36 @@ static int ia32_setup_sigcontext(struct void __user *fpstate, struct pt_regs *regs, unsigned int mask) { - int tmp, err = 0; + int err = 0; - savesegment(gs, tmp); - err |= __put_user(tmp, (unsigned int __user *)&sc->gs); - savesegment(fs, tmp); - err |= __put_user(tmp, (unsigned int __user *)&sc->fs); - savesegment(ds, tmp); - err |= __put_user(tmp, (unsigned int __user *)&sc->ds); - savesegment(es, tmp); - err |= __put_user(tmp, (unsigned int __user *)&sc->es); - - err |= __put_user(regs->di, &sc->di); - err |= __put_user(regs->si, &sc->si); - err |= __put_user(regs->bp, &sc->bp); - err |= __put_user(regs->sp, &sc->sp); - err |= __put_user(regs->bx, &sc->bx); - err |= __put_user(regs->dx, &sc->dx); - err |= __put_user(regs->cx, &sc->cx); - err |= __put_user(regs->ax, &sc->ax); - err |= __put_user(current->thread.trap_no, &sc->trapno); - err |= __put_user(current->thread.error_code, &sc->err); - err |= __put_user(regs->ip, &sc->ip); - err |= __put_user(regs->cs, (unsigned int __user *)&sc->cs); - err |= __put_user(regs->flags, &sc->flags); - err |= __put_user(regs->sp, &sc->sp_at_signal); - err |= __put_user(regs->ss, (unsigned int __user *)&sc->ss); - - err |= __put_user(ptr_to_compat(fpstate), &sc->fpstate); - - /* non-iBCS2 extensions.. */ - err |= __put_user(mask, &sc->oldmask); - err |= __put_user(current->thread.cr2, &sc->cr2); + put_user_try { + put_user_ex(get_user_seg(gs), (unsigned int __user *)&sc->gs); + put_user_ex(get_user_seg(fs), (unsigned int __user *)&sc->fs); + put_user_ex(get_user_seg(ds), (unsigned int __user *)&sc->ds); + put_user_ex(get_user_seg(es), (unsigned int __user *)&sc->es); + + put_user_ex(regs->di, &sc->di); + put_user_ex(regs->si, &sc->si); + put_user_ex(regs->bp, &sc->bp); + put_user_ex(regs->sp, &sc->sp); + put_user_ex(regs->bx, &sc->bx); + put_user_ex(regs->dx, &sc->dx); + put_user_ex(regs->cx, &sc->cx); + put_user_ex(regs->ax, &sc->ax); + put_user_ex(current->thread.trap_no, &sc->trapno); + put_user_ex(current->thread.error_code, &sc->err); + put_user_ex(regs->ip, &sc->ip); + put_user_ex(regs->cs, (unsigned int __user *)&sc->cs); + put_user_ex(regs->flags, &sc->flags); + put_user_ex(regs->sp, &sc->sp_at_signal); + put_user_ex(regs->ss, (unsigned int __user *)&sc->ss); + + put_user_ex(ptr_to_compat(fpstate), &sc->fpstate); + + /* non-iBCS2 extensions.. */ + put_user_ex(mask, &sc->oldmask); + put_user_ex(current->thread.cr2, &sc->cr2); + } put_user_catch(err); return err; } @@ -437,13 +453,17 @@ int ia32_setup_frame(int sig, struct k_s else restorer = &frame->retcode; } - err |= __put_user(ptr_to_compat(restorer), &frame->pretcode); - /* - * These are actually not used anymore, but left because some - * gdb versions depend on them as a marker. - */ - err |= __put_user(*((u64 *)&code), (u64 *)frame->retcode); + put_user_try { + put_user_ex(ptr_to_compat(restorer), &frame->pretcode); + + /* + * These are actually not used anymore, but left because some + * gdb versions depend on them as a marker. + */ + put_user_ex(*((u64 *)&code), (u64 *)frame->retcode); + } put_user_catch(err); + if (err) return -EFAULT; @@ -462,11 +482,6 @@ int ia32_setup_frame(int sig, struct k_s regs->cs = __USER32_CS; regs->ss = __USER32_DS; -#if DEBUG_SIG - printk(KERN_DEBUG "SIG deliver (%s:%d): sp=%p pc=%lx ra=%u\n", - current->comm, current->pid, frame, regs->ip, frame->pretcode); -#endif - return 0; } @@ -496,41 +511,40 @@ int ia32_setup_rt_frame(int sig, struct if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) return -EFAULT; - err |= __put_user(sig, &frame->sig); - err |= __put_user(ptr_to_compat(&frame->info), &frame->pinfo); - err |= __put_user(ptr_to_compat(&frame->uc), &frame->puc); - err |= copy_siginfo_to_user32(&frame->info, info); - if (err) - return -EFAULT; + put_user_try { + put_user_ex(sig, &frame->sig); + put_user_ex(ptr_to_compat(&frame->info), &frame->pinfo); + put_user_ex(ptr_to_compat(&frame->uc), &frame->puc); + err |= copy_siginfo_to_user32(&frame->info, info); + + /* Create the ucontext. */ + if (cpu_has_xsave) + put_user_ex(UC_FP_XSTATE, &frame->uc.uc_flags); + else + put_user_ex(0, &frame->uc.uc_flags); + put_user_ex(0, &frame->uc.uc_link); + put_user_ex(current->sas_ss_sp, &frame->uc.uc_stack.ss_sp); + put_user_ex(sas_ss_flags(regs->sp), + &frame->uc.uc_stack.ss_flags); + put_user_ex(current->sas_ss_size, &frame->uc.uc_stack.ss_size); + err |= ia32_setup_sigcontext(&frame->uc.uc_mcontext, fpstate, + regs, set->sig[0]); + err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set)); - /* Create the ucontext. */ - if (cpu_has_xsave) - err |= __put_user(UC_FP_XSTATE, &frame->uc.uc_flags); - else - err |= __put_user(0, &frame->uc.uc_flags); - err |= __put_user(0, &frame->uc.uc_link); - err |= __put_user(current->sas_ss_sp, &frame->uc.uc_stack.ss_sp); - err |= __put_user(sas_ss_flags(regs->sp), - &frame->uc.uc_stack.ss_flags); - err |= __put_user(current->sas_ss_size, &frame->uc.uc_stack.ss_size); - err |= ia32_setup_sigcontext(&frame->uc.uc_mcontext, fpstate, - regs, set->sig[0]); - err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set)); - if (err) - return -EFAULT; + if (ka->sa.sa_flags & SA_RESTORER) + restorer = ka->sa.sa_restorer; + else + restorer = VDSO32_SYMBOL(current->mm->context.vdso, + rt_sigreturn); + put_user_ex(ptr_to_compat(restorer), &frame->pretcode); + + /* + * Not actually used anymore, but left because some gdb + * versions need it. + */ + put_user_ex(*((u64 *)&code), (u64 *)frame->retcode); + } put_user_catch(err); - if (ka->sa.sa_flags & SA_RESTORER) - restorer = ka->sa.sa_restorer; - else - restorer = VDSO32_SYMBOL(current->mm->context.vdso, - rt_sigreturn); - err |= __put_user(ptr_to_compat(restorer), &frame->pretcode); - - /* - * Not actually used anymore, but left because some gdb - * versions need it. - */ - err |= __put_user(*((u64 *)&code), (u64 *)frame->retcode); if (err) return -EFAULT; @@ -549,10 +563,5 @@ int ia32_setup_rt_frame(int sig, struct regs->cs = __USER32_CS; regs->ss = __USER32_DS; -#if DEBUG_SIG - printk(KERN_DEBUG "SIG deliver (%s:%d): sp=%p pc=%lx ra=%u\n", - current->comm, current->pid, frame, regs->ip, frame->pretcode); -#endif - return 0; } Index: linux-2.6-tip/arch/x86/ia32/ia32entry.S =================================================================== --- linux-2.6-tip.orig/arch/x86/ia32/ia32entry.S +++ linux-2.6-tip/arch/x86/ia32/ia32entry.S @@ -112,8 +112,8 @@ ENTRY(ia32_sysenter_target) CFI_DEF_CFA rsp,0 CFI_REGISTER rsp,rbp SWAPGS_UNSAFE_STACK - movq %gs:pda_kernelstack, %rsp - addq $(PDA_STACKOFFSET),%rsp + movq PER_CPU_VAR(kernel_stack), %rsp + addq $(KERNEL_STACK_OFFSET),%rsp /* * No need to follow this irqs on/off section: the syscall * disabled irqs, here we enable it straight after entry: @@ -273,13 +273,13 @@ ENDPROC(ia32_sysenter_target) ENTRY(ia32_cstar_target) CFI_STARTPROC32 simple CFI_SIGNAL_FRAME - CFI_DEF_CFA rsp,PDA_STACKOFFSET + CFI_DEF_CFA rsp,KERNEL_STACK_OFFSET CFI_REGISTER rip,rcx /*CFI_REGISTER rflags,r11*/ SWAPGS_UNSAFE_STACK movl %esp,%r8d CFI_REGISTER rsp,r8 - movq %gs:pda_kernelstack,%rsp + movq PER_CPU_VAR(kernel_stack),%rsp /* * No need to follow this irqs on/off section: the syscall * disabled irqs and here we enable it straight after entry: @@ -825,7 +825,8 @@ ia32_sys_call_table: .quad compat_sys_signalfd4 .quad sys_eventfd2 .quad sys_epoll_create1 - .quad sys_dup3 /* 330 */ + .quad sys_dup3 /* 330 */ .quad sys_pipe2 .quad sys_inotify_init1 + .quad sys_perf_counter_open ia32_syscall_end: Index: linux-2.6-tip/arch/x86/include/asm/a.out-core.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/a.out-core.h +++ linux-2.6-tip/arch/x86/include/asm/a.out-core.h @@ -55,7 +55,7 @@ static inline void aout_dump_thread(stru dump->regs.ds = (u16)regs->ds; dump->regs.es = (u16)regs->es; dump->regs.fs = (u16)regs->fs; - savesegment(gs, dump->regs.gs); + dump->regs.gs = get_user_gs(regs); dump->regs.orig_ax = regs->orig_ax; dump->regs.ip = regs->ip; dump->regs.cs = (u16)regs->cs; Index: linux-2.6-tip/arch/x86/include/asm/acpi.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/acpi.h +++ linux-2.6-tip/arch/x86/include/asm/acpi.h @@ -50,8 +50,8 @@ #define ACPI_ASM_MACROS #define BREAKPOINT3 -#define ACPI_DISABLE_IRQS() local_irq_disable() -#define ACPI_ENABLE_IRQS() local_irq_enable() +#define ACPI_DISABLE_IRQS() local_irq_disable_nort() +#define ACPI_ENABLE_IRQS() local_irq_enable_nort() #define ACPI_FLUSH_CPU_CACHE() wbinvd() int __acpi_acquire_global_lock(unsigned int *lock); @@ -102,9 +102,6 @@ static inline void disable_acpi(void) acpi_noirq = 1; } -/* Fixmap pages to reserve for ACPI boot-time tables (see fixmap.h) */ -#define FIX_ACPI_PAGES 4 - extern int acpi_gsi_to_irq(u32 gsi, unsigned int *irq); static inline void acpi_noirq_set(void) { acpi_noirq = 1; } Index: linux-2.6-tip/arch/x86/include/asm/apic.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/apic.h +++ linux-2.6-tip/arch/x86/include/asm/apic.h @@ -1,15 +1,18 @@ #ifndef _ASM_X86_APIC_H #define _ASM_X86_APIC_H -#include +#include #include +#include #include -#include -#include +#include #include +#include +#include +#include +#include #include -#include #include #define ARCH_APICTIMER_STOPS_ON_C3 1 @@ -33,7 +36,13 @@ } while (0) +#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86_32) extern void generic_apic_probe(void); +#else +static inline void generic_apic_probe(void) +{ +} +#endif #ifdef CONFIG_X86_LOCAL_APIC @@ -41,6 +50,21 @@ extern unsigned int apic_verbosity; extern int local_apic_timer_c2_ok; extern int disable_apic; + +#ifdef CONFIG_SMP +extern void __inquire_remote_apic(int apicid); +#else /* CONFIG_SMP */ +static inline void __inquire_remote_apic(int apicid) +{ +} +#endif /* CONFIG_SMP */ + +static inline void default_inquire_remote_apic(int apicid) +{ + if (apic_verbosity >= APIC_DEBUG) + __inquire_remote_apic(apicid); +} + /* * Basic functions accessing APICs. */ @@ -71,6 +95,12 @@ static inline u32 native_apic_mem_read(u return *((volatile u32 *)(APIC_BASE + reg)); } +extern void native_apic_wait_icr_idle(void); +extern u32 native_safe_apic_wait_icr_idle(void); +extern void native_apic_icr_write(u32 low, u32 id); +extern u64 native_apic_icr_read(void); + +#ifdef CONFIG_X86_X2APIC static inline void native_apic_msr_write(u32 reg, u32 v) { if (reg == APIC_DFR || reg == APIC_ID || reg == APIC_LDR || @@ -91,8 +121,32 @@ static inline u32 native_apic_msr_read(u return low; } -#ifndef CONFIG_X86_32 -extern int x2apic; +static inline void native_x2apic_wait_icr_idle(void) +{ + /* no need to wait for icr idle in x2apic */ + return; +} + +static inline u32 native_safe_x2apic_wait_icr_idle(void) +{ + /* no need to wait for icr idle in x2apic */ + return 0; +} + +static inline void native_x2apic_icr_write(u32 low, u32 id) +{ + wrmsrl(APIC_BASE_MSR + (APIC_ICR >> 4), ((__u64) id) << 32 | low); +} + +static inline u64 native_x2apic_icr_read(void) +{ + unsigned long val; + + rdmsrl(APIC_BASE_MSR + (APIC_ICR >> 4), val); + return val; +} + +extern int x2apic, x2apic_phys; extern void check_x2apic(void); extern void enable_x2apic(void); extern void enable_IR_x2apic(void); @@ -110,30 +164,24 @@ static inline int x2apic_enabled(void) return 0; } #else -#define x2apic_enabled() 0 +static inline void check_x2apic(void) +{ +} +static inline void enable_x2apic(void) +{ +} +static inline void enable_IR_x2apic(void) +{ +} +static inline int x2apic_enabled(void) +{ + return 0; +} #endif -struct apic_ops { - u32 (*read)(u32 reg); - void (*write)(u32 reg, u32 v); - u64 (*icr_read)(void); - void (*icr_write)(u32 low, u32 high); - void (*wait_icr_idle)(void); - u32 (*safe_wait_icr_idle)(void); -}; - -extern struct apic_ops *apic_ops; - -#define apic_read (apic_ops->read) -#define apic_write (apic_ops->write) -#define apic_icr_read (apic_ops->icr_read) -#define apic_icr_write (apic_ops->icr_write) -#define apic_wait_icr_idle (apic_ops->wait_icr_idle) -#define safe_apic_wait_icr_idle (apic_ops->safe_wait_icr_idle) - extern int get_physical_broadcast(void); -#ifdef CONFIG_X86_64 +#ifdef CONFIG_X86_X2APIC static inline void ack_x2APIC_irq(void) { /* Docs say use 0 for future compatibility */ @@ -141,18 +189,6 @@ static inline void ack_x2APIC_irq(void) } #endif - -static inline void ack_APIC_irq(void) -{ - /* - * ack_APIC_irq() actually gets compiled as a single instruction - * ... yummie. - */ - - /* Docs say use 0 for future compatibility */ - apic_write(APIC_EOI, 0); -} - extern int lapic_get_maxlvt(void); extern void clear_local_APIC(void); extern void connect_bsp_APIC(void); @@ -196,4 +232,316 @@ static inline void disable_local_APIC(vo #endif /* !CONFIG_X86_LOCAL_APIC */ +#ifdef CONFIG_X86_64 +#define SET_APIC_ID(x) (apic->set_apic_id(x)) +#else + +#endif + +/* + * Copyright 2004 James Cleverdon, IBM. + * Subject to the GNU Public License, v.2 + * + * Generic APIC sub-arch data struct. + * + * Hacked for x86-64 by James Cleverdon from i386 architecture code by + * Martin Bligh, Andi Kleen, James Bottomley, John Stultz, and + * James Cleverdon. + */ +struct apic { + char *name; + + int (*probe)(void); + int (*acpi_madt_oem_check)(char *oem_id, char *oem_table_id); + int (*apic_id_registered)(void); + + u32 irq_delivery_mode; + u32 irq_dest_mode; + + const struct cpumask *(*target_cpus)(void); + + int disable_esr; + + int dest_logical; + unsigned long (*check_apicid_used)(physid_mask_t bitmap, int apicid); + unsigned long (*check_apicid_present)(int apicid); + + void (*vector_allocation_domain)(int cpu, struct cpumask *retmask); + void (*init_apic_ldr)(void); + + physid_mask_t (*ioapic_phys_id_map)(physid_mask_t map); + + void (*setup_apic_routing)(void); + int (*multi_timer_check)(int apic, int irq); + int (*apicid_to_node)(int logical_apicid); + int (*cpu_to_logical_apicid)(int cpu); + int (*cpu_present_to_apicid)(int mps_cpu); + physid_mask_t (*apicid_to_cpu_present)(int phys_apicid); + void (*setup_portio_remap)(void); + int (*check_phys_apicid_present)(int boot_cpu_physical_apicid); + void (*enable_apic_mode)(void); + int (*phys_pkg_id)(int cpuid_apic, int index_msb); + + /* + * When one of the next two hooks returns 1 the apic + * is switched to this. Essentially they are additional + * probe functions: + */ + int (*mps_oem_check)(struct mpc_table *mpc, char *oem, char *productid); + + unsigned int (*get_apic_id)(unsigned long x); + unsigned long (*set_apic_id)(unsigned int id); + unsigned long apic_id_mask; + + unsigned int (*cpu_mask_to_apicid)(const struct cpumask *cpumask); + unsigned int (*cpu_mask_to_apicid_and)(const struct cpumask *cpumask, + const struct cpumask *andmask); + + /* ipi */ + void (*send_IPI_mask)(const struct cpumask *mask, int vector); + void (*send_IPI_mask_allbutself)(const struct cpumask *mask, + int vector); + void (*send_IPI_allbutself)(int vector); + void (*send_IPI_all)(int vector); + void (*send_IPI_self)(int vector); + + /* wakeup_secondary_cpu */ + int (*wakeup_cpu)(int apicid, unsigned long start_eip); + + int trampoline_phys_low; + int trampoline_phys_high; + + void (*wait_for_init_deassert)(atomic_t *deassert); + void (*smp_callin_clear_local_apic)(void); + void (*inquire_remote_apic)(int apicid); + + /* apic ops */ + u32 (*read)(u32 reg); + void (*write)(u32 reg, u32 v); + u64 (*icr_read)(void); + void (*icr_write)(u32 low, u32 high); + void (*wait_icr_idle)(void); + u32 (*safe_wait_icr_idle)(void); +}; + +extern struct apic *apic; + +static inline u32 apic_read(u32 reg) +{ + return apic->read(reg); +} + +static inline void apic_write(u32 reg, u32 val) +{ + apic->write(reg, val); +} + +static inline u64 apic_icr_read(void) +{ + return apic->icr_read(); +} + +static inline void apic_icr_write(u32 low, u32 high) +{ + apic->icr_write(low, high); +} + +static inline void apic_wait_icr_idle(void) +{ + apic->wait_icr_idle(); +} + +static inline u32 safe_apic_wait_icr_idle(void) +{ + return apic->safe_wait_icr_idle(); +} + + +static inline void ack_APIC_irq(void) +{ + /* + * ack_APIC_irq() actually gets compiled as a single instruction + * ... yummie. + */ + + /* Docs say use 0 for future compatibility */ + apic_write(APIC_EOI, 0); +} + +static inline unsigned default_get_apic_id(unsigned long x) +{ + unsigned int ver = GET_APIC_VERSION(apic_read(APIC_LVR)); + + if (APIC_XAPIC(ver)) + return (x >> 24) & 0xFF; + else + return (x >> 24) & 0x0F; +} + +/* + * Warm reset vector default position: + */ +#define DEFAULT_TRAMPOLINE_PHYS_LOW 0x467 +#define DEFAULT_TRAMPOLINE_PHYS_HIGH 0x469 + +#ifdef CONFIG_X86_32 +extern void es7000_update_apic_to_cluster(void); +#else +extern struct apic apic_flat; +extern struct apic apic_physflat; +extern struct apic apic_x2apic_cluster; +extern struct apic apic_x2apic_phys; +extern int default_acpi_madt_oem_check(char *, char *); + +extern void apic_send_IPI_self(int vector); + +extern struct apic apic_x2apic_uv_x; +DECLARE_PER_CPU(int, x2apic_extra_bits); + +extern int default_cpu_present_to_apicid(int mps_cpu); +extern int default_check_phys_apicid_present(int boot_cpu_physical_apicid); +#endif + +static inline void default_wait_for_init_deassert(atomic_t *deassert) +{ + while (!atomic_read(deassert)) + cpu_relax(); + return; +} + +extern void generic_bigsmp_probe(void); + + +#ifdef CONFIG_X86_LOCAL_APIC + +#include + +#define APIC_DFR_VALUE (APIC_DFR_FLAT) + +static inline const struct cpumask *default_target_cpus(void) +{ +#ifdef CONFIG_SMP + return cpu_online_mask; +#else + return cpumask_of(0); +#endif +} + +DECLARE_EARLY_PER_CPU(u16, x86_bios_cpu_apicid); + + +static inline unsigned int read_apic_id(void) +{ + unsigned int reg; + + reg = apic_read(APIC_ID); + + return apic->get_apic_id(reg); +} + +extern void default_setup_apic_routing(void); + +#ifdef CONFIG_X86_32 +/* + * Set up the logical destination ID. + * + * Intel recommends to set DFR, LDR and TPR before enabling + * an APIC. See e.g. "AP-388 82489DX User's Manual" (Intel + * document number 292116). So here it goes... + */ +extern void default_init_apic_ldr(void); + +static inline int default_apic_id_registered(void) +{ + return physid_isset(read_apic_id(), phys_cpu_present_map); +} + +static inline unsigned int +default_cpu_mask_to_apicid(const struct cpumask *cpumask) +{ + return cpumask_bits(cpumask)[0]; +} + +static inline unsigned int +default_cpu_mask_to_apicid_and(const struct cpumask *cpumask, + const struct cpumask *andmask) +{ + unsigned long mask1 = cpumask_bits(cpumask)[0]; + unsigned long mask2 = cpumask_bits(andmask)[0]; + unsigned long mask3 = cpumask_bits(cpu_online_mask)[0]; + + return (unsigned int)(mask1 & mask2 & mask3); +} + +static inline int default_phys_pkg_id(int cpuid_apic, int index_msb) +{ + return cpuid_apic >> index_msb; +} + +extern int default_apicid_to_node(int logical_apicid); + +#endif + +static inline unsigned long default_check_apicid_used(physid_mask_t bitmap, int apicid) +{ + return physid_isset(apicid, bitmap); +} + +static inline unsigned long default_check_apicid_present(int bit) +{ + return physid_isset(bit, phys_cpu_present_map); +} + +static inline physid_mask_t default_ioapic_phys_id_map(physid_mask_t phys_map) +{ + return phys_map; +} + +/* Mapping from cpu number to logical apicid */ +static inline int default_cpu_to_logical_apicid(int cpu) +{ + return 1 << cpu; +} + +static inline int __default_cpu_present_to_apicid(int mps_cpu) +{ + if (mps_cpu < nr_cpu_ids && cpu_present(mps_cpu)) + return (int)per_cpu(x86_bios_cpu_apicid, mps_cpu); + else + return BAD_APICID; +} + +static inline int +__default_check_phys_apicid_present(int boot_cpu_physical_apicid) +{ + return physid_isset(boot_cpu_physical_apicid, phys_cpu_present_map); +} + +#ifdef CONFIG_X86_32 +static inline int default_cpu_present_to_apicid(int mps_cpu) +{ + return __default_cpu_present_to_apicid(mps_cpu); +} + +static inline int +default_check_phys_apicid_present(int boot_cpu_physical_apicid) +{ + return __default_check_phys_apicid_present(boot_cpu_physical_apicid); +} +#else +extern int default_cpu_present_to_apicid(int mps_cpu); +extern int default_check_phys_apicid_present(int boot_cpu_physical_apicid); +#endif + +static inline physid_mask_t default_apicid_to_cpu_present(int phys_apicid) +{ + return physid_mask_of_physid(phys_apicid); +} + +#endif /* CONFIG_X86_LOCAL_APIC */ + +#ifdef CONFIG_X86_32 +extern u8 cpu_2_logical_apicid[NR_CPUS]; +#endif + #endif /* _ASM_X86_APIC_H */ Index: linux-2.6-tip/arch/x86/include/asm/apicnum.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/apicnum.h @@ -0,0 +1,12 @@ +#ifndef _ASM_X86_APICNUM_H +#define _ASM_X86_APICNUM_H + +/* define MAX_IO_APICS */ +#ifdef CONFIG_X86_32 +# define MAX_IO_APICS 64 +#else +# define MAX_IO_APICS 128 +# define MAX_LOCAL_APIC 32768 +#endif + +#endif /* _ASM_X86_APICNUM_H */ Index: linux-2.6-tip/arch/x86/include/asm/apm.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/apm.h @@ -0,0 +1,73 @@ +/* + * Machine specific APM BIOS functions for generic. + * Split out from apm.c by Osamu Tomita + */ + +#ifndef _ASM_X86_MACH_DEFAULT_APM_H +#define _ASM_X86_MACH_DEFAULT_APM_H + +#ifdef APM_ZERO_SEGS +# define APM_DO_ZERO_SEGS \ + "pushl %%ds\n\t" \ + "pushl %%es\n\t" \ + "xorl %%edx, %%edx\n\t" \ + "mov %%dx, %%ds\n\t" \ + "mov %%dx, %%es\n\t" \ + "mov %%dx, %%fs\n\t" \ + "mov %%dx, %%gs\n\t" +# define APM_DO_POP_SEGS \ + "popl %%es\n\t" \ + "popl %%ds\n\t" +#else +# define APM_DO_ZERO_SEGS +# define APM_DO_POP_SEGS +#endif + +static inline void apm_bios_call_asm(u32 func, u32 ebx_in, u32 ecx_in, + u32 *eax, u32 *ebx, u32 *ecx, + u32 *edx, u32 *esi) +{ + /* + * N.B. We do NOT need a cld after the BIOS call + * because we always save and restore the flags. + */ + __asm__ __volatile__(APM_DO_ZERO_SEGS + "pushl %%edi\n\t" + "pushl %%ebp\n\t" + "lcall *%%cs:apm_bios_entry\n\t" + "setc %%al\n\t" + "popl %%ebp\n\t" + "popl %%edi\n\t" + APM_DO_POP_SEGS + : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx), + "=S" (*esi) + : "a" (func), "b" (ebx_in), "c" (ecx_in) + : "memory", "cc"); +} + +static inline u8 apm_bios_call_simple_asm(u32 func, u32 ebx_in, + u32 ecx_in, u32 *eax) +{ + int cx, dx, si; + u8 error; + + /* + * N.B. We do NOT need a cld after the BIOS call + * because we always save and restore the flags. + */ + __asm__ __volatile__(APM_DO_ZERO_SEGS + "pushl %%edi\n\t" + "pushl %%ebp\n\t" + "lcall *%%cs:apm_bios_entry\n\t" + "setc %%bl\n\t" + "popl %%ebp\n\t" + "popl %%edi\n\t" + APM_DO_POP_SEGS + : "=a" (*eax), "=b" (error), "=c" (cx), "=d" (dx), + "=S" (si) + : "a" (func), "b" (ebx_in), "c" (ecx_in) + : "memory", "cc"); + return error; +} + +#endif /* _ASM_X86_MACH_DEFAULT_APM_H */ Index: linux-2.6-tip/arch/x86/include/asm/arch_hooks.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/arch_hooks.h +++ /dev/null @@ -1,26 +0,0 @@ -#ifndef _ASM_X86_ARCH_HOOKS_H -#define _ASM_X86_ARCH_HOOKS_H - -#include - -/* - * linux/include/asm/arch_hooks.h - * - * define the architecture specific hooks - */ - -/* these aren't arch hooks, they are generic routines - * that can be used by the hooks */ -extern void init_ISA_irqs(void); -extern irqreturn_t timer_interrupt(int irq, void *dev_id); - -/* these are the defined hooks */ -extern void intr_init_hook(void); -extern void pre_intr_init_hook(void); -extern void pre_setup_arch_hook(void); -extern void trap_init_hook(void); -extern void pre_time_init_hook(void); -extern void time_init_hook(void); -extern void mca_nmi_hook(void); - -#endif /* _ASM_X86_ARCH_HOOKS_H */ Index: linux-2.6-tip/arch/x86/include/asm/atomic_32.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/atomic_32.h +++ linux-2.6-tip/arch/x86/include/asm/atomic_32.h @@ -180,10 +180,10 @@ static inline int atomic_add_return(int #ifdef CONFIG_M386 no_xadd: /* Legacy 386 processor */ - local_irq_save(flags); + raw_local_irq_save(flags); __i = atomic_read(v); atomic_set(v, i + __i); - local_irq_restore(flags); + raw_local_irq_restore(flags); return i + __i; #endif } @@ -247,5 +247,223 @@ static inline int atomic_add_unless(atom #define smp_mb__before_atomic_inc() barrier() #define smp_mb__after_atomic_inc() barrier() +/* An 64bit atomic type */ + +typedef struct { + unsigned long long counter; +} atomic64_t; + +#define ATOMIC64_INIT(val) { (val) } + +/** + * atomic64_read - read atomic64 variable + * @v: pointer of type atomic64_t + * + * Atomically reads the value of @v. + * Doesn't imply a read memory barrier. + */ +#define __atomic64_read(ptr) ((ptr)->counter) + +static inline unsigned long long +cmpxchg8b(unsigned long long *ptr, unsigned long long old, unsigned long long new) +{ + asm volatile( + + LOCK_PREFIX "cmpxchg8b (%[ptr])\n" + + : "=A" (old) + + : [ptr] "D" (ptr), + "A" (old), + "b" (ll_low(new)), + "c" (ll_high(new)) + + : "memory"); + + return old; +} + +static inline unsigned long long +atomic64_cmpxchg(atomic64_t *ptr, unsigned long long old_val, + unsigned long long new_val) +{ + return cmpxchg8b(&ptr->counter, old_val, new_val); +} + +/** + * atomic64_set - set atomic64 variable + * @ptr: pointer to type atomic64_t + * @new_val: value to assign + * + * Atomically sets the value of @ptr to @new_val. + */ +static inline void atomic64_set(atomic64_t *ptr, unsigned long long new_val) +{ + unsigned long long old_val; + + do { + old_val = atomic_read(ptr); + } while (atomic64_cmpxchg(ptr, old_val, new_val) != old_val); +} + +/** + * atomic64_read - read atomic64 variable + * @ptr: pointer to type atomic64_t + * + * Atomically reads the value of @ptr and returns it. + */ +static inline unsigned long long atomic64_read(atomic64_t *ptr) +{ + unsigned long long curr_val; + + do { + curr_val = __atomic64_read(ptr); + } while (atomic64_cmpxchg(ptr, curr_val, curr_val) != curr_val); + + return curr_val; +} + +/** + * atomic64_add_return - add and return + * @delta: integer value to add + * @ptr: pointer to type atomic64_t + * + * Atomically adds @delta to @ptr and returns @delta + *@ptr + */ +static inline unsigned long long +atomic64_add_return(unsigned long long delta, atomic64_t *ptr) +{ + unsigned long long old_val, new_val; + + do { + old_val = atomic_read(ptr); + new_val = old_val + delta; + + } while (atomic64_cmpxchg(ptr, old_val, new_val) != old_val); + + return new_val; +} + +static inline long atomic64_sub_return(unsigned long long delta, atomic64_t *ptr) +{ + return atomic64_add_return(-delta, ptr); +} + +static inline long atomic64_inc_return(atomic64_t *ptr) +{ + return atomic64_add_return(1, ptr); +} + +static inline long atomic64_dec_return(atomic64_t *ptr) +{ + return atomic64_sub_return(1, ptr); +} + +/** + * atomic64_add - add integer to atomic64 variable + * @delta: integer value to add + * @ptr: pointer to type atomic64_t + * + * Atomically adds @delta to @ptr. + */ +static inline void atomic64_add(unsigned long long delta, atomic64_t *ptr) +{ + atomic64_add_return(delta, ptr); +} + +/** + * atomic64_sub - subtract the atomic64 variable + * @delta: integer value to subtract + * @ptr: pointer to type atomic64_t + * + * Atomically subtracts @delta from @ptr. + */ +static inline void atomic64_sub(unsigned long long delta, atomic64_t *ptr) +{ + atomic64_add(-delta, ptr); +} + +/** + * atomic64_sub_and_test - subtract value from variable and test result + * @delta: integer value to subtract + * @ptr: pointer to type atomic64_t + * + * Atomically subtracts @delta from @ptr and returns + * true if the result is zero, or false for all + * other cases. + */ +static inline int +atomic64_sub_and_test(unsigned long long delta, atomic64_t *ptr) +{ + unsigned long long old_val = atomic64_sub_return(delta, ptr); + + return old_val == 0; +} + +/** + * atomic64_inc - increment atomic64 variable + * @ptr: pointer to type atomic64_t + * + * Atomically increments @ptr by 1. + */ +static inline void atomic64_inc(atomic64_t *ptr) +{ + atomic64_add(1, ptr); +} + +/** + * atomic64_dec - decrement atomic64 variable + * @ptr: pointer to type atomic64_t + * + * Atomically decrements @ptr by 1. + */ +static inline void atomic64_dec(atomic64_t *ptr) +{ + atomic64_sub(1, ptr); +} + +/** + * atomic64_dec_and_test - decrement and test + * @ptr: pointer to type atomic64_t + * + * Atomically decrements @ptr by 1 and + * returns true if the result is 0, or false for all other + * cases. + */ +static inline int atomic64_dec_and_test(atomic64_t *ptr) +{ + return atomic64_sub_and_test(1, ptr); +} + +/** + * atomic64_inc_and_test - increment and test + * @ptr: pointer to type atomic64_t + * + * Atomically increments @ptr by 1 + * and returns true if the result is zero, or false for all + * other cases. + */ +static inline int atomic64_inc_and_test(atomic64_t *ptr) +{ + return atomic64_sub_and_test(-1, ptr); +} + +/** + * atomic64_add_negative - add and test if negative + * @delta: integer value to add + * @ptr: pointer to type atomic64_t + * + * Atomically adds @delta to @ptr and returns true + * if the result is negative, or false when + * result is greater than or equal to zero. + */ +static inline int +atomic64_add_negative(unsigned long long delta, atomic64_t *ptr) +{ + long long old_val = atomic64_add_return(delta, ptr); + + return old_val < 0; +} + #include #endif /* _ASM_X86_ATOMIC_32_H */ Index: linux-2.6-tip/arch/x86/include/asm/bigsmp/apic.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/bigsmp/apic.h +++ /dev/null @@ -1,155 +0,0 @@ -#ifndef __ASM_MACH_APIC_H -#define __ASM_MACH_APIC_H - -#define xapic_phys_to_log_apicid(cpu) (per_cpu(x86_bios_cpu_apicid, cpu)) -#define esr_disable (1) - -static inline int apic_id_registered(void) -{ - return (1); -} - -static inline const cpumask_t *target_cpus(void) -{ -#ifdef CONFIG_SMP - return &cpu_online_map; -#else - return &cpumask_of_cpu(0); -#endif -} - -#undef APIC_DEST_LOGICAL -#define APIC_DEST_LOGICAL 0 -#define APIC_DFR_VALUE (APIC_DFR_FLAT) -#define INT_DELIVERY_MODE (dest_Fixed) -#define INT_DEST_MODE (0) /* phys delivery to target proc */ -#define NO_BALANCE_IRQ (0) - -static inline unsigned long check_apicid_used(physid_mask_t bitmap, int apicid) -{ - return (0); -} - -static inline unsigned long check_apicid_present(int bit) -{ - return (1); -} - -static inline unsigned long calculate_ldr(int cpu) -{ - unsigned long val, id; - val = apic_read(APIC_LDR) & ~APIC_LDR_MASK; - id = xapic_phys_to_log_apicid(cpu); - val |= SET_APIC_LOGICAL_ID(id); - return val; -} - -/* - * Set up the logical destination ID. - * - * Intel recommends to set DFR, LDR and TPR before enabling - * an APIC. See e.g. "AP-388 82489DX User's Manual" (Intel - * document number 292116). So here it goes... - */ -static inline void init_apic_ldr(void) -{ - unsigned long val; - int cpu = smp_processor_id(); - - apic_write(APIC_DFR, APIC_DFR_VALUE); - val = calculate_ldr(cpu); - apic_write(APIC_LDR, val); -} - -static inline void setup_apic_routing(void) -{ - printk("Enabling APIC mode: %s. Using %d I/O APICs\n", - "Physflat", nr_ioapics); -} - -static inline int multi_timer_check(int apic, int irq) -{ - return (0); -} - -static inline int apicid_to_node(int logical_apicid) -{ - return apicid_2_node[hard_smp_processor_id()]; -} - -static inline int cpu_present_to_apicid(int mps_cpu) -{ - if (mps_cpu < nr_cpu_ids) - return (int) per_cpu(x86_bios_cpu_apicid, mps_cpu); - - return BAD_APICID; -} - -static inline physid_mask_t apicid_to_cpu_present(int phys_apicid) -{ - return physid_mask_of_physid(phys_apicid); -} - -extern u8 cpu_2_logical_apicid[]; -/* Mapping from cpu number to logical apicid */ -static inline int cpu_to_logical_apicid(int cpu) -{ - if (cpu >= nr_cpu_ids) - return BAD_APICID; - return cpu_physical_id(cpu); -} - -static inline physid_mask_t ioapic_phys_id_map(physid_mask_t phys_map) -{ - /* For clustered we don't have a good way to do this yet - hack */ - return physids_promote(0xFFL); -} - -static inline void setup_portio_remap(void) -{ -} - -static inline void enable_apic_mode(void) -{ -} - -static inline int check_phys_apicid_present(int boot_cpu_physical_apicid) -{ - return (1); -} - -/* As we are using single CPU as destination, pick only one CPU here */ -static inline unsigned int cpu_mask_to_apicid(const cpumask_t *cpumask) -{ - int cpu; - int apicid; - - cpu = first_cpu(*cpumask); - apicid = cpu_to_logical_apicid(cpu); - return apicid; -} - -static inline unsigned int cpu_mask_to_apicid_and(const struct cpumask *cpumask, - const struct cpumask *andmask) -{ - int cpu; - - /* - * We're using fixed IRQ delivery, can only return one phys APIC ID. - * May as well be the first. - */ - for_each_cpu_and(cpu, cpumask, andmask) - if (cpumask_test_cpu(cpu, cpu_online_mask)) - break; - if (cpu < nr_cpu_ids) - return cpu_to_logical_apicid(cpu); - - return BAD_APICID; -} - -static inline u32 phys_pkg_id(u32 cpuid_apic, int index_msb) -{ - return cpuid_apic >> index_msb; -} - -#endif /* __ASM_MACH_APIC_H */ Index: linux-2.6-tip/arch/x86/include/asm/bigsmp/apicdef.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/bigsmp/apicdef.h +++ /dev/null @@ -1,13 +0,0 @@ -#ifndef __ASM_MACH_APICDEF_H -#define __ASM_MACH_APICDEF_H - -#define APIC_ID_MASK (0xFF<<24) - -static inline unsigned get_apic_id(unsigned long x) -{ - return (((x)>>24)&0xFF); -} - -#define GET_APIC_ID(x) get_apic_id(x) - -#endif Index: linux-2.6-tip/arch/x86/include/asm/bigsmp/ipi.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/bigsmp/ipi.h +++ /dev/null @@ -1,22 +0,0 @@ -#ifndef __ASM_MACH_IPI_H -#define __ASM_MACH_IPI_H - -void send_IPI_mask_sequence(const struct cpumask *mask, int vector); -void send_IPI_mask_allbutself(const struct cpumask *mask, int vector); - -static inline void send_IPI_mask(const struct cpumask *mask, int vector) -{ - send_IPI_mask_sequence(mask, vector); -} - -static inline void send_IPI_allbutself(int vector) -{ - send_IPI_mask_allbutself(cpu_online_mask, vector); -} - -static inline void send_IPI_all(int vector) -{ - send_IPI_mask(cpu_online_mask, vector); -} - -#endif /* __ASM_MACH_IPI_H */ Index: linux-2.6-tip/arch/x86/include/asm/cacheflush.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/cacheflush.h +++ linux-2.6-tip/arch/x86/include/asm/cacheflush.h @@ -104,6 +104,11 @@ void clflush_cache_range(void *addr, uns #ifdef CONFIG_DEBUG_RODATA void mark_rodata_ro(void); extern const int rodata_test_data; +void set_kernel_text_rw(void); +void set_kernel_text_ro(void); +#else +static inline void set_kernel_text_rw(void) { } +static inline void set_kernel_text_ro(void) { } #endif #ifdef CONFIG_DEBUG_RODATA_TEST Index: linux-2.6-tip/arch/x86/include/asm/calling.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/calling.h +++ linux-2.6-tip/arch/x86/include/asm/calling.h @@ -1,5 +1,55 @@ /* - * Some macros to handle stack frames in assembly. + + x86 function call convention, 64-bit: + ------------------------------------- + arguments | callee-saved | extra caller-saved | return + [callee-clobbered] | | [callee-clobbered] | + --------------------------------------------------------------------------- + rdi rsi rdx rcx r8-9 | rbx rbp [*] r12-15 | r10-11 | rax, rdx [**] + + ( rsp is obviously invariant across normal function calls. (gcc can 'merge' + functions when it sees tail-call optimization possibilities) rflags is + clobbered. Leftover arguments are passed over the stack frame.) + + [*] In the frame-pointers case rbp is fixed to the stack frame. + + [**] for struct return values wider than 64 bits the return convention is a + bit more complex: up to 128 bits width we return small structures + straight in rax, rdx. For structures larger than that (3 words or + larger) the caller puts a pointer to an on-stack return struct + [allocated in the caller's stack frame] into the first argument - i.e. + into rdi. All other arguments shift up by one in this case. + Fortunately this case is rare in the kernel. + +For 32-bit we have the following conventions - kernel is built with +-mregparm=3 and -freg-struct-return: + + x86 function calling convention, 32-bit: + ---------------------------------------- + arguments | callee-saved | extra caller-saved | return + [callee-clobbered] | | [callee-clobbered] | + ------------------------------------------------------------------------- + eax edx ecx | ebx edi esi ebp [*] | | eax, edx [**] + + ( here too esp is obviously invariant across normal function calls. eflags + is clobbered. Leftover arguments are passed over the stack frame. ) + + [*] In the frame-pointers case ebp is fixed to the stack frame. + + [**] We build with -freg-struct-return, which on 32-bit means similar + semantics as on 64-bit: edx can be used for a second return value + (i.e. covering integer and structure sizes up to 64 bits) - after that + it gets more complex and more expensive: 3-word or larger struct returns + get done in the caller's frame and the pointer to the return struct goes + into regparm0, i.e. eax - the other arguments shift up and the + function's register parameters degenerate to regparm=2 in essence. + +*/ + + +/* + * 64-bit system call stack frame layout defines and helpers, + * for assembly code: */ #define R15 0 @@ -9,7 +59,7 @@ #define RBP 32 #define RBX 40 -/* arguments: interrupts/non tracing syscalls only save upto here*/ +/* arguments: interrupts/non tracing syscalls only save up to here: */ #define R11 48 #define R10 56 #define R9 64 @@ -22,7 +72,7 @@ #define ORIG_RAX 120 /* + error_code */ /* end of arguments */ -/* cpu exception frame or undefined in case of fast syscall. */ +/* cpu exception frame or undefined in case of fast syscall: */ #define RIP 128 #define CS 136 #define EFLAGS 144 Index: linux-2.6-tip/arch/x86/include/asm/cpu.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/cpu.h +++ linux-2.6-tip/arch/x86/include/asm/cpu.h @@ -7,6 +7,20 @@ #include #include +#ifdef CONFIG_SMP + +extern void prefill_possible_map(void); + +#else /* CONFIG_SMP */ + +static inline void prefill_possible_map(void) {} + +#define cpu_physical_id(cpu) boot_cpu_physical_apicid +#define safe_smp_processor_id() 0 +#define stack_smp_processor_id() 0 + +#endif /* CONFIG_SMP */ + struct x86_cpu { struct cpu cpu; }; @@ -17,4 +31,7 @@ extern void arch_unregister_cpu(int); #endif DECLARE_PER_CPU(int, cpu_state); + +extern unsigned int boot_cpu_id; + #endif /* _ASM_X86_CPU_H */ Index: linux-2.6-tip/arch/x86/include/asm/cpumask.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/cpumask.h @@ -0,0 +1,32 @@ +#ifndef _ASM_X86_CPUMASK_H +#define _ASM_X86_CPUMASK_H +#ifndef __ASSEMBLY__ +#include + +#ifdef CONFIG_X86_64 + +extern cpumask_var_t cpu_callin_mask; +extern cpumask_var_t cpu_callout_mask; +extern cpumask_var_t cpu_initialized_mask; +extern cpumask_var_t cpu_sibling_setup_mask; + +extern void setup_cpu_local_masks(void); + +#else /* CONFIG_X86_32 */ + +extern cpumask_t cpu_callin_map; +extern cpumask_t cpu_callout_map; +extern cpumask_t cpu_initialized; +extern cpumask_t cpu_sibling_setup_map; + +#define cpu_callin_mask ((struct cpumask *)&cpu_callin_map) +#define cpu_callout_mask ((struct cpumask *)&cpu_callout_map) +#define cpu_initialized_mask ((struct cpumask *)&cpu_initialized) +#define cpu_sibling_setup_mask ((struct cpumask *)&cpu_sibling_setup_map) + +static inline void setup_cpu_local_masks(void) { } + +#endif /* CONFIG_X86_32 */ + +#endif /* __ASSEMBLY__ */ +#endif /* _ASM_X86_CPUMASK_H */ Index: linux-2.6-tip/arch/x86/include/asm/current.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/current.h +++ linux-2.6-tip/arch/x86/include/asm/current.h @@ -1,39 +1,21 @@ #ifndef _ASM_X86_CURRENT_H #define _ASM_X86_CURRENT_H -#ifdef CONFIG_X86_32 #include #include +#ifndef __ASSEMBLY__ struct task_struct; DECLARE_PER_CPU(struct task_struct *, current_task); -static __always_inline struct task_struct *get_current(void) -{ - return x86_read_percpu(current_task); -} - -#else /* X86_32 */ - -#ifndef __ASSEMBLY__ -#include - -struct task_struct; static __always_inline struct task_struct *get_current(void) { - return read_pda(pcurrent); + return percpu_read(current_task); } -#else /* __ASSEMBLY__ */ - -#include -#define GET_CURRENT(reg) movq %gs:(pda_pcurrent),reg +#define current get_current() #endif /* __ASSEMBLY__ */ -#endif /* X86_32 */ - -#define current get_current() - #endif /* _ASM_X86_CURRENT_H */ Index: linux-2.6-tip/arch/x86/include/asm/device.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/device.h +++ linux-2.6-tip/arch/x86/include/asm/device.h @@ -6,7 +6,7 @@ struct dev_archdata { void *acpi_handle; #endif #ifdef CONFIG_X86_64 -struct dma_mapping_ops *dma_ops; +struct dma_map_ops *dma_ops; #endif #ifdef CONFIG_DMAR void *iommu; /* hook for IOMMU specific extension */ Index: linux-2.6-tip/arch/x86/include/asm/dma-mapping.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/dma-mapping.h +++ linux-2.6-tip/arch/x86/include/asm/dma-mapping.h @@ -6,7 +6,9 @@ * Documentation/DMA-API.txt for documentation. */ +#include #include +#include #include #include #include @@ -16,47 +18,9 @@ extern int iommu_merge; extern struct device x86_dma_fallback_dev; extern int panic_on_overflow; -struct dma_mapping_ops { - int (*mapping_error)(struct device *dev, - dma_addr_t dma_addr); - void* (*alloc_coherent)(struct device *dev, size_t size, - dma_addr_t *dma_handle, gfp_t gfp); - void (*free_coherent)(struct device *dev, size_t size, - void *vaddr, dma_addr_t dma_handle); - dma_addr_t (*map_single)(struct device *hwdev, phys_addr_t ptr, - size_t size, int direction); - void (*unmap_single)(struct device *dev, dma_addr_t addr, - size_t size, int direction); - void (*sync_single_for_cpu)(struct device *hwdev, - dma_addr_t dma_handle, size_t size, - int direction); - void (*sync_single_for_device)(struct device *hwdev, - dma_addr_t dma_handle, size_t size, - int direction); - void (*sync_single_range_for_cpu)(struct device *hwdev, - dma_addr_t dma_handle, unsigned long offset, - size_t size, int direction); - void (*sync_single_range_for_device)(struct device *hwdev, - dma_addr_t dma_handle, unsigned long offset, - size_t size, int direction); - void (*sync_sg_for_cpu)(struct device *hwdev, - struct scatterlist *sg, int nelems, - int direction); - void (*sync_sg_for_device)(struct device *hwdev, - struct scatterlist *sg, int nelems, - int direction); - int (*map_sg)(struct device *hwdev, struct scatterlist *sg, - int nents, int direction); - void (*unmap_sg)(struct device *hwdev, - struct scatterlist *sg, int nents, - int direction); - int (*dma_supported)(struct device *hwdev, u64 mask); - int is_phys; -}; +extern struct dma_map_ops *dma_ops; -extern struct dma_mapping_ops *dma_ops; - -static inline struct dma_mapping_ops *get_dma_ops(struct device *dev) +static inline struct dma_map_ops *get_dma_ops(struct device *dev) { #ifdef CONFIG_X86_32 return dma_ops; @@ -71,7 +35,7 @@ static inline struct dma_mapping_ops *ge /* Make sure we keep the same behaviour */ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr) { - struct dma_mapping_ops *ops = get_dma_ops(dev); + struct dma_map_ops *ops = get_dma_ops(dev); if (ops->mapping_error) return ops->mapping_error(dev, dma_addr); @@ -90,137 +54,146 @@ extern void *dma_generic_alloc_coherent( static inline dma_addr_t dma_map_single(struct device *hwdev, void *ptr, size_t size, - int direction) + enum dma_data_direction dir) { - struct dma_mapping_ops *ops = get_dma_ops(hwdev); + struct dma_map_ops *ops = get_dma_ops(hwdev); - BUG_ON(!valid_dma_direction(direction)); - return ops->map_single(hwdev, virt_to_phys(ptr), size, direction); + BUG_ON(!valid_dma_direction(dir)); + kmemcheck_mark_initialized(ptr, size); + return ops->map_page(hwdev, virt_to_page(ptr), + (unsigned long)ptr & ~PAGE_MASK, size, + dir, NULL); } static inline void dma_unmap_single(struct device *dev, dma_addr_t addr, size_t size, - int direction) + enum dma_data_direction dir) { - struct dma_mapping_ops *ops = get_dma_ops(dev); + struct dma_map_ops *ops = get_dma_ops(dev); - BUG_ON(!valid_dma_direction(direction)); - if (ops->unmap_single) - ops->unmap_single(dev, addr, size, direction); + BUG_ON(!valid_dma_direction(dir)); + if (ops->unmap_page) + ops->unmap_page(dev, addr, size, dir, NULL); } static inline int dma_map_sg(struct device *hwdev, struct scatterlist *sg, - int nents, int direction) + int nents, enum dma_data_direction dir) { - struct dma_mapping_ops *ops = get_dma_ops(hwdev); + struct dma_map_ops *ops = get_dma_ops(hwdev); + + struct scatterlist *s; + int i; - BUG_ON(!valid_dma_direction(direction)); - return ops->map_sg(hwdev, sg, nents, direction); + for_each_sg(sg, s, nents, i) + kmemcheck_mark_initialized(sg_virt(s), s->length); + BUG_ON(!valid_dma_direction(dir)); + return ops->map_sg(hwdev, sg, nents, dir, NULL); } static inline void dma_unmap_sg(struct device *hwdev, struct scatterlist *sg, int nents, - int direction) + enum dma_data_direction dir) { - struct dma_mapping_ops *ops = get_dma_ops(hwdev); + struct dma_map_ops *ops = get_dma_ops(hwdev); - BUG_ON(!valid_dma_direction(direction)); + BUG_ON(!valid_dma_direction(dir)); if (ops->unmap_sg) - ops->unmap_sg(hwdev, sg, nents, direction); + ops->unmap_sg(hwdev, sg, nents, dir, NULL); } static inline void dma_sync_single_for_cpu(struct device *hwdev, dma_addr_t dma_handle, - size_t size, int direction) + size_t size, enum dma_data_direction dir) { - struct dma_mapping_ops *ops = get_dma_ops(hwdev); + struct dma_map_ops *ops = get_dma_ops(hwdev); - BUG_ON(!valid_dma_direction(direction)); + BUG_ON(!valid_dma_direction(dir)); if (ops->sync_single_for_cpu) - ops->sync_single_for_cpu(hwdev, dma_handle, size, direction); + ops->sync_single_for_cpu(hwdev, dma_handle, size, dir); flush_write_buffers(); } static inline void dma_sync_single_for_device(struct device *hwdev, dma_addr_t dma_handle, - size_t size, int direction) + size_t size, enum dma_data_direction dir) { - struct dma_mapping_ops *ops = get_dma_ops(hwdev); + struct dma_map_ops *ops = get_dma_ops(hwdev); - BUG_ON(!valid_dma_direction(direction)); + BUG_ON(!valid_dma_direction(dir)); if (ops->sync_single_for_device) - ops->sync_single_for_device(hwdev, dma_handle, size, direction); + ops->sync_single_for_device(hwdev, dma_handle, size, dir); flush_write_buffers(); } static inline void dma_sync_single_range_for_cpu(struct device *hwdev, dma_addr_t dma_handle, - unsigned long offset, size_t size, int direction) + unsigned long offset, size_t size, + enum dma_data_direction dir) { - struct dma_mapping_ops *ops = get_dma_ops(hwdev); + struct dma_map_ops *ops = get_dma_ops(hwdev); - BUG_ON(!valid_dma_direction(direction)); + BUG_ON(!valid_dma_direction(dir)); if (ops->sync_single_range_for_cpu) ops->sync_single_range_for_cpu(hwdev, dma_handle, offset, - size, direction); + size, dir); flush_write_buffers(); } static inline void dma_sync_single_range_for_device(struct device *hwdev, dma_addr_t dma_handle, unsigned long offset, size_t size, - int direction) + enum dma_data_direction dir) { - struct dma_mapping_ops *ops = get_dma_ops(hwdev); + struct dma_map_ops *ops = get_dma_ops(hwdev); - BUG_ON(!valid_dma_direction(direction)); + BUG_ON(!valid_dma_direction(dir)); if (ops->sync_single_range_for_device) ops->sync_single_range_for_device(hwdev, dma_handle, - offset, size, direction); + offset, size, dir); flush_write_buffers(); } static inline void dma_sync_sg_for_cpu(struct device *hwdev, struct scatterlist *sg, - int nelems, int direction) + int nelems, enum dma_data_direction dir) { - struct dma_mapping_ops *ops = get_dma_ops(hwdev); + struct dma_map_ops *ops = get_dma_ops(hwdev); - BUG_ON(!valid_dma_direction(direction)); + BUG_ON(!valid_dma_direction(dir)); if (ops->sync_sg_for_cpu) - ops->sync_sg_for_cpu(hwdev, sg, nelems, direction); + ops->sync_sg_for_cpu(hwdev, sg, nelems, dir); flush_write_buffers(); } static inline void dma_sync_sg_for_device(struct device *hwdev, struct scatterlist *sg, - int nelems, int direction) + int nelems, enum dma_data_direction dir) { - struct dma_mapping_ops *ops = get_dma_ops(hwdev); + struct dma_map_ops *ops = get_dma_ops(hwdev); - BUG_ON(!valid_dma_direction(direction)); + BUG_ON(!valid_dma_direction(dir)); if (ops->sync_sg_for_device) - ops->sync_sg_for_device(hwdev, sg, nelems, direction); + ops->sync_sg_for_device(hwdev, sg, nelems, dir); flush_write_buffers(); } static inline dma_addr_t dma_map_page(struct device *dev, struct page *page, size_t offset, size_t size, - int direction) + enum dma_data_direction dir) { - struct dma_mapping_ops *ops = get_dma_ops(dev); + struct dma_map_ops *ops = get_dma_ops(dev); - BUG_ON(!valid_dma_direction(direction)); - return ops->map_single(dev, page_to_phys(page) + offset, - size, direction); + kmemcheck_mark_initialized(page_address(page) + offset, size); + BUG_ON(!valid_dma_direction(dir)); + return ops->map_page(dev, page, offset, size, dir, NULL); } static inline void dma_unmap_page(struct device *dev, dma_addr_t addr, - size_t size, int direction) + size_t size, enum dma_data_direction dir) { - dma_unmap_single(dev, addr, size, direction); + dma_unmap_single(dev, addr, size, dir); } static inline void @@ -266,7 +239,7 @@ static inline void * dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t gfp) { - struct dma_mapping_ops *ops = get_dma_ops(dev); + struct dma_map_ops *ops = get_dma_ops(dev); void *memory; gfp &= ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32); @@ -292,7 +265,7 @@ dma_alloc_coherent(struct device *dev, s static inline void dma_free_coherent(struct device *dev, size_t size, void *vaddr, dma_addr_t bus) { - struct dma_mapping_ops *ops = get_dma_ops(dev); + struct dma_map_ops *ops = get_dma_ops(dev); WARN_ON(irqs_disabled()); /* for portability */ Index: linux-2.6-tip/arch/x86/include/asm/do_timer.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/do_timer.h @@ -0,0 +1,16 @@ +/* defines for inline arch setup functions */ +#include + +#include +#include + +/** + * do_timer_interrupt_hook - hook into timer tick + * + * Call the pit clock event handler. see asm/i8253.h + **/ + +static inline void do_timer_interrupt_hook(void) +{ + global_clock_event->event_handler(global_clock_event); +} Index: linux-2.6-tip/arch/x86/include/asm/elf.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/elf.h +++ linux-2.6-tip/arch/x86/include/asm/elf.h @@ -112,7 +112,7 @@ extern unsigned int vdso_enabled; * now struct_user_regs, they are different) */ -#define ELF_CORE_COPY_REGS(pr_reg, regs) \ +#define ELF_CORE_COPY_REGS_COMMON(pr_reg, regs) \ do { \ pr_reg[0] = regs->bx; \ pr_reg[1] = regs->cx; \ @@ -124,7 +124,6 @@ do { \ pr_reg[7] = regs->ds & 0xffff; \ pr_reg[8] = regs->es & 0xffff; \ pr_reg[9] = regs->fs & 0xffff; \ - savesegment(gs, pr_reg[10]); \ pr_reg[11] = regs->orig_ax; \ pr_reg[12] = regs->ip; \ pr_reg[13] = regs->cs & 0xffff; \ @@ -133,6 +132,18 @@ do { \ pr_reg[16] = regs->ss & 0xffff; \ } while (0); +#define ELF_CORE_COPY_REGS(pr_reg, regs) \ +do { \ + ELF_CORE_COPY_REGS_COMMON(pr_reg, regs);\ + pr_reg[10] = get_user_gs(regs); \ +} while (0); + +#define ELF_CORE_COPY_KERNEL_REGS(pr_reg, regs) \ +do { \ + ELF_CORE_COPY_REGS_COMMON(pr_reg, regs);\ + savesegment(gs, pr_reg[10]); \ +} while (0); + #define ELF_PLATFORM (utsname()->machine) #define set_personality_64bit() do { } while (0) Index: linux-2.6-tip/arch/x86/include/asm/entry_arch.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/entry_arch.h @@ -0,0 +1,57 @@ +/* + * This file is designed to contain the BUILD_INTERRUPT specifications for + * all of the extra named interrupt vectors used by the architecture. + * Usually this is the Inter Process Interrupts (IPIs) + */ + +/* + * The following vectors are part of the Linux architecture, there + * is no hardware IRQ pin equivalent for them, they are triggered + * through the ICC by us (IPIs) + */ +#ifdef CONFIG_SMP +BUILD_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR) +BUILD_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR) +BUILD_INTERRUPT(call_function_single_interrupt,CALL_FUNCTION_SINGLE_VECTOR) +BUILD_INTERRUPT(irq_move_cleanup_interrupt,IRQ_MOVE_CLEANUP_VECTOR) + +BUILD_INTERRUPT3(invalidate_interrupt0,INVALIDATE_TLB_VECTOR_START+0, + smp_invalidate_interrupt) +BUILD_INTERRUPT3(invalidate_interrupt1,INVALIDATE_TLB_VECTOR_START+1, + smp_invalidate_interrupt) +BUILD_INTERRUPT3(invalidate_interrupt2,INVALIDATE_TLB_VECTOR_START+2, + smp_invalidate_interrupt) +BUILD_INTERRUPT3(invalidate_interrupt3,INVALIDATE_TLB_VECTOR_START+3, + smp_invalidate_interrupt) +BUILD_INTERRUPT3(invalidate_interrupt4,INVALIDATE_TLB_VECTOR_START+4, + smp_invalidate_interrupt) +BUILD_INTERRUPT3(invalidate_interrupt5,INVALIDATE_TLB_VECTOR_START+5, + smp_invalidate_interrupt) +BUILD_INTERRUPT3(invalidate_interrupt6,INVALIDATE_TLB_VECTOR_START+6, + smp_invalidate_interrupt) +BUILD_INTERRUPT3(invalidate_interrupt7,INVALIDATE_TLB_VECTOR_START+7, + smp_invalidate_interrupt) +#endif + +/* + * every pentium local APIC has two 'local interrupts', with a + * soft-definable vector attached to both interrupts, one of + * which is a timer interrupt, the other one is error counter + * overflow. Linux uses the local APIC timer interrupt to get + * a much simpler SMP time architecture: + */ +#ifdef CONFIG_X86_LOCAL_APIC + +BUILD_INTERRUPT(apic_timer_interrupt,LOCAL_TIMER_VECTOR) +BUILD_INTERRUPT(error_interrupt,ERROR_APIC_VECTOR) +BUILD_INTERRUPT(spurious_interrupt,SPURIOUS_APIC_VECTOR) + +#ifdef CONFIG_PERF_COUNTERS +BUILD_INTERRUPT(perf_counter_interrupt, LOCAL_PERF_VECTOR) +#endif + +#ifdef CONFIG_X86_MCE_P4THERMAL +BUILD_INTERRUPT(thermal_interrupt,THERMAL_APIC_VECTOR) +#endif + +#endif Index: linux-2.6-tip/arch/x86/include/asm/es7000/apic.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/es7000/apic.h +++ /dev/null @@ -1,242 +0,0 @@ -#ifndef __ASM_ES7000_APIC_H -#define __ASM_ES7000_APIC_H - -#include - -#define xapic_phys_to_log_apicid(cpu) per_cpu(x86_bios_cpu_apicid, cpu) -#define esr_disable (1) - -static inline int apic_id_registered(void) -{ - return (1); -} - -static inline const cpumask_t *target_cpus_cluster(void) -{ - return &CPU_MASK_ALL; -} - -static inline const cpumask_t *target_cpus(void) -{ - return &cpumask_of_cpu(smp_processor_id()); -} - -#define APIC_DFR_VALUE_CLUSTER (APIC_DFR_CLUSTER) -#define INT_DELIVERY_MODE_CLUSTER (dest_LowestPrio) -#define INT_DEST_MODE_CLUSTER (1) /* logical delivery broadcast to all procs */ -#define NO_BALANCE_IRQ_CLUSTER (1) - -#define APIC_DFR_VALUE (APIC_DFR_FLAT) -#define INT_DELIVERY_MODE (dest_Fixed) -#define INT_DEST_MODE (0) /* phys delivery to target procs */ -#define NO_BALANCE_IRQ (0) -#undef APIC_DEST_LOGICAL -#define APIC_DEST_LOGICAL 0x0 - -static inline unsigned long check_apicid_used(physid_mask_t bitmap, int apicid) -{ - return 0; -} -static inline unsigned long check_apicid_present(int bit) -{ - return physid_isset(bit, phys_cpu_present_map); -} - -#define apicid_cluster(apicid) (apicid & 0xF0) - -static inline unsigned long calculate_ldr(int cpu) -{ - unsigned long id; - id = xapic_phys_to_log_apicid(cpu); - return (SET_APIC_LOGICAL_ID(id)); -} - -/* - * Set up the logical destination ID. - * - * Intel recommends to set DFR, LdR and TPR before enabling - * an APIC. See e.g. "AP-388 82489DX User's Manual" (Intel - * document number 292116). So here it goes... - */ -static inline void init_apic_ldr_cluster(void) -{ - unsigned long val; - int cpu = smp_processor_id(); - - apic_write(APIC_DFR, APIC_DFR_VALUE_CLUSTER); - val = calculate_ldr(cpu); - apic_write(APIC_LDR, val); -} - -static inline void init_apic_ldr(void) -{ - unsigned long val; - int cpu = smp_processor_id(); - - apic_write(APIC_DFR, APIC_DFR_VALUE); - val = calculate_ldr(cpu); - apic_write(APIC_LDR, val); -} - -extern int apic_version [MAX_APICS]; -static inline void setup_apic_routing(void) -{ - int apic = per_cpu(x86_bios_cpu_apicid, smp_processor_id()); - printk("Enabling APIC mode: %s. Using %d I/O APICs, target cpus %lx\n", - (apic_version[apic] == 0x14) ? - "Physical Cluster" : "Logical Cluster", - nr_ioapics, cpus_addr(*target_cpus())[0]); -} - -static inline int multi_timer_check(int apic, int irq) -{ - return 0; -} - -static inline int apicid_to_node(int logical_apicid) -{ - return 0; -} - - -static inline int cpu_present_to_apicid(int mps_cpu) -{ - if (!mps_cpu) - return boot_cpu_physical_apicid; - else if (mps_cpu < nr_cpu_ids) - return (int) per_cpu(x86_bios_cpu_apicid, mps_cpu); - else - return BAD_APICID; -} - -static inline physid_mask_t apicid_to_cpu_present(int phys_apicid) -{ - static int id = 0; - physid_mask_t mask; - mask = physid_mask_of_physid(id); - ++id; - return mask; -} - -extern u8 cpu_2_logical_apicid[]; -/* Mapping from cpu number to logical apicid */ -static inline int cpu_to_logical_apicid(int cpu) -{ -#ifdef CONFIG_SMP - if (cpu >= nr_cpu_ids) - return BAD_APICID; - return (int)cpu_2_logical_apicid[cpu]; -#else - return logical_smp_processor_id(); -#endif -} - -static inline physid_mask_t ioapic_phys_id_map(physid_mask_t phys_map) -{ - /* For clustered we don't have a good way to do this yet - hack */ - return physids_promote(0xff); -} - - -static inline void setup_portio_remap(void) -{ -} - -extern unsigned int boot_cpu_physical_apicid; -static inline int check_phys_apicid_present(int cpu_physical_apicid) -{ - boot_cpu_physical_apicid = read_apic_id(); - return (1); -} - -static inline unsigned int -cpu_mask_to_apicid_cluster(const struct cpumask *cpumask) -{ - int num_bits_set; - int cpus_found = 0; - int cpu; - int apicid; - - num_bits_set = cpumask_weight(cpumask); - /* Return id to all */ - if (num_bits_set == nr_cpu_ids) - return 0xFF; - /* - * The cpus in the mask must all be on the apic cluster. If are not - * on the same apicid cluster return default value of TARGET_CPUS. - */ - cpu = cpumask_first(cpumask); - apicid = cpu_to_logical_apicid(cpu); - while (cpus_found < num_bits_set) { - if (cpumask_test_cpu(cpu, cpumask)) { - int new_apicid = cpu_to_logical_apicid(cpu); - if (apicid_cluster(apicid) != - apicid_cluster(new_apicid)){ - printk ("%s: Not a valid mask!\n", __func__); - return 0xFF; - } - apicid = new_apicid; - cpus_found++; - } - cpu++; - } - return apicid; -} - -static inline unsigned int cpu_mask_to_apicid(const cpumask_t *cpumask) -{ - int num_bits_set; - int cpus_found = 0; - int cpu; - int apicid; - - num_bits_set = cpus_weight(*cpumask); - /* Return id to all */ - if (num_bits_set == nr_cpu_ids) - return cpu_to_logical_apicid(0); - /* - * The cpus in the mask must all be on the apic cluster. If are not - * on the same apicid cluster return default value of TARGET_CPUS. - */ - cpu = first_cpu(*cpumask); - apicid = cpu_to_logical_apicid(cpu); - while (cpus_found < num_bits_set) { - if (cpu_isset(cpu, *cpumask)) { - int new_apicid = cpu_to_logical_apicid(cpu); - if (apicid_cluster(apicid) != - apicid_cluster(new_apicid)){ - printk ("%s: Not a valid mask!\n", __func__); - return cpu_to_logical_apicid(0); - } - apicid = new_apicid; - cpus_found++; - } - cpu++; - } - return apicid; -} - - -static inline unsigned int cpu_mask_to_apicid_and(const struct cpumask *inmask, - const struct cpumask *andmask) -{ - int apicid = cpu_to_logical_apicid(0); - cpumask_var_t cpumask; - - if (!alloc_cpumask_var(&cpumask, GFP_ATOMIC)) - return apicid; - - cpumask_and(cpumask, inmask, andmask); - cpumask_and(cpumask, cpumask, cpu_online_mask); - apicid = cpu_mask_to_apicid(cpumask); - - free_cpumask_var(cpumask); - return apicid; -} - -static inline u32 phys_pkg_id(u32 cpuid_apic, int index_msb) -{ - return cpuid_apic >> index_msb; -} - -#endif /* __ASM_ES7000_APIC_H */ Index: linux-2.6-tip/arch/x86/include/asm/es7000/apicdef.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/es7000/apicdef.h +++ /dev/null @@ -1,13 +0,0 @@ -#ifndef __ASM_ES7000_APICDEF_H -#define __ASM_ES7000_APICDEF_H - -#define APIC_ID_MASK (0xFF<<24) - -static inline unsigned get_apic_id(unsigned long x) -{ - return (((x)>>24)&0xFF); -} - -#define GET_APIC_ID(x) get_apic_id(x) - -#endif Index: linux-2.6-tip/arch/x86/include/asm/es7000/ipi.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/es7000/ipi.h +++ /dev/null @@ -1,22 +0,0 @@ -#ifndef __ASM_ES7000_IPI_H -#define __ASM_ES7000_IPI_H - -void send_IPI_mask_sequence(const struct cpumask *mask, int vector); -void send_IPI_mask_allbutself(const struct cpumask *mask, int vector); - -static inline void send_IPI_mask(const struct cpumask *mask, int vector) -{ - send_IPI_mask_sequence(mask, vector); -} - -static inline void send_IPI_allbutself(int vector) -{ - send_IPI_mask_allbutself(cpu_online_mask, vector); -} - -static inline void send_IPI_all(int vector) -{ - send_IPI_mask(cpu_online_mask, vector); -} - -#endif /* __ASM_ES7000_IPI_H */ Index: linux-2.6-tip/arch/x86/include/asm/es7000/mpparse.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/es7000/mpparse.h +++ /dev/null @@ -1,29 +0,0 @@ -#ifndef __ASM_ES7000_MPPARSE_H -#define __ASM_ES7000_MPPARSE_H - -#include - -extern int parse_unisys_oem (char *oemptr); -extern int find_unisys_acpi_oem_table(unsigned long *oem_addr); -extern void unmap_unisys_acpi_oem_table(unsigned long oem_addr); -extern void setup_unisys(void); - -#ifndef CONFIG_X86_GENERICARCH -extern int acpi_madt_oem_check(char *oem_id, char *oem_table_id); -extern int mps_oem_check(struct mpc_table *mpc, char *oem, char *productid); -#endif - -#ifdef CONFIG_ACPI - -static inline int es7000_check_dsdt(void) -{ - struct acpi_table_header header; - - if (ACPI_SUCCESS(acpi_get_table_header(ACPI_SIG_DSDT, 0, &header)) && - !strncmp(header.oem_id, "UNISYS", 6)) - return 1; - return 0; -} -#endif - -#endif /* __ASM_MACH_MPPARSE_H */ Index: linux-2.6-tip/arch/x86/include/asm/es7000/wakecpu.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/es7000/wakecpu.h +++ /dev/null @@ -1,37 +0,0 @@ -#ifndef __ASM_ES7000_WAKECPU_H -#define __ASM_ES7000_WAKECPU_H - -#define TRAMPOLINE_PHYS_LOW 0x467 -#define TRAMPOLINE_PHYS_HIGH 0x469 - -static inline void wait_for_init_deassert(atomic_t *deassert) -{ -#ifndef CONFIG_ES7000_CLUSTERED_APIC - while (!atomic_read(deassert)) - cpu_relax(); -#endif - return; -} - -/* Nothing to do for most platforms, since cleared by the INIT cycle */ -static inline void smp_callin_clear_local_apic(void) -{ -} - -static inline void store_NMI_vector(unsigned short *high, unsigned short *low) -{ -} - -static inline void restore_NMI_vector(unsigned short *high, unsigned short *low) -{ -} - -extern void __inquire_remote_apic(int apicid); - -static inline void inquire_remote_apic(int apicid) -{ - if (apic_verbosity >= APIC_DEBUG) - __inquire_remote_apic(apicid); -} - -#endif /* __ASM_MACH_WAKECPU_H */ Index: linux-2.6-tip/arch/x86/include/asm/fixmap_32.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/fixmap_32.h +++ linux-2.6-tip/arch/x86/include/asm/fixmap_32.h @@ -95,10 +95,6 @@ enum fixed_addresses { (__end_of_permanent_fixed_addresses & 255), FIX_BTMAP_BEGIN = FIX_BTMAP_END + NR_FIX_BTMAPS*FIX_BTMAPS_SLOTS - 1, FIX_WP_TEST, -#ifdef CONFIG_ACPI - FIX_ACPI_BEGIN, - FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1, -#endif #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT FIX_OHCI1394_BASE, #endif Index: linux-2.6-tip/arch/x86/include/asm/fixmap_64.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/fixmap_64.h +++ linux-2.6-tip/arch/x86/include/asm/fixmap_64.h @@ -50,10 +50,6 @@ enum fixed_addresses { FIX_PARAVIRT_BOOTMAP, #endif __end_of_permanent_fixed_addresses, -#ifdef CONFIG_ACPI - FIX_ACPI_BEGIN, - FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1, -#endif #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT FIX_OHCI1394_BASE, #endif Index: linux-2.6-tip/arch/x86/include/asm/ftrace.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/ftrace.h +++ linux-2.6-tip/arch/x86/include/asm/ftrace.h @@ -55,29 +55,4 @@ struct dyn_arch_ftrace { #endif /* __ASSEMBLY__ */ #endif /* CONFIG_FUNCTION_TRACER */ -#ifdef CONFIG_FUNCTION_GRAPH_TRACER - -#ifndef __ASSEMBLY__ - -/* - * Stack of return addresses for functions - * of a thread. - * Used in struct thread_info - */ -struct ftrace_ret_stack { - unsigned long ret; - unsigned long func; - unsigned long long calltime; -}; - -/* - * Primary handler of a function return. - * It relays on ftrace_return_to_handler. - * Defined in entry_32/64.S - */ -extern void return_to_handler(void); - -#endif /* __ASSEMBLY__ */ -#endif /* CONFIG_FUNCTION_GRAPH_TRACER */ - #endif /* _ASM_X86_FTRACE_H */ Index: linux-2.6-tip/arch/x86/include/asm/genapic.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/genapic.h +++ linux-2.6-tip/arch/x86/include/asm/genapic.h @@ -1,5 +1 @@ -#ifdef CONFIG_X86_32 -# include "genapic_32.h" -#else -# include "genapic_64.h" -#endif +#include Index: linux-2.6-tip/arch/x86/include/asm/genapic_32.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/genapic_32.h +++ /dev/null @@ -1,148 +0,0 @@ -#ifndef _ASM_X86_GENAPIC_32_H -#define _ASM_X86_GENAPIC_32_H - -#include -#include - -/* - * Generic APIC driver interface. - * - * An straight forward mapping of the APIC related parts of the - * x86 subarchitecture interface to a dynamic object. - * - * This is used by the "generic" x86 subarchitecture. - * - * Copyright 2003 Andi Kleen, SuSE Labs. - */ - -struct mpc_bus; -struct mpc_table; -struct mpc_cpu; - -struct genapic { - char *name; - int (*probe)(void); - - int (*apic_id_registered)(void); - const struct cpumask *(*target_cpus)(void); - int int_delivery_mode; - int int_dest_mode; - int ESR_DISABLE; - int apic_destination_logical; - unsigned long (*check_apicid_used)(physid_mask_t bitmap, int apicid); - unsigned long (*check_apicid_present)(int apicid); - int no_balance_irq; - int no_ioapic_check; - void (*init_apic_ldr)(void); - physid_mask_t (*ioapic_phys_id_map)(physid_mask_t map); - - void (*setup_apic_routing)(void); - int (*multi_timer_check)(int apic, int irq); - int (*apicid_to_node)(int logical_apicid); - int (*cpu_to_logical_apicid)(int cpu); - int (*cpu_present_to_apicid)(int mps_cpu); - physid_mask_t (*apicid_to_cpu_present)(int phys_apicid); - void (*setup_portio_remap)(void); - int (*check_phys_apicid_present)(int boot_cpu_physical_apicid); - void (*enable_apic_mode)(void); - u32 (*phys_pkg_id)(u32 cpuid_apic, int index_msb); - - /* mpparse */ - /* When one of the next two hooks returns 1 the genapic - is switched to this. Essentially they are additional probe - functions. */ - int (*mps_oem_check)(struct mpc_table *mpc, char *oem, - char *productid); - int (*acpi_madt_oem_check)(char *oem_id, char *oem_table_id); - - unsigned (*get_apic_id)(unsigned long x); - unsigned long apic_id_mask; - unsigned int (*cpu_mask_to_apicid)(const struct cpumask *cpumask); - unsigned int (*cpu_mask_to_apicid_and)(const struct cpumask *cpumask, - const struct cpumask *andmask); - void (*vector_allocation_domain)(int cpu, struct cpumask *retmask); - -#ifdef CONFIG_SMP - /* ipi */ - void (*send_IPI_mask)(const struct cpumask *mask, int vector); - void (*send_IPI_mask_allbutself)(const struct cpumask *mask, - int vector); - void (*send_IPI_allbutself)(int vector); - void (*send_IPI_all)(int vector); -#endif - int (*wakeup_cpu)(int apicid, unsigned long start_eip); - int trampoline_phys_low; - int trampoline_phys_high; - void (*wait_for_init_deassert)(atomic_t *deassert); - void (*smp_callin_clear_local_apic)(void); - void (*store_NMI_vector)(unsigned short *high, unsigned short *low); - void (*restore_NMI_vector)(unsigned short *high, unsigned short *low); - void (*inquire_remote_apic)(int apicid); -}; - -#define APICFUNC(x) .x = x, - -/* More functions could be probably marked IPIFUNC and save some space - in UP GENERICARCH kernels, but I don't have the nerve right now - to untangle this mess. -AK */ -#ifdef CONFIG_SMP -#define IPIFUNC(x) APICFUNC(x) -#else -#define IPIFUNC(x) -#endif - -#define APIC_INIT(aname, aprobe) \ -{ \ - .name = aname, \ - .probe = aprobe, \ - .int_delivery_mode = INT_DELIVERY_MODE, \ - .int_dest_mode = INT_DEST_MODE, \ - .no_balance_irq = NO_BALANCE_IRQ, \ - .ESR_DISABLE = esr_disable, \ - .apic_destination_logical = APIC_DEST_LOGICAL, \ - APICFUNC(apic_id_registered) \ - APICFUNC(target_cpus) \ - APICFUNC(check_apicid_used) \ - APICFUNC(check_apicid_present) \ - APICFUNC(init_apic_ldr) \ - APICFUNC(ioapic_phys_id_map) \ - APICFUNC(setup_apic_routing) \ - APICFUNC(multi_timer_check) \ - APICFUNC(apicid_to_node) \ - APICFUNC(cpu_to_logical_apicid) \ - APICFUNC(cpu_present_to_apicid) \ - APICFUNC(apicid_to_cpu_present) \ - APICFUNC(setup_portio_remap) \ - APICFUNC(check_phys_apicid_present) \ - APICFUNC(mps_oem_check) \ - APICFUNC(get_apic_id) \ - .apic_id_mask = APIC_ID_MASK, \ - APICFUNC(cpu_mask_to_apicid) \ - APICFUNC(cpu_mask_to_apicid_and) \ - APICFUNC(vector_allocation_domain) \ - APICFUNC(acpi_madt_oem_check) \ - IPIFUNC(send_IPI_mask) \ - IPIFUNC(send_IPI_allbutself) \ - IPIFUNC(send_IPI_all) \ - APICFUNC(enable_apic_mode) \ - APICFUNC(phys_pkg_id) \ - .trampoline_phys_low = TRAMPOLINE_PHYS_LOW, \ - .trampoline_phys_high = TRAMPOLINE_PHYS_HIGH, \ - APICFUNC(wait_for_init_deassert) \ - APICFUNC(smp_callin_clear_local_apic) \ - APICFUNC(store_NMI_vector) \ - APICFUNC(restore_NMI_vector) \ - APICFUNC(inquire_remote_apic) \ -} - -extern struct genapic *genapic; -extern void es7000_update_genapic_to_cluster(void); - -enum uv_system_type {UV_NONE, UV_LEGACY_APIC, UV_X2APIC, UV_NON_UNIQUE_APIC}; -#define get_uv_system_type() UV_NONE -#define is_uv_system() 0 -#define uv_wakeup_secondary(a, b) 1 -#define uv_system_init() do {} while (0) - - -#endif /* _ASM_X86_GENAPIC_32_H */ Index: linux-2.6-tip/arch/x86/include/asm/genapic_64.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/genapic_64.h +++ /dev/null @@ -1,66 +0,0 @@ -#ifndef _ASM_X86_GENAPIC_64_H -#define _ASM_X86_GENAPIC_64_H - -#include - -/* - * Copyright 2004 James Cleverdon, IBM. - * Subject to the GNU Public License, v.2 - * - * Generic APIC sub-arch data struct. - * - * Hacked for x86-64 by James Cleverdon from i386 architecture code by - * Martin Bligh, Andi Kleen, James Bottomley, John Stultz, and - * James Cleverdon. - */ - -struct genapic { - char *name; - int (*acpi_madt_oem_check)(char *oem_id, char *oem_table_id); - u32 int_delivery_mode; - u32 int_dest_mode; - int (*apic_id_registered)(void); - const struct cpumask *(*target_cpus)(void); - void (*vector_allocation_domain)(int cpu, struct cpumask *retmask); - void (*init_apic_ldr)(void); - /* ipi */ - void (*send_IPI_mask)(const struct cpumask *mask, int vector); - void (*send_IPI_mask_allbutself)(const struct cpumask *mask, - int vector); - void (*send_IPI_allbutself)(int vector); - void (*send_IPI_all)(int vector); - void (*send_IPI_self)(int vector); - /* */ - unsigned int (*cpu_mask_to_apicid)(const struct cpumask *cpumask); - unsigned int (*cpu_mask_to_apicid_and)(const struct cpumask *cpumask, - const struct cpumask *andmask); - unsigned int (*phys_pkg_id)(int index_msb); - unsigned int (*get_apic_id)(unsigned long x); - unsigned long (*set_apic_id)(unsigned int id); - unsigned long apic_id_mask; - /* wakeup_secondary_cpu */ - int (*wakeup_cpu)(int apicid, unsigned long start_eip); -}; - -extern struct genapic *genapic; - -extern struct genapic apic_flat; -extern struct genapic apic_physflat; -extern struct genapic apic_x2apic_cluster; -extern struct genapic apic_x2apic_phys; -extern int acpi_madt_oem_check(char *, char *); - -extern void apic_send_IPI_self(int vector); -enum uv_system_type {UV_NONE, UV_LEGACY_APIC, UV_X2APIC, UV_NON_UNIQUE_APIC}; -extern enum uv_system_type get_uv_system_type(void); -extern int is_uv_system(void); - -extern struct genapic apic_x2apic_uv_x; -DECLARE_PER_CPU(int, x2apic_extra_bits); -extern void uv_cpu_init(void); -extern void uv_system_init(void); -extern int uv_wakeup_secondary(int phys_apicid, unsigned int start_rip); - -extern void setup_apic_routing(void); - -#endif /* _ASM_X86_GENAPIC_64_H */ Index: linux-2.6-tip/arch/x86/include/asm/hardirq.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/hardirq.h +++ linux-2.6-tip/arch/x86/include/asm/hardirq.h @@ -1,11 +1,53 @@ -#ifdef CONFIG_X86_32 -# include "hardirq_32.h" -#else -# include "hardirq_64.h" +#ifndef _ASM_X86_HARDIRQ_H +#define _ASM_X86_HARDIRQ_H + +#include +#include + +typedef struct { + unsigned int __softirq_pending; + unsigned int __nmi_count; /* arch dependent */ + unsigned int irq0_irqs; +#ifdef CONFIG_X86_LOCAL_APIC + unsigned int apic_timer_irqs; /* arch dependent */ + unsigned int irq_spurious_count; +#endif + unsigned int apic_perf_irqs; +#ifdef CONFIG_SMP + unsigned int irq_resched_count; + unsigned int irq_call_count; + unsigned int irq_tlb_count; +#endif +#ifdef CONFIG_X86_MCE + unsigned int irq_thermal_count; +# ifdef CONFIG_X86_64 + unsigned int irq_threshold_count; +# endif #endif +} ____cacheline_aligned irq_cpustat_t; + +DECLARE_PER_CPU(irq_cpustat_t, irq_stat); + +/* We can have at most NR_VECTORS irqs routed to a cpu at a time */ +#define MAX_HARDIRQS_PER_CPU NR_VECTORS + +#define __ARCH_IRQ_STAT + +#define inc_irq_stat(member) percpu_add(irq_stat.member, 1) + +#define local_softirq_pending() percpu_read(irq_stat.__softirq_pending) + +#define __ARCH_SET_SOFTIRQ_PENDING + +#define set_softirq_pending(x) percpu_write(irq_stat.__softirq_pending, (x)) +#define or_softirq_pending(x) percpu_or(irq_stat.__softirq_pending, (x)) + +extern void ack_bad_irq(unsigned int irq); extern u64 arch_irq_stat_cpu(unsigned int cpu); #define arch_irq_stat_cpu arch_irq_stat_cpu extern u64 arch_irq_stat(void); #define arch_irq_stat arch_irq_stat + +#endif /* _ASM_X86_HARDIRQ_H */ Index: linux-2.6-tip/arch/x86/include/asm/hardirq_32.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/hardirq_32.h +++ /dev/null @@ -1,30 +0,0 @@ -#ifndef _ASM_X86_HARDIRQ_32_H -#define _ASM_X86_HARDIRQ_32_H - -#include -#include - -typedef struct { - unsigned int __softirq_pending; - unsigned long idle_timestamp; - unsigned int __nmi_count; /* arch dependent */ - unsigned int apic_timer_irqs; /* arch dependent */ - unsigned int irq0_irqs; - unsigned int irq_resched_count; - unsigned int irq_call_count; - unsigned int irq_tlb_count; - unsigned int irq_thermal_count; - unsigned int irq_spurious_count; -} ____cacheline_aligned irq_cpustat_t; - -DECLARE_PER_CPU(irq_cpustat_t, irq_stat); - -#define __ARCH_IRQ_STAT -#define __IRQ_STAT(cpu, member) (per_cpu(irq_stat, cpu).member) - -#define inc_irq_stat(member) (__get_cpu_var(irq_stat).member++) - -void ack_bad_irq(unsigned int irq); -#include - -#endif /* _ASM_X86_HARDIRQ_32_H */ Index: linux-2.6-tip/arch/x86/include/asm/hardirq_64.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/hardirq_64.h +++ /dev/null @@ -1,25 +0,0 @@ -#ifndef _ASM_X86_HARDIRQ_64_H -#define _ASM_X86_HARDIRQ_64_H - -#include -#include -#include -#include - -/* We can have at most NR_VECTORS irqs routed to a cpu at a time */ -#define MAX_HARDIRQS_PER_CPU NR_VECTORS - -#define __ARCH_IRQ_STAT 1 - -#define inc_irq_stat(member) add_pda(member, 1) - -#define local_softirq_pending() read_pda(__softirq_pending) - -#define __ARCH_SET_SOFTIRQ_PENDING 1 - -#define set_softirq_pending(x) write_pda(__softirq_pending, (x)) -#define or_softirq_pending(x) or_pda(__softirq_pending, (x)) - -extern void ack_bad_irq(unsigned int irq); - -#endif /* _ASM_X86_HARDIRQ_64_H */ Index: linux-2.6-tip/arch/x86/include/asm/hw_irq.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/hw_irq.h +++ linux-2.6-tip/arch/x86/include/asm/hw_irq.h @@ -25,11 +25,11 @@ #include #include -#define platform_legacy_irq(irq) ((irq) < 16) - /* Interrupt handlers registered during init_IRQ */ extern void apic_timer_interrupt(void); extern void error_interrupt(void); +extern void perf_counter_interrupt(void); + extern void spurious_interrupt(void); extern void thermal_interrupt(void); extern void reschedule_interrupt(void); @@ -58,7 +58,7 @@ extern void make_8259A_irq(unsigned int extern void init_8259A(int aeoi); /* IOAPIC */ -#define IO_APIC_IRQ(x) (((x) >= 16) || ((1<<(x)) & io_apic_irqs)) +#define IO_APIC_IRQ(x) (((x) >= NR_IRQS_LEGACY) || ((1<<(x)) & io_apic_irqs)) extern unsigned long io_apic_irqs; extern void init_VISWS_APIC_irqs(void); @@ -67,15 +67,7 @@ extern void disable_IO_APIC(void); extern int IO_APIC_get_PCI_irq_vector(int bus, int slot, int fn); extern void setup_ioapic_dest(void); -#ifdef CONFIG_X86_64 extern void enable_IO_APIC(void); -#endif - -/* IPI functions */ -#ifdef CONFIG_X86_32 -extern void send_IPI_self(int vector); -#endif -extern void send_IPI(int dest, int vector); /* Statistics */ extern atomic_t irq_err_count; @@ -84,21 +76,11 @@ extern atomic_t irq_mis_count; /* EISA */ extern void eisa_set_level_irq(unsigned int irq); -/* Voyager functions */ -extern asmlinkage void vic_cpi_interrupt(void); -extern asmlinkage void vic_sys_interrupt(void); -extern asmlinkage void vic_cmn_interrupt(void); -extern asmlinkage void qic_timer_interrupt(void); -extern asmlinkage void qic_invalidate_interrupt(void); -extern asmlinkage void qic_reschedule_interrupt(void); -extern asmlinkage void qic_enable_irq_interrupt(void); -extern asmlinkage void qic_call_function_interrupt(void); - /* SMP */ extern void smp_apic_timer_interrupt(struct pt_regs *); extern void smp_spurious_interrupt(struct pt_regs *); extern void smp_error_interrupt(struct pt_regs *); -#ifdef CONFIG_X86_SMP +#ifdef CONFIG_SMP extern void smp_reschedule_interrupt(struct pt_regs *); extern void smp_call_function_interrupt(struct pt_regs *); extern void smp_call_function_single_interrupt(struct pt_regs *); Index: linux-2.6-tip/arch/x86/include/asm/i8259.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/i8259.h +++ linux-2.6-tip/arch/x86/include/asm/i8259.h @@ -24,7 +24,7 @@ extern unsigned int cached_irq_mask; #define SLAVE_ICW4_DEFAULT 0x01 #define PIC_ICW4_AEOI 2 -extern spinlock_t i8259A_lock; +extern raw_spinlock_t i8259A_lock; extern void init_8259A(int auto_eoi); extern void enable_8259A_irq(unsigned int irq); @@ -60,4 +60,8 @@ extern struct irq_chip i8259A_chip; extern void mask_8259A(void); extern void unmask_8259A(void); +#ifdef CONFIG_X86_32 +extern void init_ISA_irqs(void); +#endif + #endif /* _ASM_X86_I8259_H */ Index: linux-2.6-tip/arch/x86/include/asm/intel_arch_perfmon.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/intel_arch_perfmon.h +++ /dev/null @@ -1,31 +0,0 @@ -#ifndef _ASM_X86_INTEL_ARCH_PERFMON_H -#define _ASM_X86_INTEL_ARCH_PERFMON_H - -#define MSR_ARCH_PERFMON_PERFCTR0 0xc1 -#define MSR_ARCH_PERFMON_PERFCTR1 0xc2 - -#define MSR_ARCH_PERFMON_EVENTSEL0 0x186 -#define MSR_ARCH_PERFMON_EVENTSEL1 0x187 - -#define ARCH_PERFMON_EVENTSEL0_ENABLE (1 << 22) -#define ARCH_PERFMON_EVENTSEL_INT (1 << 20) -#define ARCH_PERFMON_EVENTSEL_OS (1 << 17) -#define ARCH_PERFMON_EVENTSEL_USR (1 << 16) - -#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_SEL (0x3c) -#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_UMASK (0x00 << 8) -#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX (0) -#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_PRESENT \ - (1 << (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX)) - -union cpuid10_eax { - struct { - unsigned int version_id:8; - unsigned int num_counters:8; - unsigned int bit_width:8; - unsigned int mask_length:8; - } split; - unsigned int full; -}; - -#endif /* _ASM_X86_INTEL_ARCH_PERFMON_H */ Index: linux-2.6-tip/arch/x86/include/asm/io.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/io.h +++ linux-2.6-tip/arch/x86/include/asm/io.h @@ -5,6 +5,7 @@ #include #include +#include #define build_mmio_read(name, size, type, reg, barrier) \ static inline type name(const volatile void __iomem *addr) \ @@ -80,6 +81,100 @@ static inline void writeq(__u64 val, vol #define readq readq #define writeq writeq +/** + * virt_to_phys - map virtual addresses to physical + * @address: address to remap + * + * The returned physical address is the physical (CPU) mapping for + * the memory address given. It is only valid to use this function on + * addresses directly mapped or allocated via kmalloc. + * + * This function does not give bus mappings for DMA transfers. In + * almost all conceivable cases a device driver should not be using + * this function + */ + +static inline phys_addr_t virt_to_phys(volatile void *address) +{ + return __pa(address); +} + +/** + * phys_to_virt - map physical address to virtual + * @address: address to remap + * + * The returned virtual address is a current CPU mapping for + * the memory address given. It is only valid to use this function on + * addresses that have a kernel mapping + * + * This function does not handle bus mappings for DMA transfers. In + * almost all conceivable cases a device driver should not be using + * this function + */ + +static inline void *phys_to_virt(phys_addr_t address) +{ + return __va(address); +} + +/* + * Change "struct page" to physical address. + */ +#define page_to_phys(page) ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT) + +/* + * ISA I/O bus memory addresses are 1:1 with the physical address. + * However, we truncate the address to unsigned int to avoid undesirable + * promitions in legacy drivers. + */ +static inline unsigned int isa_virt_to_bus(volatile void *address) +{ + return (unsigned int)virt_to_phys(address); +} +#define isa_page_to_bus(page) ((unsigned int)page_to_phys(page)) +#define isa_bus_to_virt phys_to_virt + +/* + * However PCI ones are not necessarily 1:1 and therefore these interfaces + * are forbidden in portable PCI drivers. + * + * Allow them on x86 for legacy drivers, though. + */ +#define virt_to_bus virt_to_phys +#define bus_to_virt phys_to_virt + +/** + * ioremap - map bus memory into CPU space + * @offset: bus address of the memory + * @size: size of the resource to map + * + * ioremap performs a platform specific sequence of operations to + * make bus memory CPU accessible via the readb/readw/readl/writeb/ + * writew/writel functions and the other mmio helpers. The returned + * address is not guaranteed to be usable directly as a virtual + * address. + * + * If the area you are trying to map is a PCI BAR you should have a + * look at pci_iomap(). + */ +extern void __iomem *ioremap_nocache(resource_size_t offset, unsigned long size); +extern void __iomem *ioremap_cache(resource_size_t offset, unsigned long size); +extern void __iomem *ioremap_prot(resource_size_t offset, unsigned long size, + unsigned long prot_val); + +/* + * The default ioremap() behavior is non-cached: + */ +static inline void __iomem *ioremap(resource_size_t offset, unsigned long size) +{ + return ioremap_nocache(offset, size); +} + +extern void iounmap(volatile void __iomem *addr); + +extern void __iomem *fix_ioremap(unsigned idx, unsigned long phys); + + #ifdef CONFIG_X86_32 # include "io_32.h" #else @@ -91,7 +186,7 @@ extern void unxlate_dev_mem_ptr(unsigned extern int ioremap_change_attr(unsigned long vaddr, unsigned long size, unsigned long prot_val); -extern void __iomem *ioremap_wc(unsigned long offset, unsigned long size); +extern void __iomem *ioremap_wc(resource_size_t offset, unsigned long size); /* * early_ioremap() and early_iounmap() are for temporary early boot-time @@ -105,5 +200,6 @@ extern void __iomem *early_memremap(unsi extern void early_iounmap(void __iomem *addr, unsigned long size); extern void __iomem *fix_ioremap(unsigned idx, unsigned long phys); +#define IO_SPACE_LIMIT 0xffff #endif /* _ASM_X86_IO_H */ Index: linux-2.6-tip/arch/x86/include/asm/io_32.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/io_32.h +++ linux-2.6-tip/arch/x86/include/asm/io_32.h @@ -37,8 +37,6 @@ * - Arnaldo Carvalho de Melo */ -#define IO_SPACE_LIMIT 0xffff - #define XQUAD_PORTIO_BASE 0xfe400000 #define XQUAD_PORTIO_QUAD 0x40000 /* 256k per quad. */ @@ -53,92 +51,6 @@ */ #define xlate_dev_kmem_ptr(p) p -/** - * virt_to_phys - map virtual addresses to physical - * @address: address to remap - * - * The returned physical address is the physical (CPU) mapping for - * the memory address given. It is only valid to use this function on - * addresses directly mapped or allocated via kmalloc. - * - * This function does not give bus mappings for DMA transfers. In - * almost all conceivable cases a device driver should not be using - * this function - */ - -static inline unsigned long virt_to_phys(volatile void *address) -{ - return __pa(address); -} - -/** - * phys_to_virt - map physical address to virtual - * @address: address to remap - * - * The returned virtual address is a current CPU mapping for - * the memory address given. It is only valid to use this function on - * addresses that have a kernel mapping - * - * This function does not handle bus mappings for DMA transfers. In - * almost all conceivable cases a device driver should not be using - * this function - */ - -static inline void *phys_to_virt(unsigned long address) -{ - return __va(address); -} - -/* - * Change "struct page" to physical address. - */ -#define page_to_phys(page) ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT) - -/** - * ioremap - map bus memory into CPU space - * @offset: bus address of the memory - * @size: size of the resource to map - * - * ioremap performs a platform specific sequence of operations to - * make bus memory CPU accessible via the readb/readw/readl/writeb/ - * writew/writel functions and the other mmio helpers. The returned - * address is not guaranteed to be usable directly as a virtual - * address. - * - * If the area you are trying to map is a PCI BAR you should have a - * look at pci_iomap(). - */ -extern void __iomem *ioremap_nocache(resource_size_t offset, unsigned long size); -extern void __iomem *ioremap_cache(resource_size_t offset, unsigned long size); -extern void __iomem *ioremap_prot(resource_size_t offset, unsigned long size, - unsigned long prot_val); - -/* - * The default ioremap() behavior is non-cached: - */ -static inline void __iomem *ioremap(resource_size_t offset, unsigned long size) -{ - return ioremap_nocache(offset, size); -} - -extern void iounmap(volatile void __iomem *addr); - -/* - * ISA I/O bus memory addresses are 1:1 with the physical address. - */ -#define isa_virt_to_bus virt_to_phys -#define isa_page_to_bus page_to_phys -#define isa_bus_to_virt phys_to_virt - -/* - * However PCI ones are not necessarily 1:1 and therefore these interfaces - * are forbidden in portable PCI drivers. - * - * Allow them on x86 for legacy drivers, though. - */ -#define virt_to_bus virt_to_phys -#define bus_to_virt phys_to_virt - static inline void memset_io(volatile void __iomem *addr, unsigned char val, int count) { Index: linux-2.6-tip/arch/x86/include/asm/io_64.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/io_64.h +++ linux-2.6-tip/arch/x86/include/asm/io_64.h @@ -136,73 +136,12 @@ __OUTS(b) __OUTS(w) __OUTS(l) -#define IO_SPACE_LIMIT 0xffff - #if defined(__KERNEL__) && defined(__x86_64__) #include -#ifndef __i386__ -/* - * Change virtual addresses to physical addresses and vv. - * These are pretty trivial - */ -static inline unsigned long virt_to_phys(volatile void *address) -{ - return __pa(address); -} - -static inline void *phys_to_virt(unsigned long address) -{ - return __va(address); -} -#endif - -/* - * Change "struct page" to physical address. - */ -#define page_to_phys(page) ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT) - #include -/* - * This one maps high address device memory and turns off caching for that area. - * it's useful if some control registers are in such an area and write combining - * or read caching is not desirable: - */ -extern void __iomem *ioremap_nocache(resource_size_t offset, unsigned long size); -extern void __iomem *ioremap_cache(resource_size_t offset, unsigned long size); -extern void __iomem *ioremap_prot(resource_size_t offset, unsigned long size, - unsigned long prot_val); - -/* - * The default ioremap() behavior is non-cached: - */ -static inline void __iomem *ioremap(resource_size_t offset, unsigned long size) -{ - return ioremap_nocache(offset, size); -} - -extern void iounmap(volatile void __iomem *addr); - -extern void __iomem *fix_ioremap(unsigned idx, unsigned long phys); - -/* - * ISA I/O bus memory addresses are 1:1 with the physical address. - */ -#define isa_virt_to_bus virt_to_phys -#define isa_page_to_bus page_to_phys -#define isa_bus_to_virt phys_to_virt - -/* - * However PCI ones are not necessarily 1:1 and therefore these interfaces - * are forbidden in portable PCI drivers. - * - * Allow them on x86 for legacy drivers, though. - */ -#define virt_to_bus virt_to_phys -#define bus_to_virt phys_to_virt - void __memcpy_fromio(void *, unsigned long, unsigned); void __memcpy_toio(unsigned long, const void *, unsigned); Index: linux-2.6-tip/arch/x86/include/asm/io_apic.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/io_apic.h +++ linux-2.6-tip/arch/x86/include/asm/io_apic.h @@ -114,38 +114,16 @@ struct IR_IO_APIC_route_entry { extern int nr_ioapics; extern int nr_ioapic_registers[MAX_IO_APICS]; -/* - * MP-BIOS irq configuration table structures: - */ - #define MP_MAX_IOAPIC_PIN 127 -struct mp_config_ioapic { - unsigned long mp_apicaddr; - unsigned int mp_apicid; - unsigned char mp_type; - unsigned char mp_apicver; - unsigned char mp_flags; -}; - -struct mp_config_intsrc { - unsigned int mp_dstapic; - unsigned char mp_type; - unsigned char mp_irqtype; - unsigned short mp_irqflag; - unsigned char mp_srcbus; - unsigned char mp_srcbusirq; - unsigned char mp_dstirq; -}; - /* I/O APIC entries */ -extern struct mp_config_ioapic mp_ioapics[MAX_IO_APICS]; +extern struct mpc_ioapic mp_ioapics[MAX_IO_APICS]; /* # of MP IRQ source entries */ extern int mp_irq_entries; /* MP IRQ source entries */ -extern struct mp_config_intsrc mp_irqs[MAX_IRQ_SOURCES]; +extern struct mpc_intsrc mp_irqs[MAX_IRQ_SOURCES]; /* non-0 if default (table-less) MP configuration */ extern int mpc_default_type; @@ -165,15 +143,6 @@ extern int noioapicreroute; /* 1 if the timer IRQ uses the '8259A Virtual Wire' mode */ extern int timer_through_8259; -static inline void disable_ioapic_setup(void) -{ -#ifdef CONFIG_PCI - noioapicquirk = 1; - noioapicreroute = -1; -#endif - skip_ioapic_setup = 1; -} - /* * If we use the IO-APIC for IRQ routing, disable automatic * assignment of PCI IRQ's. @@ -200,6 +169,12 @@ extern void reinit_intr_remapped_IO_APIC extern void probe_nr_irqs_gsi(void); +extern int setup_ioapic_entry(int apic, int irq, + struct IO_APIC_route_entry *entry, + unsigned int destination, int trigger, + int polarity, int vector); +extern void ioapic_write_entry(int apic, int pin, + struct IO_APIC_route_entry e); #else /* !CONFIG_X86_IO_APIC */ #define io_apic_assign_pci_irqs 0 static const int timer_through_8259 = 0; Index: linux-2.6-tip/arch/x86/include/asm/iommu.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/iommu.h +++ linux-2.6-tip/arch/x86/include/asm/iommu.h @@ -3,7 +3,7 @@ extern void pci_iommu_shutdown(void); extern void no_iommu_init(void); -extern struct dma_mapping_ops nommu_dma_ops; +extern struct dma_map_ops nommu_dma_ops; extern int force_iommu, no_iommu; extern int iommu_detected; Index: linux-2.6-tip/arch/x86/include/asm/ipi.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/ipi.h +++ linux-2.6-tip/arch/x86/include/asm/ipi.h @@ -1,6 +1,8 @@ #ifndef _ASM_X86_IPI_H #define _ASM_X86_IPI_H +#ifdef CONFIG_X86_LOCAL_APIC + /* * Copyright 2004 James Cleverdon, IBM. * Subject to the GNU Public License, v.2 @@ -55,8 +57,8 @@ static inline void __xapic_wait_icr_idle cpu_relax(); } -static inline void __send_IPI_shortcut(unsigned int shortcut, int vector, - unsigned int dest) +static inline void +__default_send_IPI_shortcut(unsigned int shortcut, int vector, unsigned int dest) { /* * Subtle. In the case of the 'never do double writes' workaround @@ -87,8 +89,8 @@ static inline void __send_IPI_shortcut(u * This is used to send an IPI with no shorthand notation (the destination is * specified in bits 56 to 63 of the ICR). */ -static inline void __send_IPI_dest_field(unsigned int mask, int vector, - unsigned int dest) +static inline void + __default_send_IPI_dest_field(unsigned int mask, int vector, unsigned int dest) { unsigned long cfg; @@ -117,41 +119,44 @@ static inline void __send_IPI_dest_field native_apic_mem_write(APIC_ICR, cfg); } -static inline void send_IPI_mask_sequence(const struct cpumask *mask, - int vector) -{ - unsigned long flags; - unsigned long query_cpu; +extern void default_send_IPI_mask_sequence_phys(const struct cpumask *mask, + int vector); +extern void default_send_IPI_mask_allbutself_phys(const struct cpumask *mask, + int vector); +extern void default_send_IPI_mask_sequence_logical(const struct cpumask *mask, + int vector); +extern void default_send_IPI_mask_allbutself_logical(const struct cpumask *mask, + int vector); + +/* Avoid include hell */ +#define NMI_VECTOR 0x02 - /* - * Hack. The clustered APIC addressing mode doesn't allow us to send - * to an arbitrary mask, so I do a unicast to each CPU instead. - * - mbligh - */ - local_irq_save(flags); - for_each_cpu(query_cpu, mask) { - __send_IPI_dest_field(per_cpu(x86_cpu_to_apicid, query_cpu), - vector, APIC_DEST_PHYSICAL); - } - local_irq_restore(flags); +extern int no_broadcast; + +static inline void __default_local_send_IPI_allbutself(int vector) +{ + if (no_broadcast || vector == NMI_VECTOR) + apic->send_IPI_mask_allbutself(cpu_online_mask, vector); + else + __default_send_IPI_shortcut(APIC_DEST_ALLBUT, vector, apic->dest_logical); } -static inline void send_IPI_mask_allbutself(const struct cpumask *mask, - int vector) +static inline void __default_local_send_IPI_all(int vector) { - unsigned long flags; - unsigned int query_cpu; - unsigned int this_cpu = smp_processor_id(); + if (no_broadcast || vector == NMI_VECTOR) + apic->send_IPI_mask(cpu_online_mask, vector); + else + __default_send_IPI_shortcut(APIC_DEST_ALLINC, vector, apic->dest_logical); +} - /* See Hack comment above */ +#ifdef CONFIG_X86_32 +extern void default_send_IPI_mask_logical(const struct cpumask *mask, + int vector); +extern void default_send_IPI_allbutself(int vector); +extern void default_send_IPI_all(int vector); +extern void default_send_IPI_self(int vector); +#endif - local_irq_save(flags); - for_each_cpu(query_cpu, mask) - if (query_cpu != this_cpu) - __send_IPI_dest_field( - per_cpu(x86_cpu_to_apicid, query_cpu), - vector, APIC_DEST_PHYSICAL); - local_irq_restore(flags); -} +#endif #endif /* _ASM_X86_IPI_H */ Index: linux-2.6-tip/arch/x86/include/asm/irq.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/irq.h +++ linux-2.6-tip/arch/x86/include/asm/irq.h @@ -36,9 +36,11 @@ static inline int irq_canonicalize(int i extern void fixup_irqs(void); #endif -extern unsigned int do_IRQ(struct pt_regs *regs); extern void init_IRQ(void); extern void native_init_IRQ(void); +extern bool handle_irq(unsigned irq, struct pt_regs *regs); + +extern unsigned int do_IRQ(struct pt_regs *regs); /* Interrupt vector management */ extern DECLARE_BITMAP(used_vectors, NR_VECTORS); Index: linux-2.6-tip/arch/x86/include/asm/irq_regs.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/irq_regs.h +++ linux-2.6-tip/arch/x86/include/asm/irq_regs.h @@ -1,5 +1,31 @@ -#ifdef CONFIG_X86_32 -# include "irq_regs_32.h" -#else -# include "irq_regs_64.h" -#endif +/* + * Per-cpu current frame pointer - the location of the last exception frame on + * the stack, stored in the per-cpu area. + * + * Jeremy Fitzhardinge + */ +#ifndef _ASM_X86_IRQ_REGS_H +#define _ASM_X86_IRQ_REGS_H + +#include + +#define ARCH_HAS_OWN_IRQ_REGS + +DECLARE_PER_CPU(struct pt_regs *, irq_regs); + +static inline struct pt_regs *get_irq_regs(void) +{ + return percpu_read(irq_regs); +} + +static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs) +{ + struct pt_regs *old_regs; + + old_regs = get_irq_regs(); + percpu_write(irq_regs, new_regs); + + return old_regs; +} + +#endif /* _ASM_X86_IRQ_REGS_32_H */ Index: linux-2.6-tip/arch/x86/include/asm/irq_regs_32.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/irq_regs_32.h +++ /dev/null @@ -1,31 +0,0 @@ -/* - * Per-cpu current frame pointer - the location of the last exception frame on - * the stack, stored in the per-cpu area. - * - * Jeremy Fitzhardinge - */ -#ifndef _ASM_X86_IRQ_REGS_32_H -#define _ASM_X86_IRQ_REGS_32_H - -#include - -#define ARCH_HAS_OWN_IRQ_REGS - -DECLARE_PER_CPU(struct pt_regs *, irq_regs); - -static inline struct pt_regs *get_irq_regs(void) -{ - return x86_read_percpu(irq_regs); -} - -static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs) -{ - struct pt_regs *old_regs; - - old_regs = get_irq_regs(); - x86_write_percpu(irq_regs, new_regs); - - return old_regs; -} - -#endif /* _ASM_X86_IRQ_REGS_32_H */ Index: linux-2.6-tip/arch/x86/include/asm/irq_regs_64.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/irq_regs_64.h +++ /dev/null @@ -1 +0,0 @@ -#include Index: linux-2.6-tip/arch/x86/include/asm/irq_vectors.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/irq_vectors.h +++ linux-2.6-tip/arch/x86/include/asm/irq_vectors.h @@ -1,47 +1,69 @@ #ifndef _ASM_X86_IRQ_VECTORS_H #define _ASM_X86_IRQ_VECTORS_H -#include +/* + * Linux IRQ vector layout. + * + * There are 256 IDT entries (per CPU - each entry is 8 bytes) which can + * be defined by Linux. They are used as a jump table by the CPU when a + * given vector is triggered - by a CPU-external, CPU-internal or + * software-triggered event. + * + * Linux sets the kernel code address each entry jumps to early during + * bootup, and never changes them. This is the general layout of the + * IDT entries: + * + * Vectors 0 ... 31 : system traps and exceptions - hardcoded events + * Vectors 32 ... 127 : device interrupts + * Vector 128 : legacy int80 syscall interface + * Vectors 129 ... 237 : device interrupts + * Vectors 238 ... 255 : special interrupts + * + * 64-bit x86 has per CPU IDT tables, 32-bit has one shared IDT table. + * + * This file enumerates the exact layout of them: + */ -#define NMI_VECTOR 0x02 +#define NMI_VECTOR 0x02 /* * IDT vectors usable for external interrupt sources start * at 0x20: */ -#define FIRST_EXTERNAL_VECTOR 0x20 +#define FIRST_EXTERNAL_VECTOR 0x20 #ifdef CONFIG_X86_32 -# define SYSCALL_VECTOR 0x80 +# define SYSCALL_VECTOR 0x80 #else -# define IA32_SYSCALL_VECTOR 0x80 +# define IA32_SYSCALL_VECTOR 0x80 #endif /* * Reserve the lowest usable priority level 0x20 - 0x2f for triggering * cleanup after irq migration. */ -#define IRQ_MOVE_CLEANUP_VECTOR FIRST_EXTERNAL_VECTOR +#define IRQ_MOVE_CLEANUP_VECTOR FIRST_EXTERNAL_VECTOR /* * Vectors 0x30-0x3f are used for ISA interrupts. */ -#define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 0x10) -#define IRQ1_VECTOR (IRQ0_VECTOR + 1) -#define IRQ2_VECTOR (IRQ0_VECTOR + 2) -#define IRQ3_VECTOR (IRQ0_VECTOR + 3) -#define IRQ4_VECTOR (IRQ0_VECTOR + 4) -#define IRQ5_VECTOR (IRQ0_VECTOR + 5) -#define IRQ6_VECTOR (IRQ0_VECTOR + 6) -#define IRQ7_VECTOR (IRQ0_VECTOR + 7) -#define IRQ8_VECTOR (IRQ0_VECTOR + 8) -#define IRQ9_VECTOR (IRQ0_VECTOR + 9) -#define IRQ10_VECTOR (IRQ0_VECTOR + 10) -#define IRQ11_VECTOR (IRQ0_VECTOR + 11) -#define IRQ12_VECTOR (IRQ0_VECTOR + 12) -#define IRQ13_VECTOR (IRQ0_VECTOR + 13) -#define IRQ14_VECTOR (IRQ0_VECTOR + 14) -#define IRQ15_VECTOR (IRQ0_VECTOR + 15) +#define IRQ0_VECTOR (FIRST_EXTERNAL_VECTOR + 0x10) + +#define IRQ1_VECTOR (IRQ0_VECTOR + 1) +#define IRQ2_VECTOR (IRQ0_VECTOR + 2) +#define IRQ3_VECTOR (IRQ0_VECTOR + 3) +#define IRQ4_VECTOR (IRQ0_VECTOR + 4) +#define IRQ5_VECTOR (IRQ0_VECTOR + 5) +#define IRQ6_VECTOR (IRQ0_VECTOR + 6) +#define IRQ7_VECTOR (IRQ0_VECTOR + 7) +#define IRQ8_VECTOR (IRQ0_VECTOR + 8) +#define IRQ9_VECTOR (IRQ0_VECTOR + 9) +#define IRQ10_VECTOR (IRQ0_VECTOR + 10) +#define IRQ11_VECTOR (IRQ0_VECTOR + 11) +#define IRQ12_VECTOR (IRQ0_VECTOR + 12) +#define IRQ13_VECTOR (IRQ0_VECTOR + 13) +#define IRQ14_VECTOR (IRQ0_VECTOR + 14) +#define IRQ15_VECTOR (IRQ0_VECTOR + 15) /* * Special IRQ vectors used by the SMP architecture, 0xf0-0xff @@ -49,119 +71,98 @@ * some of the following vectors are 'rare', they are merged * into a single vector (CALL_FUNCTION_VECTOR) to save vector space. * TLB, reschedule and local APIC vectors are performance-critical. - * - * Vectors 0xf0-0xfa are free (reserved for future Linux use). */ -#ifdef CONFIG_X86_32 - -# define SPURIOUS_APIC_VECTOR 0xff -# define ERROR_APIC_VECTOR 0xfe -# define INVALIDATE_TLB_VECTOR 0xfd -# define RESCHEDULE_VECTOR 0xfc -# define CALL_FUNCTION_VECTOR 0xfb -# define CALL_FUNCTION_SINGLE_VECTOR 0xfa -# define THERMAL_APIC_VECTOR 0xf0 - -#else #define SPURIOUS_APIC_VECTOR 0xff +/* + * Sanity check + */ +#if ((SPURIOUS_APIC_VECTOR & 0x0F) != 0x0F) +# error SPURIOUS_APIC_VECTOR definition error +#endif + #define ERROR_APIC_VECTOR 0xfe #define RESCHEDULE_VECTOR 0xfd #define CALL_FUNCTION_VECTOR 0xfc #define CALL_FUNCTION_SINGLE_VECTOR 0xfb #define THERMAL_APIC_VECTOR 0xfa -#define THRESHOLD_APIC_VECTOR 0xf9 -#define UV_BAU_MESSAGE 0xf8 -#define INVALIDATE_TLB_VECTOR_END 0xf7 -#define INVALIDATE_TLB_VECTOR_START 0xf0 /* f0-f7 used for TLB flush */ - -#define NUM_INVALIDATE_TLB_VECTORS 8 +#ifdef CONFIG_X86_32 +/* 0xf8 - 0xf9 : free */ +#else +# define THRESHOLD_APIC_VECTOR 0xf9 +# define UV_BAU_MESSAGE 0xf8 #endif +/* f0-f7 used for spreading out TLB flushes: */ +#define INVALIDATE_TLB_VECTOR_END 0xf7 +#define INVALIDATE_TLB_VECTOR_START 0xf0 +#define NUM_INVALIDATE_TLB_VECTORS 8 + /* * Local APIC timer IRQ vector is on a different priority level, * to work around the 'lost local interrupt if more than 2 IRQ * sources per level' errata. */ -#define LOCAL_TIMER_VECTOR 0xef +#define LOCAL_TIMER_VECTOR 0xef + +/* + * Performance monitoring interrupt vector: + */ +#define LOCAL_PERF_VECTOR 0xee /* * First APIC vector available to drivers: (vectors 0x30-0xee) we * start at 0x31(0x41) to spread out vectors evenly between priority * levels. (0x80 is the syscall vector) */ -#define FIRST_DEVICE_VECTOR (IRQ15_VECTOR + 2) - -#define NR_VECTORS 256 +#define FIRST_DEVICE_VECTOR (IRQ15_VECTOR + 2) -#define FPU_IRQ 13 +#define NR_VECTORS 256 -#define FIRST_VM86_IRQ 3 -#define LAST_VM86_IRQ 15 -#define invalid_vm86_irq(irq) ((irq) < 3 || (irq) > 15) +#define FPU_IRQ 13 -#define NR_IRQS_LEGACY 16 +#define FIRST_VM86_IRQ 3 +#define LAST_VM86_IRQ 15 -#if defined(CONFIG_X86_IO_APIC) && !defined(CONFIG_X86_VOYAGER) - -#ifndef CONFIG_SPARSE_IRQ -# if NR_CPUS < MAX_IO_APICS -# define NR_IRQS (NR_VECTORS + (32 * NR_CPUS)) -# else -# define NR_IRQS (NR_VECTORS + (32 * MAX_IO_APICS)) -# endif -#else -# if (8 * NR_CPUS) > (32 * MAX_IO_APICS) -# define NR_IRQS (NR_VECTORS + (8 * NR_CPUS)) -# else -# define NR_IRQS (NR_VECTORS + (32 * MAX_IO_APICS)) -# endif +#ifndef __ASSEMBLY__ +static inline int invalid_vm86_irq(int irq) +{ + return irq < 3 || irq > 15; +} #endif -#elif defined(CONFIG_X86_VOYAGER) - -# define NR_IRQS 224 +/* + * Size the maximum number of interrupts. + * + * If the irq_desc[] array has a sparse layout, we can size things + * generously - it scales up linearly with the maximum number of CPUs, + * and the maximum number of IO-APICs, whichever is higher. + * + * In other cases we size more conservatively, to not create too large + * static arrays. + */ -#else /* IO_APIC || VOYAGER */ +#define NR_IRQS_LEGACY 16 -# define NR_IRQS 16 +#define CPU_VECTOR_LIMIT ( 8 * NR_CPUS ) +#define IO_APIC_VECTOR_LIMIT ( 32 * MAX_IO_APICS ) +#ifdef CONFIG_X86_IO_APIC +# ifdef CONFIG_SPARSE_IRQ +# define NR_IRQS \ + (CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT ? \ + (NR_VECTORS + CPU_VECTOR_LIMIT) : \ + (NR_VECTORS + IO_APIC_VECTOR_LIMIT)) +# else +# if NR_CPUS < MAX_IO_APICS +# define NR_IRQS (NR_VECTORS + 4*CPU_VECTOR_LIMIT) +# else +# define NR_IRQS (NR_VECTORS + IO_APIC_VECTOR_LIMIT) +# endif +# endif +#else /* !CONFIG_X86_IO_APIC: */ +# define NR_IRQS NR_IRQS_LEGACY #endif -/* Voyager specific defines */ -/* These define the CPIs we use in linux */ -#define VIC_CPI_LEVEL0 0 -#define VIC_CPI_LEVEL1 1 -/* now the fake CPIs */ -#define VIC_TIMER_CPI 2 -#define VIC_INVALIDATE_CPI 3 -#define VIC_RESCHEDULE_CPI 4 -#define VIC_ENABLE_IRQ_CPI 5 -#define VIC_CALL_FUNCTION_CPI 6 -#define VIC_CALL_FUNCTION_SINGLE_CPI 7 - -/* Now the QIC CPIs: Since we don't need the two initial levels, - * these are 2 less than the VIC CPIs */ -#define QIC_CPI_OFFSET 1 -#define QIC_TIMER_CPI (VIC_TIMER_CPI - QIC_CPI_OFFSET) -#define QIC_INVALIDATE_CPI (VIC_INVALIDATE_CPI - QIC_CPI_OFFSET) -#define QIC_RESCHEDULE_CPI (VIC_RESCHEDULE_CPI - QIC_CPI_OFFSET) -#define QIC_ENABLE_IRQ_CPI (VIC_ENABLE_IRQ_CPI - QIC_CPI_OFFSET) -#define QIC_CALL_FUNCTION_CPI (VIC_CALL_FUNCTION_CPI - QIC_CPI_OFFSET) -#define QIC_CALL_FUNCTION_SINGLE_CPI (VIC_CALL_FUNCTION_SINGLE_CPI - QIC_CPI_OFFSET) - -#define VIC_START_FAKE_CPI VIC_TIMER_CPI -#define VIC_END_FAKE_CPI VIC_CALL_FUNCTION_SINGLE_CPI - -/* this is the SYS_INT CPI. */ -#define VIC_SYS_INT 8 -#define VIC_CMN_INT 15 - -/* This is the boot CPI for alternate processors. It gets overwritten - * by the above once the system has activated all available processors */ -#define VIC_CPU_BOOT_CPI VIC_CPI_LEVEL0 -#define VIC_CPU_BOOT_ERRATA_CPI (VIC_CPI_LEVEL0 + 8) - - #endif /* _ASM_X86_IRQ_VECTORS_H */ Index: linux-2.6-tip/arch/x86/include/asm/kexec.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/kexec.h +++ linux-2.6-tip/arch/x86/include/asm/kexec.h @@ -9,23 +9,8 @@ # define PAGES_NR 4 #else # define PA_CONTROL_PAGE 0 -# define VA_CONTROL_PAGE 1 -# define PA_PGD 2 -# define VA_PGD 3 -# define PA_PUD_0 4 -# define VA_PUD_0 5 -# define PA_PMD_0 6 -# define VA_PMD_0 7 -# define PA_PTE_0 8 -# define VA_PTE_0 9 -# define PA_PUD_1 10 -# define VA_PUD_1 11 -# define PA_PMD_1 12 -# define VA_PMD_1 13 -# define PA_PTE_1 14 -# define VA_PTE_1 15 -# define PA_TABLE_PAGE 16 -# define PAGES_NR 17 +# define PA_TABLE_PAGE 1 +# define PAGES_NR 2 #endif #ifdef CONFIG_X86_32 @@ -157,9 +142,9 @@ relocate_kernel(unsigned long indirectio unsigned long start_address) ATTRIB_NORET; #endif -#ifdef CONFIG_X86_32 #define ARCH_HAS_KIMAGE_ARCH +#ifdef CONFIG_X86_32 struct kimage_arch { pgd_t *pgd; #ifdef CONFIG_X86_PAE @@ -169,6 +154,12 @@ struct kimage_arch { pte_t *pte0; pte_t *pte1; }; +#else +struct kimage_arch { + pud_t *pud; + pmd_t *pmd; + pte_t *pte; +}; #endif #endif /* __ASSEMBLY__ */ Index: linux-2.6-tip/arch/x86/include/asm/kmemcheck.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/kmemcheck.h @@ -0,0 +1,42 @@ +#ifndef ASM_X86_KMEMCHECK_H +#define ASM_X86_KMEMCHECK_H + +#include +#include + +#ifdef CONFIG_KMEMCHECK +bool kmemcheck_active(struct pt_regs *regs); + +void kmemcheck_show(struct pt_regs *regs); +void kmemcheck_hide(struct pt_regs *regs); + +bool kmemcheck_fault(struct pt_regs *regs, + unsigned long address, unsigned long error_code); +bool kmemcheck_trap(struct pt_regs *regs); +#else +static inline bool kmemcheck_active(struct pt_regs *regs) +{ + return false; +} + +static inline void kmemcheck_show(struct pt_regs *regs) +{ +} + +static inline void kmemcheck_hide(struct pt_regs *regs) +{ +} + +static inline bool kmemcheck_fault(struct pt_regs *regs, + unsigned long address, unsigned long error_code) +{ + return false; +} + +static inline bool kmemcheck_trap(struct pt_regs *regs) +{ + return false; +} +#endif /* CONFIG_KMEMCHECK */ + +#endif Index: linux-2.6-tip/arch/x86/include/asm/linkage.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/linkage.h +++ linux-2.6-tip/arch/x86/include/asm/linkage.h @@ -52,70 +52,14 @@ #endif +#define GLOBAL(name) \ + .globl name; \ + name: + #ifdef CONFIG_X86_ALIGNMENT_16 #define __ALIGN .align 16,0x90 #define __ALIGN_STR ".align 16,0x90" #endif -/* - * to check ENTRY_X86/END_X86 and - * KPROBE_ENTRY_X86/KPROBE_END_X86 - * unbalanced-missed-mixed appearance - */ -#define __set_entry_x86 .set ENTRY_X86_IN, 0 -#define __unset_entry_x86 .set ENTRY_X86_IN, 1 -#define __set_kprobe_x86 .set KPROBE_X86_IN, 0 -#define __unset_kprobe_x86 .set KPROBE_X86_IN, 1 - -#define __macro_err_x86 .error "ENTRY_X86/KPROBE_X86 unbalanced,missed,mixed" - -#define __check_entry_x86 \ - .ifdef ENTRY_X86_IN; \ - .ifeq ENTRY_X86_IN; \ - __macro_err_x86; \ - .abort; \ - .endif; \ - .endif - -#define __check_kprobe_x86 \ - .ifdef KPROBE_X86_IN; \ - .ifeq KPROBE_X86_IN; \ - __macro_err_x86; \ - .abort; \ - .endif; \ - .endif - -#define __check_entry_kprobe_x86 \ - __check_entry_x86; \ - __check_kprobe_x86 - -#define ENTRY_KPROBE_FINAL_X86 __check_entry_kprobe_x86 - -#define ENTRY_X86(name) \ - __check_entry_kprobe_x86; \ - __set_entry_x86; \ - .globl name; \ - __ALIGN; \ - name: - -#define END_X86(name) \ - __unset_entry_x86; \ - __check_entry_kprobe_x86; \ - .size name, .-name - -#define KPROBE_ENTRY_X86(name) \ - __check_entry_kprobe_x86; \ - __set_kprobe_x86; \ - .pushsection .kprobes.text, "ax"; \ - .globl name; \ - __ALIGN; \ - name: - -#define KPROBE_END_X86(name) \ - __unset_kprobe_x86; \ - __check_entry_kprobe_x86; \ - .size name, .-name; \ - .popsection - #endif /* _ASM_X86_LINKAGE_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-default/apm.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-default/apm.h +++ /dev/null @@ -1,73 +0,0 @@ -/* - * Machine specific APM BIOS functions for generic. - * Split out from apm.c by Osamu Tomita - */ - -#ifndef _ASM_X86_MACH_DEFAULT_APM_H -#define _ASM_X86_MACH_DEFAULT_APM_H - -#ifdef APM_ZERO_SEGS -# define APM_DO_ZERO_SEGS \ - "pushl %%ds\n\t" \ - "pushl %%es\n\t" \ - "xorl %%edx, %%edx\n\t" \ - "mov %%dx, %%ds\n\t" \ - "mov %%dx, %%es\n\t" \ - "mov %%dx, %%fs\n\t" \ - "mov %%dx, %%gs\n\t" -# define APM_DO_POP_SEGS \ - "popl %%es\n\t" \ - "popl %%ds\n\t" -#else -# define APM_DO_ZERO_SEGS -# define APM_DO_POP_SEGS -#endif - -static inline void apm_bios_call_asm(u32 func, u32 ebx_in, u32 ecx_in, - u32 *eax, u32 *ebx, u32 *ecx, - u32 *edx, u32 *esi) -{ - /* - * N.B. We do NOT need a cld after the BIOS call - * because we always save and restore the flags. - */ - __asm__ __volatile__(APM_DO_ZERO_SEGS - "pushl %%edi\n\t" - "pushl %%ebp\n\t" - "lcall *%%cs:apm_bios_entry\n\t" - "setc %%al\n\t" - "popl %%ebp\n\t" - "popl %%edi\n\t" - APM_DO_POP_SEGS - : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx), - "=S" (*esi) - : "a" (func), "b" (ebx_in), "c" (ecx_in) - : "memory", "cc"); -} - -static inline u8 apm_bios_call_simple_asm(u32 func, u32 ebx_in, - u32 ecx_in, u32 *eax) -{ - int cx, dx, si; - u8 error; - - /* - * N.B. We do NOT need a cld after the BIOS call - * because we always save and restore the flags. - */ - __asm__ __volatile__(APM_DO_ZERO_SEGS - "pushl %%edi\n\t" - "pushl %%ebp\n\t" - "lcall *%%cs:apm_bios_entry\n\t" - "setc %%bl\n\t" - "popl %%ebp\n\t" - "popl %%edi\n\t" - APM_DO_POP_SEGS - : "=a" (*eax), "=b" (error), "=c" (cx), "=d" (dx), - "=S" (si) - : "a" (func), "b" (ebx_in), "c" (ecx_in) - : "memory", "cc"); - return error; -} - -#endif /* _ASM_X86_MACH_DEFAULT_APM_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-default/do_timer.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-default/do_timer.h +++ /dev/null @@ -1,16 +0,0 @@ -/* defines for inline arch setup functions */ -#include - -#include -#include - -/** - * do_timer_interrupt_hook - hook into timer tick - * - * Call the pit clock event handler. see asm/i8253.h - **/ - -static inline void do_timer_interrupt_hook(void) -{ - global_clock_event->event_handler(global_clock_event); -} Index: linux-2.6-tip/arch/x86/include/asm/mach-default/entry_arch.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-default/entry_arch.h +++ /dev/null @@ -1,36 +0,0 @@ -/* - * This file is designed to contain the BUILD_INTERRUPT specifications for - * all of the extra named interrupt vectors used by the architecture. - * Usually this is the Inter Process Interrupts (IPIs) - */ - -/* - * The following vectors are part of the Linux architecture, there - * is no hardware IRQ pin equivalent for them, they are triggered - * through the ICC by us (IPIs) - */ -#ifdef CONFIG_X86_SMP -BUILD_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR) -BUILD_INTERRUPT(invalidate_interrupt,INVALIDATE_TLB_VECTOR) -BUILD_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR) -BUILD_INTERRUPT(call_function_single_interrupt,CALL_FUNCTION_SINGLE_VECTOR) -BUILD_INTERRUPT(irq_move_cleanup_interrupt,IRQ_MOVE_CLEANUP_VECTOR) -#endif - -/* - * every pentium local APIC has two 'local interrupts', with a - * soft-definable vector attached to both interrupts, one of - * which is a timer interrupt, the other one is error counter - * overflow. Linux uses the local APIC timer interrupt to get - * a much simpler SMP time architecture: - */ -#ifdef CONFIG_X86_LOCAL_APIC -BUILD_INTERRUPT(apic_timer_interrupt,LOCAL_TIMER_VECTOR) -BUILD_INTERRUPT(error_interrupt,ERROR_APIC_VECTOR) -BUILD_INTERRUPT(spurious_interrupt,SPURIOUS_APIC_VECTOR) - -#ifdef CONFIG_X86_MCE_P4THERMAL -BUILD_INTERRUPT(thermal_interrupt,THERMAL_APIC_VECTOR) -#endif - -#endif Index: linux-2.6-tip/arch/x86/include/asm/mach-default/mach_apic.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-default/mach_apic.h +++ /dev/null @@ -1,168 +0,0 @@ -#ifndef _ASM_X86_MACH_DEFAULT_MACH_APIC_H -#define _ASM_X86_MACH_DEFAULT_MACH_APIC_H - -#ifdef CONFIG_X86_LOCAL_APIC - -#include -#include - -#define APIC_DFR_VALUE (APIC_DFR_FLAT) - -static inline const struct cpumask *target_cpus(void) -{ -#ifdef CONFIG_SMP - return cpu_online_mask; -#else - return cpumask_of(0); -#endif -} - -#define NO_BALANCE_IRQ (0) -#define esr_disable (0) - -#ifdef CONFIG_X86_64 -#include -#define INT_DELIVERY_MODE (genapic->int_delivery_mode) -#define INT_DEST_MODE (genapic->int_dest_mode) -#define TARGET_CPUS (genapic->target_cpus()) -#define apic_id_registered (genapic->apic_id_registered) -#define init_apic_ldr (genapic->init_apic_ldr) -#define cpu_mask_to_apicid (genapic->cpu_mask_to_apicid) -#define cpu_mask_to_apicid_and (genapic->cpu_mask_to_apicid_and) -#define phys_pkg_id (genapic->phys_pkg_id) -#define vector_allocation_domain (genapic->vector_allocation_domain) -#define read_apic_id() (GET_APIC_ID(apic_read(APIC_ID))) -#define send_IPI_self (genapic->send_IPI_self) -#define wakeup_secondary_cpu (genapic->wakeup_cpu) -extern void setup_apic_routing(void); -#else -#define INT_DELIVERY_MODE dest_LowestPrio -#define INT_DEST_MODE 1 /* logical delivery broadcast to all procs */ -#define TARGET_CPUS (target_cpus()) -#define wakeup_secondary_cpu wakeup_secondary_cpu_via_init -/* - * Set up the logical destination ID. - * - * Intel recommends to set DFR, LDR and TPR before enabling - * an APIC. See e.g. "AP-388 82489DX User's Manual" (Intel - * document number 292116). So here it goes... - */ -static inline void init_apic_ldr(void) -{ - unsigned long val; - - apic_write(APIC_DFR, APIC_DFR_VALUE); - val = apic_read(APIC_LDR) & ~APIC_LDR_MASK; - val |= SET_APIC_LOGICAL_ID(1UL << smp_processor_id()); - apic_write(APIC_LDR, val); -} - -static inline int apic_id_registered(void) -{ - return physid_isset(read_apic_id(), phys_cpu_present_map); -} - -static inline unsigned int cpu_mask_to_apicid(const struct cpumask *cpumask) -{ - return cpumask_bits(cpumask)[0]; -} - -static inline unsigned int cpu_mask_to_apicid_and(const struct cpumask *cpumask, - const struct cpumask *andmask) -{ - unsigned long mask1 = cpumask_bits(cpumask)[0]; - unsigned long mask2 = cpumask_bits(andmask)[0]; - unsigned long mask3 = cpumask_bits(cpu_online_mask)[0]; - - return (unsigned int)(mask1 & mask2 & mask3); -} - -static inline u32 phys_pkg_id(u32 cpuid_apic, int index_msb) -{ - return cpuid_apic >> index_msb; -} - -static inline void setup_apic_routing(void) -{ -#ifdef CONFIG_X86_IO_APIC - printk("Enabling APIC mode: %s. Using %d I/O APICs\n", - "Flat", nr_ioapics); -#endif -} - -static inline int apicid_to_node(int logical_apicid) -{ -#ifdef CONFIG_SMP - return apicid_2_node[hard_smp_processor_id()]; -#else - return 0; -#endif -} - -static inline void vector_allocation_domain(int cpu, struct cpumask *retmask) -{ - /* Careful. Some cpus do not strictly honor the set of cpus - * specified in the interrupt destination when using lowest - * priority interrupt delivery mode. - * - * In particular there was a hyperthreading cpu observed to - * deliver interrupts to the wrong hyperthread when only one - * hyperthread was specified in the interrupt desitination. - */ - *retmask = (cpumask_t) { { [0] = APIC_ALL_CPUS } }; -} -#endif - -static inline unsigned long check_apicid_used(physid_mask_t bitmap, int apicid) -{ - return physid_isset(apicid, bitmap); -} - -static inline unsigned long check_apicid_present(int bit) -{ - return physid_isset(bit, phys_cpu_present_map); -} - -static inline physid_mask_t ioapic_phys_id_map(physid_mask_t phys_map) -{ - return phys_map; -} - -static inline int multi_timer_check(int apic, int irq) -{ - return 0; -} - -/* Mapping from cpu number to logical apicid */ -static inline int cpu_to_logical_apicid(int cpu) -{ - return 1 << cpu; -} - -static inline int cpu_present_to_apicid(int mps_cpu) -{ - if (mps_cpu < nr_cpu_ids && cpu_present(mps_cpu)) - return (int)per_cpu(x86_bios_cpu_apicid, mps_cpu); - else - return BAD_APICID; -} - -static inline physid_mask_t apicid_to_cpu_present(int phys_apicid) -{ - return physid_mask_of_physid(phys_apicid); -} - -static inline void setup_portio_remap(void) -{ -} - -static inline int check_phys_apicid_present(int boot_cpu_physical_apicid) -{ - return physid_isset(boot_cpu_physical_apicid, phys_cpu_present_map); -} - -static inline void enable_apic_mode(void) -{ -} -#endif /* CONFIG_X86_LOCAL_APIC */ -#endif /* _ASM_X86_MACH_DEFAULT_MACH_APIC_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-default/mach_apicdef.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-default/mach_apicdef.h +++ /dev/null @@ -1,24 +0,0 @@ -#ifndef _ASM_X86_MACH_DEFAULT_MACH_APICDEF_H -#define _ASM_X86_MACH_DEFAULT_MACH_APICDEF_H - -#include - -#ifdef CONFIG_X86_64 -#define APIC_ID_MASK (genapic->apic_id_mask) -#define GET_APIC_ID(x) (genapic->get_apic_id(x)) -#define SET_APIC_ID(x) (genapic->set_apic_id(x)) -#else -#define APIC_ID_MASK (0xF<<24) -static inline unsigned get_apic_id(unsigned long x) -{ - unsigned int ver = GET_APIC_VERSION(apic_read(APIC_LVR)); - if (APIC_XAPIC(ver)) - return (((x)>>24)&0xFF); - else - return (((x)>>24)&0xF); -} - -#define GET_APIC_ID(x) get_apic_id(x) -#endif - -#endif /* _ASM_X86_MACH_DEFAULT_MACH_APICDEF_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-default/mach_ipi.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-default/mach_ipi.h +++ /dev/null @@ -1,64 +0,0 @@ -#ifndef _ASM_X86_MACH_DEFAULT_MACH_IPI_H -#define _ASM_X86_MACH_DEFAULT_MACH_IPI_H - -/* Avoid include hell */ -#define NMI_VECTOR 0x02 - -void send_IPI_mask_bitmask(const struct cpumask *mask, int vector); -void send_IPI_mask_allbutself(const struct cpumask *mask, int vector); -void __send_IPI_shortcut(unsigned int shortcut, int vector); - -extern int no_broadcast; - -#ifdef CONFIG_X86_64 -#include -#define send_IPI_mask (genapic->send_IPI_mask) -#define send_IPI_mask_allbutself (genapic->send_IPI_mask_allbutself) -#else -static inline void send_IPI_mask(const struct cpumask *mask, int vector) -{ - send_IPI_mask_bitmask(mask, vector); -} -void send_IPI_mask_allbutself(const struct cpumask *mask, int vector); -#endif - -static inline void __local_send_IPI_allbutself(int vector) -{ - if (no_broadcast || vector == NMI_VECTOR) - send_IPI_mask_allbutself(cpu_online_mask, vector); - else - __send_IPI_shortcut(APIC_DEST_ALLBUT, vector); -} - -static inline void __local_send_IPI_all(int vector) -{ - if (no_broadcast || vector == NMI_VECTOR) - send_IPI_mask(cpu_online_mask, vector); - else - __send_IPI_shortcut(APIC_DEST_ALLINC, vector); -} - -#ifdef CONFIG_X86_64 -#define send_IPI_allbutself (genapic->send_IPI_allbutself) -#define send_IPI_all (genapic->send_IPI_all) -#else -static inline void send_IPI_allbutself(int vector) -{ - /* - * if there are no other CPUs in the system then we get an APIC send - * error if we try to broadcast, thus avoid sending IPIs in this case. - */ - if (!(num_online_cpus() > 1)) - return; - - __local_send_IPI_allbutself(vector); - return; -} - -static inline void send_IPI_all(int vector) -{ - __local_send_IPI_all(vector); -} -#endif - -#endif /* _ASM_X86_MACH_DEFAULT_MACH_IPI_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-default/mach_mpparse.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-default/mach_mpparse.h +++ /dev/null @@ -1,17 +0,0 @@ -#ifndef _ASM_X86_MACH_DEFAULT_MACH_MPPARSE_H -#define _ASM_X86_MACH_DEFAULT_MACH_MPPARSE_H - -static inline int -mps_oem_check(struct mpc_table *mpc, char *oem, char *productid) -{ - return 0; -} - -/* Hook from generic ACPI tables.c */ -static inline int acpi_madt_oem_check(char *oem_id, char *oem_table_id) -{ - return 0; -} - - -#endif /* _ASM_X86_MACH_DEFAULT_MACH_MPPARSE_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-default/mach_mpspec.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-default/mach_mpspec.h +++ /dev/null @@ -1,12 +0,0 @@ -#ifndef _ASM_X86_MACH_DEFAULT_MACH_MPSPEC_H -#define _ASM_X86_MACH_DEFAULT_MACH_MPSPEC_H - -#define MAX_IRQ_SOURCES 256 - -#if CONFIG_BASE_SMALL == 0 -#define MAX_MP_BUSSES 256 -#else -#define MAX_MP_BUSSES 32 -#endif - -#endif /* _ASM_X86_MACH_DEFAULT_MACH_MPSPEC_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-default/mach_timer.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-default/mach_timer.h +++ /dev/null @@ -1,48 +0,0 @@ -/* - * Machine specific calibrate_tsc() for generic. - * Split out from timer_tsc.c by Osamu Tomita - */ -/* ------ Calibrate the TSC ------- - * Return 2^32 * (1 / (TSC clocks per usec)) for do_fast_gettimeoffset(). - * Too much 64-bit arithmetic here to do this cleanly in C, and for - * accuracy's sake we want to keep the overhead on the CTC speaker (channel 2) - * output busy loop as low as possible. We avoid reading the CTC registers - * directly because of the awkward 8-bit access mechanism of the 82C54 - * device. - */ -#ifndef _ASM_X86_MACH_DEFAULT_MACH_TIMER_H -#define _ASM_X86_MACH_DEFAULT_MACH_TIMER_H - -#define CALIBRATE_TIME_MSEC 30 /* 30 msecs */ -#define CALIBRATE_LATCH \ - ((CLOCK_TICK_RATE * CALIBRATE_TIME_MSEC + 1000/2)/1000) - -static inline void mach_prepare_counter(void) -{ - /* Set the Gate high, disable speaker */ - outb((inb(0x61) & ~0x02) | 0x01, 0x61); - - /* - * Now let's take care of CTC channel 2 - * - * Set the Gate high, program CTC channel 2 for mode 0, - * (interrupt on terminal count mode), binary count, - * load 5 * LATCH count, (LSB and MSB) to begin countdown. - * - * Some devices need a delay here. - */ - outb(0xb0, 0x43); /* binary, mode 0, LSB/MSB, Ch 2 */ - outb_p(CALIBRATE_LATCH & 0xff, 0x42); /* LSB of count */ - outb_p(CALIBRATE_LATCH >> 8, 0x42); /* MSB of count */ -} - -static inline void mach_countup(unsigned long *count_p) -{ - unsigned long count = 0; - do { - count++; - } while ((inb_p(0x61) & 0x20) == 0); - *count_p = count; -} - -#endif /* _ASM_X86_MACH_DEFAULT_MACH_TIMER_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-default/mach_traps.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-default/mach_traps.h +++ /dev/null @@ -1,33 +0,0 @@ -/* - * Machine specific NMI handling for generic. - * Split out from traps.c by Osamu Tomita - */ -#ifndef _ASM_X86_MACH_DEFAULT_MACH_TRAPS_H -#define _ASM_X86_MACH_DEFAULT_MACH_TRAPS_H - -#include - -static inline unsigned char get_nmi_reason(void) -{ - return inb(0x61); -} - -static inline void reassert_nmi(void) -{ - int old_reg = -1; - - if (do_i_have_lock_cmos()) - old_reg = current_lock_cmos_reg(); - else - lock_cmos(0); /* register doesn't matter here */ - outb(0x8f, 0x70); - inb(0x71); /* dummy */ - outb(0x0f, 0x70); - inb(0x71); /* dummy */ - if (old_reg >= 0) - outb(old_reg, 0x70); - else - unlock_cmos(); -} - -#endif /* _ASM_X86_MACH_DEFAULT_MACH_TRAPS_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-default/mach_wakecpu.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-default/mach_wakecpu.h +++ /dev/null @@ -1,41 +0,0 @@ -#ifndef _ASM_X86_MACH_DEFAULT_MACH_WAKECPU_H -#define _ASM_X86_MACH_DEFAULT_MACH_WAKECPU_H - -#define TRAMPOLINE_PHYS_LOW (0x467) -#define TRAMPOLINE_PHYS_HIGH (0x469) - -static inline void wait_for_init_deassert(atomic_t *deassert) -{ - while (!atomic_read(deassert)) - cpu_relax(); - return; -} - -/* Nothing to do for most platforms, since cleared by the INIT cycle */ -static inline void smp_callin_clear_local_apic(void) -{ -} - -static inline void store_NMI_vector(unsigned short *high, unsigned short *low) -{ -} - -static inline void restore_NMI_vector(unsigned short *high, unsigned short *low) -{ -} - -#ifdef CONFIG_SMP -extern void __inquire_remote_apic(int apicid); -#else /* CONFIG_SMP */ -static inline void __inquire_remote_apic(int apicid) -{ -} -#endif /* CONFIG_SMP */ - -static inline void inquire_remote_apic(int apicid) -{ - if (apic_verbosity >= APIC_DEBUG) - __inquire_remote_apic(apicid); -} - -#endif /* _ASM_X86_MACH_DEFAULT_MACH_WAKECPU_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-default/pci-functions.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-default/pci-functions.h +++ /dev/null @@ -1,19 +0,0 @@ -/* - * PCI BIOS function numbering for conventional PCI BIOS - * systems - */ - -#define PCIBIOS_PCI_FUNCTION_ID 0xb1XX -#define PCIBIOS_PCI_BIOS_PRESENT 0xb101 -#define PCIBIOS_FIND_PCI_DEVICE 0xb102 -#define PCIBIOS_FIND_PCI_CLASS_CODE 0xb103 -#define PCIBIOS_GENERATE_SPECIAL_CYCLE 0xb106 -#define PCIBIOS_READ_CONFIG_BYTE 0xb108 -#define PCIBIOS_READ_CONFIG_WORD 0xb109 -#define PCIBIOS_READ_CONFIG_DWORD 0xb10a -#define PCIBIOS_WRITE_CONFIG_BYTE 0xb10b -#define PCIBIOS_WRITE_CONFIG_WORD 0xb10c -#define PCIBIOS_WRITE_CONFIG_DWORD 0xb10d -#define PCIBIOS_GET_ROUTING_OPTIONS 0xb10e -#define PCIBIOS_SET_PCI_HW_INT 0xb10f - Index: linux-2.6-tip/arch/x86/include/asm/mach-default/setup_arch.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-default/setup_arch.h +++ /dev/null @@ -1,3 +0,0 @@ -/* Hook to call BIOS initialisation function */ - -/* no action for generic */ Index: linux-2.6-tip/arch/x86/include/asm/mach-default/smpboot_hooks.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-default/smpboot_hooks.h +++ /dev/null @@ -1,61 +0,0 @@ -/* two abstractions specific to kernel/smpboot.c, mainly to cater to visws - * which needs to alter them. */ - -static inline void smpboot_clear_io_apic_irqs(void) -{ -#ifdef CONFIG_X86_IO_APIC - io_apic_irqs = 0; -#endif -} - -static inline void smpboot_setup_warm_reset_vector(unsigned long start_eip) -{ - CMOS_WRITE(0xa, 0xf); - local_flush_tlb(); - pr_debug("1.\n"); - *((volatile unsigned short *)phys_to_virt(TRAMPOLINE_PHYS_HIGH)) = - start_eip >> 4; - pr_debug("2.\n"); - *((volatile unsigned short *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = - start_eip & 0xf; - pr_debug("3.\n"); -} - -static inline void smpboot_restore_warm_reset_vector(void) -{ - /* - * Install writable page 0 entry to set BIOS data area. - */ - local_flush_tlb(); - - /* - * Paranoid: Set warm reset code and vector here back - * to default values. - */ - CMOS_WRITE(0, 0xf); - - *((volatile long *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0; -} - -static inline void __init smpboot_setup_io_apic(void) -{ -#ifdef CONFIG_X86_IO_APIC - /* - * Here we can be sure that there is an IO-APIC in the system. Let's - * go and set it up: - */ - if (!skip_ioapic_setup && nr_ioapics) - setup_IO_APIC(); - else { - nr_ioapics = 0; - localise_nmi_watchdog(); - } -#endif -} - -static inline void smpboot_clear_io_apic(void) -{ -#ifdef CONFIG_X86_IO_APIC - nr_ioapics = 0; -#endif -} Index: linux-2.6-tip/arch/x86/include/asm/mach-generic/gpio.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-generic/gpio.h +++ /dev/null @@ -1,15 +0,0 @@ -#ifndef _ASM_X86_MACH_GENERIC_GPIO_H -#define _ASM_X86_MACH_GENERIC_GPIO_H - -int gpio_request(unsigned gpio, const char *label); -void gpio_free(unsigned gpio); -int gpio_direction_input(unsigned gpio); -int gpio_direction_output(unsigned gpio, int value); -int gpio_get_value(unsigned gpio); -void gpio_set_value(unsigned gpio, int value); -int gpio_to_irq(unsigned gpio); -int irq_to_gpio(unsigned irq); - -#include /* cansleep wrappers */ - -#endif /* _ASM_X86_MACH_GENERIC_GPIO_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-generic/mach_apic.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-generic/mach_apic.h +++ /dev/null @@ -1,35 +0,0 @@ -#ifndef _ASM_X86_MACH_GENERIC_MACH_APIC_H -#define _ASM_X86_MACH_GENERIC_MACH_APIC_H - -#include - -#define esr_disable (genapic->ESR_DISABLE) -#define NO_BALANCE_IRQ (genapic->no_balance_irq) -#define INT_DELIVERY_MODE (genapic->int_delivery_mode) -#define INT_DEST_MODE (genapic->int_dest_mode) -#undef APIC_DEST_LOGICAL -#define APIC_DEST_LOGICAL (genapic->apic_destination_logical) -#define TARGET_CPUS (genapic->target_cpus()) -#define apic_id_registered (genapic->apic_id_registered) -#define init_apic_ldr (genapic->init_apic_ldr) -#define ioapic_phys_id_map (genapic->ioapic_phys_id_map) -#define setup_apic_routing (genapic->setup_apic_routing) -#define multi_timer_check (genapic->multi_timer_check) -#define apicid_to_node (genapic->apicid_to_node) -#define cpu_to_logical_apicid (genapic->cpu_to_logical_apicid) -#define cpu_present_to_apicid (genapic->cpu_present_to_apicid) -#define apicid_to_cpu_present (genapic->apicid_to_cpu_present) -#define setup_portio_remap (genapic->setup_portio_remap) -#define check_apicid_present (genapic->check_apicid_present) -#define check_phys_apicid_present (genapic->check_phys_apicid_present) -#define check_apicid_used (genapic->check_apicid_used) -#define cpu_mask_to_apicid (genapic->cpu_mask_to_apicid) -#define cpu_mask_to_apicid_and (genapic->cpu_mask_to_apicid_and) -#define vector_allocation_domain (genapic->vector_allocation_domain) -#define enable_apic_mode (genapic->enable_apic_mode) -#define phys_pkg_id (genapic->phys_pkg_id) -#define wakeup_secondary_cpu (genapic->wakeup_cpu) - -extern void generic_bigsmp_probe(void); - -#endif /* _ASM_X86_MACH_GENERIC_MACH_APIC_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-generic/mach_apicdef.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-generic/mach_apicdef.h +++ /dev/null @@ -1,11 +0,0 @@ -#ifndef _ASM_X86_MACH_GENERIC_MACH_APICDEF_H -#define _ASM_X86_MACH_GENERIC_MACH_APICDEF_H - -#ifndef APIC_DEFINITION -#include - -#define GET_APIC_ID (genapic->get_apic_id) -#define APIC_ID_MASK (genapic->apic_id_mask) -#endif - -#endif /* _ASM_X86_MACH_GENERIC_MACH_APICDEF_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-generic/mach_ipi.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-generic/mach_ipi.h +++ /dev/null @@ -1,10 +0,0 @@ -#ifndef _ASM_X86_MACH_GENERIC_MACH_IPI_H -#define _ASM_X86_MACH_GENERIC_MACH_IPI_H - -#include - -#define send_IPI_mask (genapic->send_IPI_mask) -#define send_IPI_allbutself (genapic->send_IPI_allbutself) -#define send_IPI_all (genapic->send_IPI_all) - -#endif /* _ASM_X86_MACH_GENERIC_MACH_IPI_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-generic/mach_mpparse.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-generic/mach_mpparse.h +++ /dev/null @@ -1,9 +0,0 @@ -#ifndef _ASM_X86_MACH_GENERIC_MACH_MPPARSE_H -#define _ASM_X86_MACH_GENERIC_MACH_MPPARSE_H - - -extern int mps_oem_check(struct mpc_table *, char *, char *); - -extern int acpi_madt_oem_check(char *, char *); - -#endif /* _ASM_X86_MACH_GENERIC_MACH_MPPARSE_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-generic/mach_mpspec.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-generic/mach_mpspec.h +++ /dev/null @@ -1,12 +0,0 @@ -#ifndef _ASM_X86_MACH_GENERIC_MACH_MPSPEC_H -#define _ASM_X86_MACH_GENERIC_MACH_MPSPEC_H - -#define MAX_IRQ_SOURCES 256 - -/* Summit or generic (i.e. installer) kernels need lots of bus entries. */ -/* Maximum 256 PCI busses, plus 1 ISA bus in each of 4 cabinets. */ -#define MAX_MP_BUSSES 260 - -extern void numaq_mps_oem_check(struct mpc_table *, char *, char *); - -#endif /* _ASM_X86_MACH_GENERIC_MACH_MPSPEC_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-generic/mach_wakecpu.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-generic/mach_wakecpu.h +++ /dev/null @@ -1,12 +0,0 @@ -#ifndef _ASM_X86_MACH_GENERIC_MACH_WAKECPU_H -#define _ASM_X86_MACH_GENERIC_MACH_WAKECPU_H - -#define TRAMPOLINE_PHYS_LOW (genapic->trampoline_phys_low) -#define TRAMPOLINE_PHYS_HIGH (genapic->trampoline_phys_high) -#define wait_for_init_deassert (genapic->wait_for_init_deassert) -#define smp_callin_clear_local_apic (genapic->smp_callin_clear_local_apic) -#define store_NMI_vector (genapic->store_NMI_vector) -#define restore_NMI_vector (genapic->restore_NMI_vector) -#define inquire_remote_apic (genapic->inquire_remote_apic) - -#endif /* _ASM_X86_MACH_GENERIC_MACH_APIC_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-rdc321x/gpio.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-rdc321x/gpio.h +++ /dev/null @@ -1,60 +0,0 @@ -#ifndef _ASM_X86_MACH_RDC321X_GPIO_H -#define _ASM_X86_MACH_RDC321X_GPIO_H - -#include - -extern int rdc_gpio_get_value(unsigned gpio); -extern void rdc_gpio_set_value(unsigned gpio, int value); -extern int rdc_gpio_direction_input(unsigned gpio); -extern int rdc_gpio_direction_output(unsigned gpio, int value); -extern int rdc_gpio_request(unsigned gpio, const char *label); -extern void rdc_gpio_free(unsigned gpio); -extern void __init rdc321x_gpio_setup(void); - -/* Wrappers for the arch-neutral GPIO API */ - -static inline int gpio_request(unsigned gpio, const char *label) -{ - return rdc_gpio_request(gpio, label); -} - -static inline void gpio_free(unsigned gpio) -{ - might_sleep(); - rdc_gpio_free(gpio); -} - -static inline int gpio_direction_input(unsigned gpio) -{ - return rdc_gpio_direction_input(gpio); -} - -static inline int gpio_direction_output(unsigned gpio, int value) -{ - return rdc_gpio_direction_output(gpio, value); -} - -static inline int gpio_get_value(unsigned gpio) -{ - return rdc_gpio_get_value(gpio); -} - -static inline void gpio_set_value(unsigned gpio, int value) -{ - rdc_gpio_set_value(gpio, value); -} - -static inline int gpio_to_irq(unsigned gpio) -{ - return gpio; -} - -static inline int irq_to_gpio(unsigned irq) -{ - return irq; -} - -/* For cansleep */ -#include - -#endif /* _ASM_X86_MACH_RDC321X_GPIO_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach-rdc321x/rdc321x_defs.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-rdc321x/rdc321x_defs.h +++ /dev/null @@ -1,12 +0,0 @@ -#define PFX "rdc321x: " - -/* General purpose configuration and data registers */ -#define RDC3210_CFGREG_ADDR 0x0CF8 -#define RDC3210_CFGREG_DATA 0x0CFC - -#define RDC321X_GPIO_CTRL_REG1 0x48 -#define RDC321X_GPIO_CTRL_REG2 0x84 -#define RDC321X_GPIO_DATA_REG1 0x4c -#define RDC321X_GPIO_DATA_REG2 0x88 - -#define RDC321X_MAX_GPIO 58 Index: linux-2.6-tip/arch/x86/include/asm/mach-voyager/do_timer.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-voyager/do_timer.h +++ /dev/null @@ -1,17 +0,0 @@ -/* defines for inline arch setup functions */ -#include - -#include -#include - -/** - * do_timer_interrupt_hook - hook into timer tick - * - * Call the pit clock event handler. see asm/i8253.h - **/ -static inline void do_timer_interrupt_hook(void) -{ - global_clock_event->event_handler(global_clock_event); - voyager_timer_interrupt(); -} - Index: linux-2.6-tip/arch/x86/include/asm/mach-voyager/entry_arch.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-voyager/entry_arch.h +++ /dev/null @@ -1,26 +0,0 @@ -/* -*- mode: c; c-basic-offset: 8 -*- */ - -/* Copyright (C) 2002 - * - * Author: James.Bottomley@HansenPartnership.com - * - * linux/arch/i386/voyager/entry_arch.h - * - * This file builds the VIC and QIC CPI gates - */ - -/* initialise the voyager interrupt gates - * - * This uses the macros in irq.h to set up assembly jump gates. The - * calls are then redirected to the same routine with smp_ prefixed */ -BUILD_INTERRUPT(vic_sys_interrupt, VIC_SYS_INT) -BUILD_INTERRUPT(vic_cmn_interrupt, VIC_CMN_INT) -BUILD_INTERRUPT(vic_cpi_interrupt, VIC_CPI_LEVEL0); - -/* do all the QIC interrupts */ -BUILD_INTERRUPT(qic_timer_interrupt, QIC_TIMER_CPI); -BUILD_INTERRUPT(qic_invalidate_interrupt, QIC_INVALIDATE_CPI); -BUILD_INTERRUPT(qic_reschedule_interrupt, QIC_RESCHEDULE_CPI); -BUILD_INTERRUPT(qic_enable_irq_interrupt, QIC_ENABLE_IRQ_CPI); -BUILD_INTERRUPT(qic_call_function_interrupt, QIC_CALL_FUNCTION_CPI); -BUILD_INTERRUPT(qic_call_function_single_interrupt, QIC_CALL_FUNCTION_SINGLE_CPI); Index: linux-2.6-tip/arch/x86/include/asm/mach-voyager/setup_arch.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mach-voyager/setup_arch.h +++ /dev/null @@ -1,12 +0,0 @@ -#include -#include -#define VOYAGER_BIOS_INFO ((struct voyager_bios_info *) \ - (&boot_params.apm_bios_info)) - -/* Hook to call BIOS initialisation function */ - -/* for voyager, pass the voyager BIOS/SUS info area to the detection - * routines */ - -#define ARCH_SETUP voyager_detect(VOYAGER_BIOS_INFO); - Index: linux-2.6-tip/arch/x86/include/asm/mach_timer.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/mach_timer.h @@ -0,0 +1,48 @@ +/* + * Machine specific calibrate_tsc() for generic. + * Split out from timer_tsc.c by Osamu Tomita + */ +/* ------ Calibrate the TSC ------- + * Return 2^32 * (1 / (TSC clocks per usec)) for do_fast_gettimeoffset(). + * Too much 64-bit arithmetic here to do this cleanly in C, and for + * accuracy's sake we want to keep the overhead on the CTC speaker (channel 2) + * output busy loop as low as possible. We avoid reading the CTC registers + * directly because of the awkward 8-bit access mechanism of the 82C54 + * device. + */ +#ifndef _ASM_X86_MACH_DEFAULT_MACH_TIMER_H +#define _ASM_X86_MACH_DEFAULT_MACH_TIMER_H + +#define CALIBRATE_TIME_MSEC 30 /* 30 msecs */ +#define CALIBRATE_LATCH \ + ((CLOCK_TICK_RATE * CALIBRATE_TIME_MSEC + 1000/2)/1000) + +static inline void mach_prepare_counter(void) +{ + /* Set the Gate high, disable speaker */ + outb((inb(0x61) & ~0x02) | 0x01, 0x61); + + /* + * Now let's take care of CTC channel 2 + * + * Set the Gate high, program CTC channel 2 for mode 0, + * (interrupt on terminal count mode), binary count, + * load 5 * LATCH count, (LSB and MSB) to begin countdown. + * + * Some devices need a delay here. + */ + outb(0xb0, 0x43); /* binary, mode 0, LSB/MSB, Ch 2 */ + outb_p(CALIBRATE_LATCH & 0xff, 0x42); /* LSB of count */ + outb_p(CALIBRATE_LATCH >> 8, 0x42); /* MSB of count */ +} + +static inline void mach_countup(unsigned long *count_p) +{ + unsigned long count = 0; + do { + count++; + } while ((inb_p(0x61) & 0x20) == 0); + *count_p = count; +} + +#endif /* _ASM_X86_MACH_DEFAULT_MACH_TIMER_H */ Index: linux-2.6-tip/arch/x86/include/asm/mach_traps.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/mach_traps.h @@ -0,0 +1,33 @@ +/* + * Machine specific NMI handling for generic. + * Split out from traps.c by Osamu Tomita + */ +#ifndef _ASM_X86_MACH_DEFAULT_MACH_TRAPS_H +#define _ASM_X86_MACH_DEFAULT_MACH_TRAPS_H + +#include + +static inline unsigned char get_nmi_reason(void) +{ + return inb(0x61); +} + +static inline void reassert_nmi(void) +{ + int old_reg = -1; + + if (do_i_have_lock_cmos()) + old_reg = current_lock_cmos_reg(); + else + lock_cmos(0); /* register doesn't matter here */ + outb(0x8f, 0x70); + inb(0x71); /* dummy */ + outb(0x0f, 0x70); + inb(0x71); /* dummy */ + if (old_reg >= 0) + outb(old_reg, 0x70); + else + unlock_cmos(); +} + +#endif /* _ASM_X86_MACH_DEFAULT_MACH_TRAPS_H */ Index: linux-2.6-tip/arch/x86/include/asm/mce.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mce.h +++ linux-2.6-tip/arch/x86/include/asm/mce.h @@ -120,8 +120,6 @@ extern void mcheck_init(struct cpuinfo_x #else #define mcheck_init(c) do { } while (0) #endif -extern void stop_mce(void); -extern void restart_mce(void); #endif /* __KERNEL__ */ #endif /* _ASM_X86_MCE_H */ Index: linux-2.6-tip/arch/x86/include/asm/mmu_context.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mmu_context.h +++ linux-2.6-tip/arch/x86/include/asm/mmu_context.h @@ -21,11 +21,54 @@ static inline void paravirt_activate_mm( int init_new_context(struct task_struct *tsk, struct mm_struct *mm); void destroy_context(struct mm_struct *mm); -#ifdef CONFIG_X86_32 -# include "mmu_context_32.h" -#else -# include "mmu_context_64.h" + +static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) +{ +#ifdef CONFIG_SMP + if (percpu_read(cpu_tlbstate.state) == TLBSTATE_OK) + percpu_write(cpu_tlbstate.state, TLBSTATE_LAZY); +#endif +} + +static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, + struct task_struct *tsk) +{ + unsigned cpu = smp_processor_id(); + + if (likely(prev != next)) { + /* stop flush ipis for the previous mm */ + cpu_clear(cpu, prev->cpu_vm_mask); +#ifdef CONFIG_SMP + percpu_write(cpu_tlbstate.state, TLBSTATE_OK); + percpu_write(cpu_tlbstate.active_mm, next); #endif + cpu_set(cpu, next->cpu_vm_mask); + + /* Re-load page tables */ + load_cr3(next->pgd); + + /* + * load the LDT, if the LDT is different: + */ + if (unlikely(prev->context.ldt != next->context.ldt)) + load_LDT_nolock(&next->context); + } +#ifdef CONFIG_SMP + else { + percpu_write(cpu_tlbstate.state, TLBSTATE_OK); + BUG_ON(percpu_read(cpu_tlbstate.active_mm) != next); + + if (!cpu_test_and_set(cpu, next->cpu_vm_mask)) { + /* We were in lazy tlb mode and leave_mm disabled + * tlb flush IPI delivery. We must reload CR3 + * to make sure to use no freed page tables. + */ + load_cr3(next->pgd); + load_LDT_nolock(&next->context); + } + } +#endif +} #define activate_mm(prev, next) \ do { \ @@ -33,5 +76,17 @@ do { \ switch_mm((prev), (next), NULL); \ } while (0); +#ifdef CONFIG_X86_32 +#define deactivate_mm(tsk, mm) \ +do { \ + lazy_load_gs(0); \ +} while (0) +#else +#define deactivate_mm(tsk, mm) \ +do { \ + load_gs_index(0); \ + loadsegment(fs, 0); \ +} while (0) +#endif #endif /* _ASM_X86_MMU_CONTEXT_H */ Index: linux-2.6-tip/arch/x86/include/asm/mmu_context_32.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mmu_context_32.h +++ /dev/null @@ -1,55 +0,0 @@ -#ifndef _ASM_X86_MMU_CONTEXT_32_H -#define _ASM_X86_MMU_CONTEXT_32_H - -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) -{ -#ifdef CONFIG_SMP - if (x86_read_percpu(cpu_tlbstate.state) == TLBSTATE_OK) - x86_write_percpu(cpu_tlbstate.state, TLBSTATE_LAZY); -#endif -} - -static inline void switch_mm(struct mm_struct *prev, - struct mm_struct *next, - struct task_struct *tsk) -{ - int cpu = smp_processor_id(); - - if (likely(prev != next)) { - /* stop flush ipis for the previous mm */ - cpu_clear(cpu, prev->cpu_vm_mask); -#ifdef CONFIG_SMP - x86_write_percpu(cpu_tlbstate.state, TLBSTATE_OK); - x86_write_percpu(cpu_tlbstate.active_mm, next); -#endif - cpu_set(cpu, next->cpu_vm_mask); - - /* Re-load page tables */ - load_cr3(next->pgd); - - /* - * load the LDT, if the LDT is different: - */ - if (unlikely(prev->context.ldt != next->context.ldt)) - load_LDT_nolock(&next->context); - } -#ifdef CONFIG_SMP - else { - x86_write_percpu(cpu_tlbstate.state, TLBSTATE_OK); - BUG_ON(x86_read_percpu(cpu_tlbstate.active_mm) != next); - - if (!cpu_test_and_set(cpu, next->cpu_vm_mask)) { - /* We were in lazy tlb mode and leave_mm disabled - * tlb flush IPI delivery. We must reload %cr3. - */ - load_cr3(next->pgd); - load_LDT_nolock(&next->context); - } - } -#endif -} - -#define deactivate_mm(tsk, mm) \ - asm("movl %0,%%gs": :"r" (0)); - -#endif /* _ASM_X86_MMU_CONTEXT_32_H */ Index: linux-2.6-tip/arch/x86/include/asm/mmu_context_64.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mmu_context_64.h +++ /dev/null @@ -1,54 +0,0 @@ -#ifndef _ASM_X86_MMU_CONTEXT_64_H -#define _ASM_X86_MMU_CONTEXT_64_H - -#include - -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) -{ -#ifdef CONFIG_SMP - if (read_pda(mmu_state) == TLBSTATE_OK) - write_pda(mmu_state, TLBSTATE_LAZY); -#endif -} - -static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, - struct task_struct *tsk) -{ - unsigned cpu = smp_processor_id(); - if (likely(prev != next)) { - /* stop flush ipis for the previous mm */ - cpu_clear(cpu, prev->cpu_vm_mask); -#ifdef CONFIG_SMP - write_pda(mmu_state, TLBSTATE_OK); - write_pda(active_mm, next); -#endif - cpu_set(cpu, next->cpu_vm_mask); - load_cr3(next->pgd); - - if (unlikely(next->context.ldt != prev->context.ldt)) - load_LDT_nolock(&next->context); - } -#ifdef CONFIG_SMP - else { - write_pda(mmu_state, TLBSTATE_OK); - if (read_pda(active_mm) != next) - BUG(); - if (!cpu_test_and_set(cpu, next->cpu_vm_mask)) { - /* We were in lazy tlb mode and leave_mm disabled - * tlb flush IPI delivery. We must reload CR3 - * to make sure to use no freed page tables. - */ - load_cr3(next->pgd); - load_LDT_nolock(&next->context); - } - } -#endif -} - -#define deactivate_mm(tsk, mm) \ -do { \ - load_gs_index(0); \ - asm volatile("movl %0,%%fs"::"r"(0)); \ -} while (0) - -#endif /* _ASM_X86_MMU_CONTEXT_64_H */ Index: linux-2.6-tip/arch/x86/include/asm/mpspec.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mpspec.h +++ linux-2.6-tip/arch/x86/include/asm/mpspec.h @@ -9,7 +9,18 @@ extern int apic_version[MAX_APICS]; extern int pic_mode; #ifdef CONFIG_X86_32 -#include + +/* + * Summit or generic (i.e. installer) kernels need lots of bus entries. + * Maximum 256 PCI busses, plus 1 ISA bus in each of 4 cabinets. + */ +#if CONFIG_BASE_SMALL == 0 +# define MAX_MP_BUSSES 260 +#else +# define MAX_MP_BUSSES 32 +#endif + +#define MAX_IRQ_SOURCES 256 extern unsigned int def_to_bigsmp; extern u8 apicid_2_node[]; @@ -20,15 +31,15 @@ extern int mp_bus_id_to_local[MAX_MP_BUS extern int quad_local_to_mp_bus_id [NR_CPUS/4][4]; #endif -#define MAX_APICID 256 +#define MAX_APICID 256 -#else +#else /* CONFIG_X86_64: */ -#define MAX_MP_BUSSES 256 +#define MAX_MP_BUSSES 256 /* Each PCI slot may be a combo card with its own bus. 4 IRQ pins per slot. */ -#define MAX_IRQ_SOURCES (MAX_MP_BUSSES * 4) +#define MAX_IRQ_SOURCES (MAX_MP_BUSSES * 4) -#endif +#endif /* CONFIG_X86_64 */ extern void early_find_smp_config(void); extern void early_get_smp_config(void); @@ -45,11 +56,13 @@ extern int smp_found_config; extern int mpc_default_type; extern unsigned long mp_lapic_addr; -extern void find_smp_config(void); extern void get_smp_config(void); + #ifdef CONFIG_X86_MPPARSE +extern void find_smp_config(void); extern void early_reserve_e820_mpc_new(void); #else +static inline void find_smp_config(void) { } static inline void early_reserve_e820_mpc_new(void) { } #endif @@ -64,6 +77,8 @@ extern int acpi_probe_gsi(void); #ifdef CONFIG_X86_IO_APIC extern int mp_config_acpi_gsi(unsigned char number, unsigned int devfn, u8 pin, u32 gsi, int triggering, int polarity); +extern int mp_find_ioapic(int gsi); +extern int mp_find_ioapic_pin(int ioapic, int gsi); #else static inline int mp_config_acpi_gsi(unsigned char number, unsigned int devfn, u8 pin, @@ -148,4 +163,8 @@ static inline void physid_set_mask_of_ph extern physid_mask_t phys_cpu_present_map; +extern int generic_mps_oem_check(struct mpc_table *, char *, char *); + +extern int default_acpi_madt_oem_check(char *, char *); + #endif /* _ASM_X86_MPSPEC_H */ Index: linux-2.6-tip/arch/x86/include/asm/mpspec_def.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/mpspec_def.h +++ linux-2.6-tip/arch/x86/include/asm/mpspec_def.h @@ -24,17 +24,18 @@ # endif #endif -struct intel_mp_floating { - char mpf_signature[4]; /* "_MP_" */ - unsigned int mpf_physptr; /* Configuration table address */ - unsigned char mpf_length; /* Our length (paragraphs) */ - unsigned char mpf_specification;/* Specification version */ - unsigned char mpf_checksum; /* Checksum (makes sum 0) */ - unsigned char mpf_feature1; /* Standard or configuration ? */ - unsigned char mpf_feature2; /* Bit7 set for IMCR|PIC */ - unsigned char mpf_feature3; /* Unused (0) */ - unsigned char mpf_feature4; /* Unused (0) */ - unsigned char mpf_feature5; /* Unused (0) */ +/* Intel MP Floating Pointer Structure */ +struct mpf_intel { + char signature[4]; /* "_MP_" */ + unsigned int physptr; /* Configuration table address */ + unsigned char length; /* Our length (paragraphs) */ + unsigned char specification; /* Specification version */ + unsigned char checksum; /* Checksum (makes sum 0) */ + unsigned char feature1; /* Standard or configuration ? */ + unsigned char feature2; /* Bit7 set for IMCR|PIC */ + unsigned char feature3; /* Unused (0) */ + unsigned char feature4; /* Unused (0) */ + unsigned char feature5; /* Unused (0) */ }; #define MPC_SIGNATURE "PCMP" Index: linux-2.6-tip/arch/x86/include/asm/numaq.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/numaq.h +++ linux-2.6-tip/arch/x86/include/asm/numaq.h @@ -31,6 +31,8 @@ extern int found_numaq; extern int get_memcfg_numaq(void); +extern void *xquad_portio; + /* * SYS_CFG_DATA_PRIV_ADDR, struct eachquadmem, and struct sys_cfg_data are the */ Index: linux-2.6-tip/arch/x86/include/asm/numaq/apic.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/numaq/apic.h +++ /dev/null @@ -1,142 +0,0 @@ -#ifndef __ASM_NUMAQ_APIC_H -#define __ASM_NUMAQ_APIC_H - -#include -#include -#include - -#define APIC_DFR_VALUE (APIC_DFR_CLUSTER) - -static inline const cpumask_t *target_cpus(void) -{ - return &CPU_MASK_ALL; -} - -#define NO_BALANCE_IRQ (1) -#define esr_disable (1) - -#define INT_DELIVERY_MODE dest_LowestPrio -#define INT_DEST_MODE 0 /* physical delivery on LOCAL quad */ - -static inline unsigned long check_apicid_used(physid_mask_t bitmap, int apicid) -{ - return physid_isset(apicid, bitmap); -} -static inline unsigned long check_apicid_present(int bit) -{ - return physid_isset(bit, phys_cpu_present_map); -} -#define apicid_cluster(apicid) (apicid & 0xF0) - -static inline int apic_id_registered(void) -{ - return 1; -} - -static inline void init_apic_ldr(void) -{ - /* Already done in NUMA-Q firmware */ -} - -static inline void setup_apic_routing(void) -{ - printk("Enabling APIC mode: %s. Using %d I/O APICs\n", - "NUMA-Q", nr_ioapics); -} - -/* - * Skip adding the timer int on secondary nodes, which causes - * a small but painful rift in the time-space continuum. - */ -static inline int multi_timer_check(int apic, int irq) -{ - return apic != 0 && irq == 0; -} - -static inline physid_mask_t ioapic_phys_id_map(physid_mask_t phys_map) -{ - /* We don't have a good way to do this yet - hack */ - return physids_promote(0xFUL); -} - -/* Mapping from cpu number to logical apicid */ -extern u8 cpu_2_logical_apicid[]; -static inline int cpu_to_logical_apicid(int cpu) -{ - if (cpu >= nr_cpu_ids) - return BAD_APICID; - return (int)cpu_2_logical_apicid[cpu]; -} - -/* - * Supporting over 60 cpus on NUMA-Q requires a locality-dependent - * cpu to APIC ID relation to properly interact with the intelligent - * mode of the cluster controller. - */ -static inline int cpu_present_to_apicid(int mps_cpu) -{ - if (mps_cpu < 60) - return ((mps_cpu >> 2) << 4) | (1 << (mps_cpu & 0x3)); - else - return BAD_APICID; -} - -static inline int apicid_to_node(int logical_apicid) -{ - return logical_apicid >> 4; -} - -static inline physid_mask_t apicid_to_cpu_present(int logical_apicid) -{ - int node = apicid_to_node(logical_apicid); - int cpu = __ffs(logical_apicid & 0xf); - - return physid_mask_of_physid(cpu + 4*node); -} - -extern void *xquad_portio; - -static inline void setup_portio_remap(void) -{ - int num_quads = num_online_nodes(); - - if (num_quads <= 1) - return; - - printk("Remapping cross-quad port I/O for %d quads\n", num_quads); - xquad_portio = ioremap(XQUAD_PORTIO_BASE, num_quads*XQUAD_PORTIO_QUAD); - printk("xquad_portio vaddr 0x%08lx, len %08lx\n", - (u_long) xquad_portio, (u_long) num_quads*XQUAD_PORTIO_QUAD); -} - -static inline int check_phys_apicid_present(int boot_cpu_physical_apicid) -{ - return (1); -} - -static inline void enable_apic_mode(void) -{ -} - -/* - * We use physical apicids here, not logical, so just return the default - * physical broadcast to stop people from breaking us - */ -static inline unsigned int cpu_mask_to_apicid(const cpumask_t *cpumask) -{ - return (int) 0xF; -} - -static inline unsigned int cpu_mask_to_apicid_and(const struct cpumask *cpumask, - const struct cpumask *andmask) -{ - return (int) 0xF; -} - -/* No NUMA-Q box has a HT CPU, but it can't hurt to use the default code. */ -static inline u32 phys_pkg_id(u32 cpuid_apic, int index_msb) -{ - return cpuid_apic >> index_msb; -} - -#endif /* __ASM_NUMAQ_APIC_H */ Index: linux-2.6-tip/arch/x86/include/asm/numaq/apicdef.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/numaq/apicdef.h +++ /dev/null @@ -1,14 +0,0 @@ -#ifndef __ASM_NUMAQ_APICDEF_H -#define __ASM_NUMAQ_APICDEF_H - - -#define APIC_ID_MASK (0xF<<24) - -static inline unsigned get_apic_id(unsigned long x) -{ - return (((x)>>24)&0x0F); -} - -#define GET_APIC_ID(x) get_apic_id(x) - -#endif Index: linux-2.6-tip/arch/x86/include/asm/numaq/ipi.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/numaq/ipi.h +++ /dev/null @@ -1,22 +0,0 @@ -#ifndef __ASM_NUMAQ_IPI_H -#define __ASM_NUMAQ_IPI_H - -void send_IPI_mask_sequence(const struct cpumask *mask, int vector); -void send_IPI_mask_allbutself(const struct cpumask *mask, int vector); - -static inline void send_IPI_mask(const struct cpumask *mask, int vector) -{ - send_IPI_mask_sequence(mask, vector); -} - -static inline void send_IPI_allbutself(int vector) -{ - send_IPI_mask_allbutself(cpu_online_mask, vector); -} - -static inline void send_IPI_all(int vector) -{ - send_IPI_mask(cpu_online_mask, vector); -} - -#endif /* __ASM_NUMAQ_IPI_H */ Index: linux-2.6-tip/arch/x86/include/asm/numaq/mpparse.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/numaq/mpparse.h +++ /dev/null @@ -1,6 +0,0 @@ -#ifndef __ASM_NUMAQ_MPPARSE_H -#define __ASM_NUMAQ_MPPARSE_H - -extern void numaq_mps_oem_check(struct mpc_table *, char *, char *); - -#endif /* __ASM_NUMAQ_MPPARSE_H */ Index: linux-2.6-tip/arch/x86/include/asm/numaq/wakecpu.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/numaq/wakecpu.h +++ /dev/null @@ -1,45 +0,0 @@ -#ifndef __ASM_NUMAQ_WAKECPU_H -#define __ASM_NUMAQ_WAKECPU_H - -/* This file copes with machines that wakeup secondary CPUs by NMIs */ - -#define TRAMPOLINE_PHYS_LOW (0x8) -#define TRAMPOLINE_PHYS_HIGH (0xa) - -/* We don't do anything here because we use NMI's to boot instead */ -static inline void wait_for_init_deassert(atomic_t *deassert) -{ -} - -/* - * Because we use NMIs rather than the INIT-STARTUP sequence to - * bootstrap the CPUs, the APIC may be in a weird state. Kick it. - */ -static inline void smp_callin_clear_local_apic(void) -{ - clear_local_APIC(); -} - -static inline void store_NMI_vector(unsigned short *high, unsigned short *low) -{ - printk("Storing NMI vector\n"); - *high = - *((volatile unsigned short *)phys_to_virt(TRAMPOLINE_PHYS_HIGH)); - *low = - *((volatile unsigned short *)phys_to_virt(TRAMPOLINE_PHYS_LOW)); -} - -static inline void restore_NMI_vector(unsigned short *high, unsigned short *low) -{ - printk("Restoring NMI vector\n"); - *((volatile unsigned short *)phys_to_virt(TRAMPOLINE_PHYS_HIGH)) = - *high; - *((volatile unsigned short *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = - *low; -} - -static inline void inquire_remote_apic(int apicid) -{ -} - -#endif /* __ASM_NUMAQ_WAKECPU_H */ Index: linux-2.6-tip/arch/x86/include/asm/page.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/page.h +++ linux-2.6-tip/arch/x86/include/asm/page.h @@ -1,42 +1,11 @@ #ifndef _ASM_X86_PAGE_H #define _ASM_X86_PAGE_H -#include - -/* PAGE_SHIFT determines the page size */ -#define PAGE_SHIFT 12 -#define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT) -#define PAGE_MASK (~(PAGE_SIZE-1)) +#include #ifdef __KERNEL__ -#define __PHYSICAL_MASK ((phys_addr_t)(1ULL << __PHYSICAL_MASK_SHIFT) - 1) -#define __VIRTUAL_MASK ((1UL << __VIRTUAL_MASK_SHIFT) - 1) - -/* Cast PAGE_MASK to a signed type so that it is sign-extended if - virtual addresses are 32-bits but physical addresses are larger - (ie, 32-bit PAE). */ -#define PHYSICAL_PAGE_MASK (((signed long)PAGE_MASK) & __PHYSICAL_MASK) - -/* PTE_PFN_MASK extracts the PFN from a (pte|pmd|pud|pgd)val_t */ -#define PTE_PFN_MASK ((pteval_t)PHYSICAL_PAGE_MASK) - -/* PTE_FLAGS_MASK extracts the flags from a (pte|pmd|pud|pgd)val_t */ -#define PTE_FLAGS_MASK (~PTE_PFN_MASK) - -#define PMD_PAGE_SIZE (_AC(1, UL) << PMD_SHIFT) -#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1)) - -#define HPAGE_SHIFT PMD_SHIFT -#define HPAGE_SIZE (_AC(1,UL) << HPAGE_SHIFT) -#define HPAGE_MASK (~(HPAGE_SIZE - 1)) -#define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT) - -#define HUGE_MAX_HSTATE 2 - -#ifndef __ASSEMBLY__ -#include -#endif +#include #ifdef CONFIG_X86_64 #include @@ -44,38 +13,18 @@ #include #endif /* CONFIG_X86_64 */ -#define PAGE_OFFSET ((unsigned long)__PAGE_OFFSET) - -#define VM_DATA_DEFAULT_FLAGS \ - (((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0 ) | \ - VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) - - #ifndef __ASSEMBLY__ -typedef struct { pgdval_t pgd; } pgd_t; -typedef struct { pgprotval_t pgprot; } pgprot_t; - -extern int page_is_ram(unsigned long pagenr); -extern int devmem_is_allowed(unsigned long pagenr); -extern void map_devmem(unsigned long pfn, unsigned long size, - pgprot_t vma_prot); -extern void unmap_devmem(unsigned long pfn, unsigned long size, - pgprot_t vma_prot); - -extern unsigned long max_low_pfn_mapped; -extern unsigned long max_pfn_mapped; - struct page; static inline void clear_user_page(void *page, unsigned long vaddr, - struct page *pg) + struct page *pg) { clear_page(page); } static inline void copy_user_page(void *to, void *from, unsigned long vaddr, - struct page *topage) + struct page *topage) { copy_page(to, from); } @@ -84,99 +33,6 @@ static inline void copy_user_page(void * alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr) #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE -static inline pgd_t native_make_pgd(pgdval_t val) -{ - return (pgd_t) { val }; -} - -static inline pgdval_t native_pgd_val(pgd_t pgd) -{ - return pgd.pgd; -} - -#if PAGETABLE_LEVELS >= 3 -#if PAGETABLE_LEVELS == 4 -typedef struct { pudval_t pud; } pud_t; - -static inline pud_t native_make_pud(pmdval_t val) -{ - return (pud_t) { val }; -} - -static inline pudval_t native_pud_val(pud_t pud) -{ - return pud.pud; -} -#else /* PAGETABLE_LEVELS == 3 */ -#include - -static inline pudval_t native_pud_val(pud_t pud) -{ - return native_pgd_val(pud.pgd); -} -#endif /* PAGETABLE_LEVELS == 4 */ - -typedef struct { pmdval_t pmd; } pmd_t; - -static inline pmd_t native_make_pmd(pmdval_t val) -{ - return (pmd_t) { val }; -} - -static inline pmdval_t native_pmd_val(pmd_t pmd) -{ - return pmd.pmd; -} -#else /* PAGETABLE_LEVELS == 2 */ -#include - -static inline pmdval_t native_pmd_val(pmd_t pmd) -{ - return native_pgd_val(pmd.pud.pgd); -} -#endif /* PAGETABLE_LEVELS >= 3 */ - -static inline pte_t native_make_pte(pteval_t val) -{ - return (pte_t) { .pte = val }; -} - -static inline pteval_t native_pte_val(pte_t pte) -{ - return pte.pte; -} - -static inline pteval_t native_pte_flags(pte_t pte) -{ - return native_pte_val(pte) & PTE_FLAGS_MASK; -} - -#define pgprot_val(x) ((x).pgprot) -#define __pgprot(x) ((pgprot_t) { (x) } ) - -#ifdef CONFIG_PARAVIRT -#include -#else /* !CONFIG_PARAVIRT */ - -#define pgd_val(x) native_pgd_val(x) -#define __pgd(x) native_make_pgd(x) - -#ifndef __PAGETABLE_PUD_FOLDED -#define pud_val(x) native_pud_val(x) -#define __pud(x) native_make_pud(x) -#endif - -#ifndef __PAGETABLE_PMD_FOLDED -#define pmd_val(x) native_pmd_val(x) -#define __pmd(x) native_make_pmd(x) -#endif - -#define pte_val(x) native_pte_val(x) -#define pte_flags(x) native_pte_flags(x) -#define __pte(x) native_make_pte(x) - -#endif /* CONFIG_PARAVIRT */ - #define __pa(x) __phys_addr((unsigned long)(x)) #define __pa_nodebug(x) __phys_addr_nodebug((unsigned long)(x)) /* __pa_symbol should be used for C visible symbols. Index: linux-2.6-tip/arch/x86/include/asm/page_32.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/page_32.h +++ linux-2.6-tip/arch/x86/include/asm/page_32.h @@ -1,82 +1,14 @@ #ifndef _ASM_X86_PAGE_32_H #define _ASM_X86_PAGE_32_H -/* - * This handles the memory map. - * - * A __PAGE_OFFSET of 0xC0000000 means that the kernel has - * a virtual address space of one gigabyte, which limits the - * amount of physical memory you can use to about 950MB. - * - * If you want more physical memory than this then see the CONFIG_HIGHMEM4G - * and CONFIG_HIGHMEM64G options in the kernel configuration. - */ -#define __PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL) - -#ifdef CONFIG_4KSTACKS -#define THREAD_ORDER 0 -#else -#define THREAD_ORDER 1 -#endif -#define THREAD_SIZE (PAGE_SIZE << THREAD_ORDER) - -#define STACKFAULT_STACK 0 -#define DOUBLEFAULT_STACK 1 -#define NMI_STACK 0 -#define DEBUG_STACK 0 -#define MCE_STACK 0 -#define N_EXCEPTION_STACKS 1 - -#ifdef CONFIG_X86_PAE -/* 44=32+12, the limit we can fit into an unsigned long pfn */ -#define __PHYSICAL_MASK_SHIFT 44 -#define __VIRTUAL_MASK_SHIFT 32 -#define PAGETABLE_LEVELS 3 - -#ifndef __ASSEMBLY__ -typedef u64 pteval_t; -typedef u64 pmdval_t; -typedef u64 pudval_t; -typedef u64 pgdval_t; -typedef u64 pgprotval_t; - -typedef union { - struct { - unsigned long pte_low, pte_high; - }; - pteval_t pte; -} pte_t; -#endif /* __ASSEMBLY__ - */ -#else /* !CONFIG_X86_PAE */ -#define __PHYSICAL_MASK_SHIFT 32 -#define __VIRTUAL_MASK_SHIFT 32 -#define PAGETABLE_LEVELS 2 - -#ifndef __ASSEMBLY__ -typedef unsigned long pteval_t; -typedef unsigned long pmdval_t; -typedef unsigned long pudval_t; -typedef unsigned long pgdval_t; -typedef unsigned long pgprotval_t; - -typedef union { - pteval_t pte; - pteval_t pte_low; -} pte_t; - -#endif /* __ASSEMBLY__ */ -#endif /* CONFIG_X86_PAE */ +#include #ifndef __ASSEMBLY__ -typedef struct page *pgtable_t; -#endif #ifdef CONFIG_HUGETLB_PAGE #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA #endif -#ifndef __ASSEMBLY__ #define __phys_addr_nodebug(x) ((x) - PAGE_OFFSET) #ifdef CONFIG_DEBUG_VIRTUAL extern unsigned long __phys_addr(unsigned long); @@ -89,23 +21,6 @@ extern unsigned long __phys_addr(unsigne #define pfn_valid(pfn) ((pfn) < max_mapnr) #endif /* CONFIG_FLATMEM */ -extern int nx_enabled; - -/* - * This much address space is reserved for vmalloc() and iomap() - * as well as fixmap mappings. - */ -extern unsigned int __VMALLOC_RESERVE; -extern int sysctl_legacy_va_layout; - -extern void find_low_pfn_range(void); -extern unsigned long init_memory_mapping(unsigned long start, - unsigned long end); -extern void initmem_init(unsigned long, unsigned long); -extern void free_initmem(void); -extern void setup_bootmem_allocator(void); - - #ifdef CONFIG_X86_USE_3DNOW #include Index: linux-2.6-tip/arch/x86/include/asm/page_32_types.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/page_32_types.h @@ -0,0 +1,60 @@ +#ifndef _ASM_X86_PAGE_32_DEFS_H +#define _ASM_X86_PAGE_32_DEFS_H + +#include + +/* + * This handles the memory map. + * + * A __PAGE_OFFSET of 0xC0000000 means that the kernel has + * a virtual address space of one gigabyte, which limits the + * amount of physical memory you can use to about 950MB. + * + * If you want more physical memory than this then see the CONFIG_HIGHMEM4G + * and CONFIG_HIGHMEM64G options in the kernel configuration. + */ +#define __PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL) + +#ifdef CONFIG_4KSTACKS +#define THREAD_ORDER 0 +#else +#define THREAD_ORDER 1 +#endif +#define THREAD_SIZE (PAGE_SIZE << THREAD_ORDER) + +#define STACKFAULT_STACK 0 +#define DOUBLEFAULT_STACK 1 +#define NMI_STACK 0 +#define DEBUG_STACK 0 +#define MCE_STACK 0 +#define N_EXCEPTION_STACKS 1 + +#ifdef CONFIG_X86_PAE +/* 44=32+12, the limit we can fit into an unsigned long pfn */ +#define __PHYSICAL_MASK_SHIFT 44 +#define __VIRTUAL_MASK_SHIFT 32 + +#else /* !CONFIG_X86_PAE */ +#define __PHYSICAL_MASK_SHIFT 32 +#define __VIRTUAL_MASK_SHIFT 32 +#endif /* CONFIG_X86_PAE */ + +#ifndef __ASSEMBLY__ + +/* + * This much address space is reserved for vmalloc() and iomap() + * as well as fixmap mappings. + */ +extern unsigned int __VMALLOC_RESERVE; +extern int sysctl_legacy_va_layout; + +extern void find_low_pfn_range(void); +extern unsigned long init_memory_mapping(unsigned long start, + unsigned long end); +extern void initmem_init(unsigned long, unsigned long); +extern void free_initmem(void); +extern void setup_bootmem_allocator(void); + +#endif /* !__ASSEMBLY__ */ + +#endif /* _ASM_X86_PAGE_32_DEFS_H */ Index: linux-2.6-tip/arch/x86/include/asm/page_64.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/page_64.h +++ linux-2.6-tip/arch/x86/include/asm/page_64.h @@ -1,105 +1,6 @@ #ifndef _ASM_X86_PAGE_64_H #define _ASM_X86_PAGE_64_H -#define PAGETABLE_LEVELS 4 - -#define THREAD_ORDER 1 -#define THREAD_SIZE (PAGE_SIZE << THREAD_ORDER) -#define CURRENT_MASK (~(THREAD_SIZE - 1)) - -#define EXCEPTION_STACK_ORDER 0 -#define EXCEPTION_STKSZ (PAGE_SIZE << EXCEPTION_STACK_ORDER) - -#define DEBUG_STACK_ORDER (EXCEPTION_STACK_ORDER + 1) -#define DEBUG_STKSZ (PAGE_SIZE << DEBUG_STACK_ORDER) - -#define IRQSTACK_ORDER 2 -#define IRQSTACKSIZE (PAGE_SIZE << IRQSTACK_ORDER) - -#define STACKFAULT_STACK 1 -#define DOUBLEFAULT_STACK 2 -#define NMI_STACK 3 -#define DEBUG_STACK 4 -#define MCE_STACK 5 -#define N_EXCEPTION_STACKS 5 /* hw limit: 7 */ - -#define PUD_PAGE_SIZE (_AC(1, UL) << PUD_SHIFT) -#define PUD_PAGE_MASK (~(PUD_PAGE_SIZE-1)) - -/* - * Set __PAGE_OFFSET to the most negative possible address + - * PGDIR_SIZE*16 (pgd slot 272). The gap is to allow a space for a - * hypervisor to fit. Choosing 16 slots here is arbitrary, but it's - * what Xen requires. - */ -#define __PAGE_OFFSET _AC(0xffff880000000000, UL) - -#define __PHYSICAL_START CONFIG_PHYSICAL_START -#define __KERNEL_ALIGN 0x200000 - -/* - * Make sure kernel is aligned to 2MB address. Catching it at compile - * time is better. Change your config file and compile the kernel - * for a 2MB aligned address (CONFIG_PHYSICAL_START) - */ -#if (CONFIG_PHYSICAL_START % __KERNEL_ALIGN) != 0 -#error "CONFIG_PHYSICAL_START must be a multiple of 2MB" -#endif - -#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START) -#define __START_KERNEL_map _AC(0xffffffff80000000, UL) - -/* See Documentation/x86_64/mm.txt for a description of the memory map. */ -#define __PHYSICAL_MASK_SHIFT 46 -#define __VIRTUAL_MASK_SHIFT 48 - -/* - * Kernel image size is limited to 512 MB (see level2_kernel_pgt in - * arch/x86/kernel/head_64.S), and it is mapped here: - */ -#define KERNEL_IMAGE_SIZE (512 * 1024 * 1024) -#define KERNEL_IMAGE_START _AC(0xffffffff80000000, UL) - -#ifndef __ASSEMBLY__ -void clear_page(void *page); -void copy_page(void *to, void *from); - -/* duplicated to the one in bootmem.h */ -extern unsigned long max_pfn; -extern unsigned long phys_base; - -extern unsigned long __phys_addr(unsigned long); -#define __phys_reloc_hide(x) (x) - -/* - * These are used to make use of C type-checking.. - */ -typedef unsigned long pteval_t; -typedef unsigned long pmdval_t; -typedef unsigned long pudval_t; -typedef unsigned long pgdval_t; -typedef unsigned long pgprotval_t; - -typedef struct page *pgtable_t; - -typedef struct { pteval_t pte; } pte_t; - -#define vmemmap ((struct page *)VMEMMAP_START) - -extern unsigned long init_memory_mapping(unsigned long start, - unsigned long end); - -extern void initmem_init(unsigned long start_pfn, unsigned long end_pfn); -extern void free_initmem(void); - -extern void init_extra_mapping_uc(unsigned long phys, unsigned long size); -extern void init_extra_mapping_wb(unsigned long phys, unsigned long size); - -#endif /* !__ASSEMBLY__ */ - -#ifdef CONFIG_FLATMEM -#define pfn_valid(pfn) ((pfn) < max_pfn) -#endif - +#include #endif /* _ASM_X86_PAGE_64_H */ Index: linux-2.6-tip/arch/x86/include/asm/page_64_types.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/page_64_types.h @@ -0,0 +1,98 @@ +#ifndef _ASM_X86_PAGE_64_DEFS_H +#define _ASM_X86_PAGE_64_DEFS_H + +#define THREAD_ORDER 1 +#define THREAD_SIZE (PAGE_SIZE << THREAD_ORDER) +#define CURRENT_MASK (~(THREAD_SIZE - 1)) + +#define EXCEPTION_STACK_ORDER 0 +#define EXCEPTION_STKSZ (PAGE_SIZE << EXCEPTION_STACK_ORDER) + +#define DEBUG_STACK_ORDER (EXCEPTION_STACK_ORDER + 1) +#define DEBUG_STKSZ (PAGE_SIZE << DEBUG_STACK_ORDER) + +#define IRQ_STACK_ORDER 2 +#define IRQ_STACK_SIZE (PAGE_SIZE << IRQ_STACK_ORDER) + +#ifdef CONFIG_PREEMPT_RT +# define STACKFAULT_STACK 0 +# define DOUBLEFAULT_STACK 1 +# define NMI_STACK 2 +# define DEBUG_STACK 0 +# define MCE_STACK 3 +# define N_EXCEPTION_STACKS 3 /* hw limit: 7 */ +#else +# define STACKFAULT_STACK 1 +# define DOUBLEFAULT_STACK 2 +# define NMI_STACK 3 +# define DEBUG_STACK 4 +# define MCE_STACK 5 +# define N_EXCEPTION_STACKS 5 /* hw limit: 7 */ +#endif + +#define PUD_PAGE_SIZE (_AC(1, UL) << PUD_SHIFT) +#define PUD_PAGE_MASK (~(PUD_PAGE_SIZE-1)) + +/* + * Set __PAGE_OFFSET to the most negative possible address + + * PGDIR_SIZE*16 (pgd slot 272). The gap is to allow a space for a + * hypervisor to fit. Choosing 16 slots here is arbitrary, but it's + * what Xen requires. + */ +#define __PAGE_OFFSET _AC(0xffff880000000000, UL) + +#define __PHYSICAL_START CONFIG_PHYSICAL_START +#define __KERNEL_ALIGN 0x200000 + +/* + * Make sure kernel is aligned to 2MB address. Catching it at compile + * time is better. Change your config file and compile the kernel + * for a 2MB aligned address (CONFIG_PHYSICAL_START) + */ +#if (CONFIG_PHYSICAL_START % __KERNEL_ALIGN) != 0 +#error "CONFIG_PHYSICAL_START must be a multiple of 2MB" +#endif + +#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START) +#define __START_KERNEL_map _AC(0xffffffff80000000, UL) + +/* See Documentation/x86_64/mm.txt for a description of the memory map. */ +#define __PHYSICAL_MASK_SHIFT 46 +#define __VIRTUAL_MASK_SHIFT 48 + +/* + * Kernel image size is limited to 512 MB (see level2_kernel_pgt in + * arch/x86/kernel/head_64.S), and it is mapped here: + */ +#define KERNEL_IMAGE_SIZE (512 * 1024 * 1024) +#define KERNEL_IMAGE_START _AC(0xffffffff80000000, UL) + +#ifndef __ASSEMBLY__ +void clear_page(void *page); +void copy_page(void *to, void *from); + +/* duplicated to the one in bootmem.h */ +extern unsigned long max_pfn; +extern unsigned long phys_base; + +extern unsigned long __phys_addr(unsigned long); +#define __phys_reloc_hide(x) (x) + +#define vmemmap ((struct page *)VMEMMAP_START) + +extern unsigned long init_memory_mapping(unsigned long start, + unsigned long end); + +extern void initmem_init(unsigned long start_pfn, unsigned long end_pfn); +extern void free_initmem(void); + +extern void init_extra_mapping_uc(unsigned long phys, unsigned long size); +extern void init_extra_mapping_wb(unsigned long phys, unsigned long size); + +#endif /* !__ASSEMBLY__ */ + +#ifdef CONFIG_FLATMEM +#define pfn_valid(pfn) ((pfn) < max_pfn) +#endif + +#endif /* _ASM_X86_PAGE_64_DEFS_H */ Index: linux-2.6-tip/arch/x86/include/asm/page_types.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/page_types.h @@ -0,0 +1,57 @@ +#ifndef _ASM_X86_PAGE_DEFS_H +#define _ASM_X86_PAGE_DEFS_H + +#include + +/* PAGE_SHIFT determines the page size */ +#define PAGE_SHIFT 12 +#define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT) +#define PAGE_MASK (~(PAGE_SIZE-1)) + +#define __PHYSICAL_MASK ((phys_addr_t)(1ULL << __PHYSICAL_MASK_SHIFT) - 1) +#define __VIRTUAL_MASK ((1UL << __VIRTUAL_MASK_SHIFT) - 1) + +/* Cast PAGE_MASK to a signed type so that it is sign-extended if + virtual addresses are 32-bits but physical addresses are larger + (ie, 32-bit PAE). */ +#define PHYSICAL_PAGE_MASK (((signed long)PAGE_MASK) & __PHYSICAL_MASK) + +#define PMD_PAGE_SIZE (_AC(1, UL) << PMD_SHIFT) +#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1)) + +#define HPAGE_SHIFT PMD_SHIFT +#define HPAGE_SIZE (_AC(1,UL) << HPAGE_SHIFT) +#define HPAGE_MASK (~(HPAGE_SIZE - 1)) +#define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT) + +#define HUGE_MAX_HSTATE 2 + +#define PAGE_OFFSET ((unsigned long)__PAGE_OFFSET) + +#define VM_DATA_DEFAULT_FLAGS \ + (((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0 ) | \ + VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) + +#ifdef CONFIG_X86_64 +#include +#else +#include +#endif /* CONFIG_X86_64 */ + +#ifndef __ASSEMBLY__ + +struct pgprot; + +extern int page_is_ram(unsigned long pagenr); +extern int devmem_is_allowed(unsigned long pagenr); +extern void map_devmem(unsigned long pfn, unsigned long size, + struct pgprot vma_prot); +extern void unmap_devmem(unsigned long pfn, unsigned long size, + struct pgprot vma_prot); + +extern unsigned long max_low_pfn_mapped; +extern unsigned long max_pfn_mapped; + +#endif /* !__ASSEMBLY__ */ + +#endif /* _ASM_X86_PAGE_DEFS_H */ Index: linux-2.6-tip/arch/x86/include/asm/paravirt.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/paravirt.h +++ linux-2.6-tip/arch/x86/include/asm/paravirt.h @@ -4,7 +4,7 @@ * para-virtualization: those hooks are defined here. */ #ifdef CONFIG_PARAVIRT -#include +#include #include /* Bitmask of what can be clobbered: usually at least eax. */ @@ -12,21 +12,38 @@ #define CLBR_EAX (1 << 0) #define CLBR_ECX (1 << 1) #define CLBR_EDX (1 << 2) +#define CLBR_EDI (1 << 3) -#ifdef CONFIG_X86_64 -#define CLBR_RSI (1 << 3) -#define CLBR_RDI (1 << 4) +#ifdef CONFIG_X86_32 +/* CLBR_ANY should match all regs platform has. For i386, that's just it */ +#define CLBR_ANY ((1 << 4) - 1) + +#define CLBR_ARG_REGS (CLBR_EAX | CLBR_EDX | CLBR_ECX) +#define CLBR_RET_REG (CLBR_EAX | CLBR_EDX) +#define CLBR_SCRATCH (0) +#else +#define CLBR_RAX CLBR_EAX +#define CLBR_RCX CLBR_ECX +#define CLBR_RDX CLBR_EDX +#define CLBR_RDI CLBR_EDI +#define CLBR_RSI (1 << 4) #define CLBR_R8 (1 << 5) #define CLBR_R9 (1 << 6) #define CLBR_R10 (1 << 7) #define CLBR_R11 (1 << 8) + #define CLBR_ANY ((1 << 9) - 1) + +#define CLBR_ARG_REGS (CLBR_RDI | CLBR_RSI | CLBR_RDX | \ + CLBR_RCX | CLBR_R8 | CLBR_R9) +#define CLBR_RET_REG (CLBR_RAX) +#define CLBR_SCRATCH (CLBR_R10 | CLBR_R11) + #include -#else -/* CLBR_ANY should match all regs platform has. For i386, that's just it */ -#define CLBR_ANY ((1 << 3) - 1) #endif /* X86_64 */ +#define CLBR_CALLEE_SAVE ((CLBR_ARG_REGS | CLBR_SCRATCH) & ~CLBR_RET_REG) + #ifndef __ASSEMBLY__ #include #include @@ -40,6 +57,14 @@ struct tss_struct; struct mm_struct; struct desc_struct; +/* + * Wrapper type for pointers to code which uses the non-standard + * calling convention. See PV_CALL_SAVE_REGS_THUNK below. + */ +struct paravirt_callee_save { + void *func; +}; + /* general info */ struct pv_info { unsigned int kernel_rpl; @@ -189,11 +214,15 @@ struct pv_irq_ops { * expected to use X86_EFLAGS_IF; all other bits * returned from save_fl are undefined, and may be ignored by * restore_fl. + * + * NOTE: These functions callers expect the callee to preserve + * more registers than the standard C calling convention. */ - unsigned long (*save_fl)(void); - void (*restore_fl)(unsigned long); - void (*irq_disable)(void); - void (*irq_enable)(void); + struct paravirt_callee_save save_fl; + struct paravirt_callee_save restore_fl; + struct paravirt_callee_save irq_disable; + struct paravirt_callee_save irq_enable; + void (*safe_halt)(void); void (*halt)(void); @@ -244,7 +273,8 @@ struct pv_mmu_ops { void (*flush_tlb_user)(void); void (*flush_tlb_kernel)(void); void (*flush_tlb_single)(unsigned long addr); - void (*flush_tlb_others)(const cpumask_t *cpus, struct mm_struct *mm, + void (*flush_tlb_others)(const struct cpumask *cpus, + struct mm_struct *mm, unsigned long va); /* Hooks for allocating and freeing a pagetable top-level */ @@ -278,12 +308,11 @@ struct pv_mmu_ops { void (*ptep_modify_prot_commit)(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte); - pteval_t (*pte_val)(pte_t); - pteval_t (*pte_flags)(pte_t); - pte_t (*make_pte)(pteval_t pte); + struct paravirt_callee_save pte_val; + struct paravirt_callee_save make_pte; - pgdval_t (*pgd_val)(pgd_t); - pgd_t (*make_pgd)(pgdval_t pgd); + struct paravirt_callee_save pgd_val; + struct paravirt_callee_save make_pgd; #if PAGETABLE_LEVELS >= 3 #ifdef CONFIG_X86_PAE @@ -298,12 +327,12 @@ struct pv_mmu_ops { void (*set_pud)(pud_t *pudp, pud_t pudval); - pmdval_t (*pmd_val)(pmd_t); - pmd_t (*make_pmd)(pmdval_t pmd); + struct paravirt_callee_save pmd_val; + struct paravirt_callee_save make_pmd; #if PAGETABLE_LEVELS == 4 - pudval_t (*pud_val)(pud_t); - pud_t (*make_pud)(pudval_t pud); + struct paravirt_callee_save pud_val; + struct paravirt_callee_save make_pud; void (*set_pgd)(pgd_t *pudp, pgd_t pgdval); #endif /* PAGETABLE_LEVELS == 4 */ @@ -311,6 +340,7 @@ struct pv_mmu_ops { #ifdef CONFIG_HIGHPTE void *(*kmap_atomic_pte)(struct page *page, enum km_type type); + void *(*kmap_atomic_pte_direct)(struct page *page, enum km_type type); #endif struct pv_lazy_ops lazy_mode; @@ -388,6 +418,8 @@ extern struct pv_lock_ops pv_lock_ops; asm("start_" #ops "_" #name ": " code "; end_" #ops "_" #name ":") unsigned paravirt_patch_nop(void); +unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len); +unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len); unsigned paravirt_patch_ignore(unsigned len); unsigned paravirt_patch_call(void *insnbuf, const void *target, u16 tgt_clobbers, @@ -479,25 +511,45 @@ int paravirt_disable_iospace(void); * makes sure the incoming and outgoing types are always correct. */ #ifdef CONFIG_X86_32 -#define PVOP_VCALL_ARGS unsigned long __eax, __edx, __ecx +#define PVOP_VCALL_ARGS \ + unsigned long __eax = __eax, __edx = __edx, __ecx = __ecx #define PVOP_CALL_ARGS PVOP_VCALL_ARGS + +#define PVOP_CALL_ARG1(x) "a" ((unsigned long)(x)) +#define PVOP_CALL_ARG2(x) "d" ((unsigned long)(x)) +#define PVOP_CALL_ARG3(x) "c" ((unsigned long)(x)) + #define PVOP_VCALL_CLOBBERS "=a" (__eax), "=d" (__edx), \ "=c" (__ecx) #define PVOP_CALL_CLOBBERS PVOP_VCALL_CLOBBERS + +#define PVOP_VCALLEE_CLOBBERS "=a" (__eax), "=d" (__edx) +#define PVOP_CALLEE_CLOBBERS PVOP_VCALLEE_CLOBBERS + #define EXTRA_CLOBBERS #define VEXTRA_CLOBBERS -#else -#define PVOP_VCALL_ARGS unsigned long __edi, __esi, __edx, __ecx +#else /* CONFIG_X86_64 */ +#define PVOP_VCALL_ARGS \ + unsigned long __edi = __edi, __esi = __esi, \ + __edx = __edx, __ecx = __ecx #define PVOP_CALL_ARGS PVOP_VCALL_ARGS, __eax + +#define PVOP_CALL_ARG1(x) "D" ((unsigned long)(x)) +#define PVOP_CALL_ARG2(x) "S" ((unsigned long)(x)) +#define PVOP_CALL_ARG3(x) "d" ((unsigned long)(x)) +#define PVOP_CALL_ARG4(x) "c" ((unsigned long)(x)) + #define PVOP_VCALL_CLOBBERS "=D" (__edi), \ "=S" (__esi), "=d" (__edx), \ "=c" (__ecx) - #define PVOP_CALL_CLOBBERS PVOP_VCALL_CLOBBERS, "=a" (__eax) +#define PVOP_VCALLEE_CLOBBERS "=a" (__eax) +#define PVOP_CALLEE_CLOBBERS PVOP_VCALLEE_CLOBBERS + #define EXTRA_CLOBBERS , "r8", "r9", "r10", "r11" #define VEXTRA_CLOBBERS , "rax", "r8", "r9", "r10", "r11" -#endif +#endif /* CONFIG_X86_32 */ #ifdef CONFIG_PARAVIRT_DEBUG #define PVOP_TEST_NULL(op) BUG_ON(op == NULL) @@ -505,10 +557,11 @@ int paravirt_disable_iospace(void); #define PVOP_TEST_NULL(op) ((void)op) #endif -#define __PVOP_CALL(rettype, op, pre, post, ...) \ +#define ____PVOP_CALL(rettype, op, clbr, call_clbr, extra_clbr, \ + pre, post, ...) \ ({ \ rettype __ret; \ - PVOP_CALL_ARGS; \ + PVOP_CALL_ARGS; \ PVOP_TEST_NULL(op); \ /* This is 32-bit specific, but is okay in 64-bit */ \ /* since this condition will never hold */ \ @@ -516,70 +569,113 @@ int paravirt_disable_iospace(void); asm volatile(pre \ paravirt_alt(PARAVIRT_CALL) \ post \ - : PVOP_CALL_CLOBBERS \ + : call_clbr \ : paravirt_type(op), \ - paravirt_clobber(CLBR_ANY), \ + paravirt_clobber(clbr), \ ##__VA_ARGS__ \ - : "memory", "cc" EXTRA_CLOBBERS); \ + : "memory", "cc" extra_clbr); \ __ret = (rettype)((((u64)__edx) << 32) | __eax); \ } else { \ asm volatile(pre \ paravirt_alt(PARAVIRT_CALL) \ post \ - : PVOP_CALL_CLOBBERS \ + : call_clbr \ : paravirt_type(op), \ - paravirt_clobber(CLBR_ANY), \ + paravirt_clobber(clbr), \ ##__VA_ARGS__ \ - : "memory", "cc" EXTRA_CLOBBERS); \ + : "memory", "cc" extra_clbr); \ __ret = (rettype)__eax; \ } \ __ret; \ }) -#define __PVOP_VCALL(op, pre, post, ...) \ + +#define __PVOP_CALL(rettype, op, pre, post, ...) \ + ____PVOP_CALL(rettype, op, CLBR_ANY, PVOP_CALL_CLOBBERS, \ + EXTRA_CLOBBERS, pre, post, ##__VA_ARGS__) + +#define __PVOP_CALLEESAVE(rettype, op, pre, post, ...) \ + ____PVOP_CALL(rettype, op.func, CLBR_RET_REG, \ + PVOP_CALLEE_CLOBBERS, , \ + pre, post, ##__VA_ARGS__) + + +#define ____PVOP_VCALL(op, clbr, call_clbr, extra_clbr, pre, post, ...) \ ({ \ PVOP_VCALL_ARGS; \ PVOP_TEST_NULL(op); \ asm volatile(pre \ paravirt_alt(PARAVIRT_CALL) \ post \ - : PVOP_VCALL_CLOBBERS \ + : call_clbr \ : paravirt_type(op), \ - paravirt_clobber(CLBR_ANY), \ + paravirt_clobber(clbr), \ ##__VA_ARGS__ \ - : "memory", "cc" VEXTRA_CLOBBERS); \ + : "memory", "cc" extra_clbr); \ }) +#define __PVOP_VCALL(op, pre, post, ...) \ + ____PVOP_VCALL(op, CLBR_ANY, PVOP_VCALL_CLOBBERS, \ + VEXTRA_CLOBBERS, \ + pre, post, ##__VA_ARGS__) + +#define __PVOP_VCALLEESAVE(rettype, op, pre, post, ...) \ + ____PVOP_CALL(rettype, op.func, CLBR_RET_REG, \ + PVOP_VCALLEE_CLOBBERS, , \ + pre, post, ##__VA_ARGS__) + + + #define PVOP_CALL0(rettype, op) \ __PVOP_CALL(rettype, op, "", "") #define PVOP_VCALL0(op) \ __PVOP_VCALL(op, "", "") +#define PVOP_CALLEE0(rettype, op) \ + __PVOP_CALLEESAVE(rettype, op, "", "") +#define PVOP_VCALLEE0(op) \ + __PVOP_VCALLEESAVE(op, "", "") + + #define PVOP_CALL1(rettype, op, arg1) \ - __PVOP_CALL(rettype, op, "", "", "0" ((unsigned long)(arg1))) + __PVOP_CALL(rettype, op, "", "", PVOP_CALL_ARG1(arg1)) #define PVOP_VCALL1(op, arg1) \ - __PVOP_VCALL(op, "", "", "0" ((unsigned long)(arg1))) + __PVOP_VCALL(op, "", "", PVOP_CALL_ARG1(arg1)) + +#define PVOP_CALLEE1(rettype, op, arg1) \ + __PVOP_CALLEESAVE(rettype, op, "", "", PVOP_CALL_ARG1(arg1)) +#define PVOP_VCALLEE1(op, arg1) \ + __PVOP_VCALLEESAVE(op, "", "", PVOP_CALL_ARG1(arg1)) + #define PVOP_CALL2(rettype, op, arg1, arg2) \ - __PVOP_CALL(rettype, op, "", "", "0" ((unsigned long)(arg1)), \ - "1" ((unsigned long)(arg2))) + __PVOP_CALL(rettype, op, "", "", PVOP_CALL_ARG1(arg1), \ + PVOP_CALL_ARG2(arg2)) #define PVOP_VCALL2(op, arg1, arg2) \ - __PVOP_VCALL(op, "", "", "0" ((unsigned long)(arg1)), \ - "1" ((unsigned long)(arg2))) + __PVOP_VCALL(op, "", "", PVOP_CALL_ARG1(arg1), \ + PVOP_CALL_ARG2(arg2)) + +#define PVOP_CALLEE2(rettype, op, arg1, arg2) \ + __PVOP_CALLEESAVE(rettype, op, "", "", PVOP_CALL_ARG1(arg1), \ + PVOP_CALL_ARG2(arg2)) +#define PVOP_VCALLEE2(op, arg1, arg2) \ + __PVOP_VCALLEESAVE(op, "", "", PVOP_CALL_ARG1(arg1), \ + PVOP_CALL_ARG2(arg2)) + #define PVOP_CALL3(rettype, op, arg1, arg2, arg3) \ - __PVOP_CALL(rettype, op, "", "", "0" ((unsigned long)(arg1)), \ - "1"((unsigned long)(arg2)), "2"((unsigned long)(arg3))) + __PVOP_CALL(rettype, op, "", "", PVOP_CALL_ARG1(arg1), \ + PVOP_CALL_ARG2(arg2), PVOP_CALL_ARG3(arg3)) #define PVOP_VCALL3(op, arg1, arg2, arg3) \ - __PVOP_VCALL(op, "", "", "0" ((unsigned long)(arg1)), \ - "1"((unsigned long)(arg2)), "2"((unsigned long)(arg3))) + __PVOP_VCALL(op, "", "", PVOP_CALL_ARG1(arg1), \ + PVOP_CALL_ARG2(arg2), PVOP_CALL_ARG3(arg3)) /* This is the only difference in x86_64. We can make it much simpler */ #ifdef CONFIG_X86_32 #define PVOP_CALL4(rettype, op, arg1, arg2, arg3, arg4) \ __PVOP_CALL(rettype, op, \ "push %[_arg4];", "lea 4(%%esp),%%esp;", \ - "0" ((u32)(arg1)), "1" ((u32)(arg2)), \ - "2" ((u32)(arg3)), [_arg4] "mr" ((u32)(arg4))) + PVOP_CALL_ARG1(arg1), PVOP_CALL_ARG2(arg2), \ + PVOP_CALL_ARG3(arg3), [_arg4] "mr" ((u32)(arg4))) #define PVOP_VCALL4(op, arg1, arg2, arg3, arg4) \ __PVOP_VCALL(op, \ "push %[_arg4];", "lea 4(%%esp),%%esp;", \ @@ -587,13 +683,13 @@ int paravirt_disable_iospace(void); "2" ((u32)(arg3)), [_arg4] "mr" ((u32)(arg4))) #else #define PVOP_CALL4(rettype, op, arg1, arg2, arg3, arg4) \ - __PVOP_CALL(rettype, op, "", "", "0" ((unsigned long)(arg1)), \ - "1"((unsigned long)(arg2)), "2"((unsigned long)(arg3)), \ - "3"((unsigned long)(arg4))) + __PVOP_CALL(rettype, op, "", "", \ + PVOP_CALL_ARG1(arg1), PVOP_CALL_ARG2(arg2), \ + PVOP_CALL_ARG3(arg3), PVOP_CALL_ARG4(arg4)) #define PVOP_VCALL4(op, arg1, arg2, arg3, arg4) \ - __PVOP_VCALL(op, "", "", "0" ((unsigned long)(arg1)), \ - "1"((unsigned long)(arg2)), "2"((unsigned long)(arg3)), \ - "3"((unsigned long)(arg4))) + __PVOP_VCALL(op, "", "", \ + PVOP_CALL_ARG1(arg1), PVOP_CALL_ARG2(arg2), \ + PVOP_CALL_ARG3(arg3), PVOP_CALL_ARG4(arg4)) #endif static inline int paravirt_enabled(void) @@ -984,10 +1080,11 @@ static inline void __flush_tlb_single(un PVOP_VCALL1(pv_mmu_ops.flush_tlb_single, addr); } -static inline void flush_tlb_others(cpumask_t cpumask, struct mm_struct *mm, +static inline void flush_tlb_others(const struct cpumask *cpumask, + struct mm_struct *mm, unsigned long va) { - PVOP_VCALL3(pv_mmu_ops.flush_tlb_others, &cpumask, mm, va); + PVOP_VCALL3(pv_mmu_ops.flush_tlb_others, cpumask, mm, va); } static inline int paravirt_pgd_alloc(struct mm_struct *mm) @@ -1040,6 +1137,14 @@ static inline void *kmap_atomic_pte(stru ret = PVOP_CALL2(unsigned long, pv_mmu_ops.kmap_atomic_pte, page, type); return (void *)ret; } + +static inline void *kmap_atomic_pte_direct(struct page *page, enum km_type type) +{ + unsigned long ret; + ret = PVOP_CALL2(unsigned long, pv_mmu_ops.kmap_atomic_pte_direct, + page, type); + return (void *)ret; +} #endif static inline void pte_update(struct mm_struct *mm, unsigned long addr, @@ -1059,13 +1164,13 @@ static inline pte_t __pte(pteval_t val) pteval_t ret; if (sizeof(pteval_t) > sizeof(long)) - ret = PVOP_CALL2(pteval_t, - pv_mmu_ops.make_pte, - val, (u64)val >> 32); + ret = PVOP_CALLEE2(pteval_t, + pv_mmu_ops.make_pte, + val, (u64)val >> 32); else - ret = PVOP_CALL1(pteval_t, - pv_mmu_ops.make_pte, - val); + ret = PVOP_CALLEE1(pteval_t, + pv_mmu_ops.make_pte, + val); return (pte_t) { .pte = ret }; } @@ -1075,29 +1180,12 @@ static inline pteval_t pte_val(pte_t pte pteval_t ret; if (sizeof(pteval_t) > sizeof(long)) - ret = PVOP_CALL2(pteval_t, pv_mmu_ops.pte_val, - pte.pte, (u64)pte.pte >> 32); - else - ret = PVOP_CALL1(pteval_t, pv_mmu_ops.pte_val, - pte.pte); - - return ret; -} - -static inline pteval_t pte_flags(pte_t pte) -{ - pteval_t ret; - - if (sizeof(pteval_t) > sizeof(long)) - ret = PVOP_CALL2(pteval_t, pv_mmu_ops.pte_flags, - pte.pte, (u64)pte.pte >> 32); + ret = PVOP_CALLEE2(pteval_t, pv_mmu_ops.pte_val, + pte.pte, (u64)pte.pte >> 32); else - ret = PVOP_CALL1(pteval_t, pv_mmu_ops.pte_flags, - pte.pte); + ret = PVOP_CALLEE1(pteval_t, pv_mmu_ops.pte_val, + pte.pte); -#ifdef CONFIG_PARAVIRT_DEBUG - BUG_ON(ret & PTE_PFN_MASK); -#endif return ret; } @@ -1106,11 +1194,11 @@ static inline pgd_t __pgd(pgdval_t val) pgdval_t ret; if (sizeof(pgdval_t) > sizeof(long)) - ret = PVOP_CALL2(pgdval_t, pv_mmu_ops.make_pgd, - val, (u64)val >> 32); + ret = PVOP_CALLEE2(pgdval_t, pv_mmu_ops.make_pgd, + val, (u64)val >> 32); else - ret = PVOP_CALL1(pgdval_t, pv_mmu_ops.make_pgd, - val); + ret = PVOP_CALLEE1(pgdval_t, pv_mmu_ops.make_pgd, + val); return (pgd_t) { ret }; } @@ -1120,11 +1208,11 @@ static inline pgdval_t pgd_val(pgd_t pgd pgdval_t ret; if (sizeof(pgdval_t) > sizeof(long)) - ret = PVOP_CALL2(pgdval_t, pv_mmu_ops.pgd_val, - pgd.pgd, (u64)pgd.pgd >> 32); + ret = PVOP_CALLEE2(pgdval_t, pv_mmu_ops.pgd_val, + pgd.pgd, (u64)pgd.pgd >> 32); else - ret = PVOP_CALL1(pgdval_t, pv_mmu_ops.pgd_val, - pgd.pgd); + ret = PVOP_CALLEE1(pgdval_t, pv_mmu_ops.pgd_val, + pgd.pgd); return ret; } @@ -1188,11 +1276,11 @@ static inline pmd_t __pmd(pmdval_t val) pmdval_t ret; if (sizeof(pmdval_t) > sizeof(long)) - ret = PVOP_CALL2(pmdval_t, pv_mmu_ops.make_pmd, - val, (u64)val >> 32); + ret = PVOP_CALLEE2(pmdval_t, pv_mmu_ops.make_pmd, + val, (u64)val >> 32); else - ret = PVOP_CALL1(pmdval_t, pv_mmu_ops.make_pmd, - val); + ret = PVOP_CALLEE1(pmdval_t, pv_mmu_ops.make_pmd, + val); return (pmd_t) { ret }; } @@ -1202,11 +1290,11 @@ static inline pmdval_t pmd_val(pmd_t pmd pmdval_t ret; if (sizeof(pmdval_t) > sizeof(long)) - ret = PVOP_CALL2(pmdval_t, pv_mmu_ops.pmd_val, - pmd.pmd, (u64)pmd.pmd >> 32); + ret = PVOP_CALLEE2(pmdval_t, pv_mmu_ops.pmd_val, + pmd.pmd, (u64)pmd.pmd >> 32); else - ret = PVOP_CALL1(pmdval_t, pv_mmu_ops.pmd_val, - pmd.pmd); + ret = PVOP_CALLEE1(pmdval_t, pv_mmu_ops.pmd_val, + pmd.pmd); return ret; } @@ -1228,11 +1316,11 @@ static inline pud_t __pud(pudval_t val) pudval_t ret; if (sizeof(pudval_t) > sizeof(long)) - ret = PVOP_CALL2(pudval_t, pv_mmu_ops.make_pud, - val, (u64)val >> 32); + ret = PVOP_CALLEE2(pudval_t, pv_mmu_ops.make_pud, + val, (u64)val >> 32); else - ret = PVOP_CALL1(pudval_t, pv_mmu_ops.make_pud, - val); + ret = PVOP_CALLEE1(pudval_t, pv_mmu_ops.make_pud, + val); return (pud_t) { ret }; } @@ -1242,11 +1330,11 @@ static inline pudval_t pud_val(pud_t pud pudval_t ret; if (sizeof(pudval_t) > sizeof(long)) - ret = PVOP_CALL2(pudval_t, pv_mmu_ops.pud_val, - pud.pud, (u64)pud.pud >> 32); + ret = PVOP_CALLEE2(pudval_t, pv_mmu_ops.pud_val, + pud.pud, (u64)pud.pud >> 32); else - ret = PVOP_CALL1(pudval_t, pv_mmu_ops.pud_val, - pud.pud); + ret = PVOP_CALLEE1(pudval_t, pv_mmu_ops.pud_val, + pud.pud); return ret; } @@ -1374,9 +1462,10 @@ static inline void __set_fixmap(unsigned } void _paravirt_nop(void); -#define paravirt_nop ((void *)_paravirt_nop) +u32 _paravirt_ident_32(u32); +u64 _paravirt_ident_64(u64); -void paravirt_use_bytelocks(void); +#define paravirt_nop ((void *)_paravirt_nop) #ifdef CONFIG_SMP @@ -1426,12 +1515,37 @@ extern struct paravirt_patch_site __para __parainstructions_end[]; #ifdef CONFIG_X86_32 -#define PV_SAVE_REGS "pushl %%ecx; pushl %%edx;" -#define PV_RESTORE_REGS "popl %%edx; popl %%ecx" +#define PV_SAVE_REGS "pushl %ecx; pushl %edx;" +#define PV_RESTORE_REGS "popl %edx; popl %ecx;" + +/* save and restore all caller-save registers, except return value */ +#define PV_SAVE_ALL_CALLER_REGS "pushl %ecx;" +#define PV_RESTORE_ALL_CALLER_REGS "popl %ecx;" + #define PV_FLAGS_ARG "0" #define PV_EXTRA_CLOBBERS #define PV_VEXTRA_CLOBBERS #else +/* save and restore all caller-save registers, except return value */ +#define PV_SAVE_ALL_CALLER_REGS \ + "push %rcx;" \ + "push %rdx;" \ + "push %rsi;" \ + "push %rdi;" \ + "push %r8;" \ + "push %r9;" \ + "push %r10;" \ + "push %r11;" +#define PV_RESTORE_ALL_CALLER_REGS \ + "pop %r11;" \ + "pop %r10;" \ + "pop %r9;" \ + "pop %r8;" \ + "pop %rdi;" \ + "pop %rsi;" \ + "pop %rdx;" \ + "pop %rcx;" + /* We save some registers, but all of them, that's too much. We clobber all * caller saved registers but the argument parameter */ #define PV_SAVE_REGS "pushq %%rdi;" @@ -1441,52 +1555,76 @@ extern struct paravirt_patch_site __para #define PV_FLAGS_ARG "D" #endif +/* + * Generate a thunk around a function which saves all caller-save + * registers except for the return value. This allows C functions to + * be called from assembler code where fewer than normal registers are + * available. It may also help code generation around calls from C + * code if the common case doesn't use many registers. + * + * When a callee is wrapped in a thunk, the caller can assume that all + * arg regs and all scratch registers are preserved across the + * call. The return value in rax/eax will not be saved, even for void + * functions. + */ +#define PV_CALLEE_SAVE_REGS_THUNK(func) \ + extern typeof(func) __raw_callee_save_##func; \ + static void *__##func##__ __used = func; \ + \ + asm(".pushsection .text;" \ + "__raw_callee_save_" #func ": " \ + PV_SAVE_ALL_CALLER_REGS \ + "call " #func ";" \ + PV_RESTORE_ALL_CALLER_REGS \ + "ret;" \ + ".popsection") + +/* Get a reference to a callee-save function */ +#define PV_CALLEE_SAVE(func) \ + ((struct paravirt_callee_save) { __raw_callee_save_##func }) + +/* Promise that "func" already uses the right calling convention */ +#define __PV_IS_CALLEE_SAVE(func) \ + ((struct paravirt_callee_save) { func }) + static inline unsigned long __raw_local_save_flags(void) { unsigned long f; - asm volatile(paravirt_alt(PV_SAVE_REGS - PARAVIRT_CALL - PV_RESTORE_REGS) + asm volatile(paravirt_alt(PARAVIRT_CALL) : "=a"(f) : paravirt_type(pv_irq_ops.save_fl), paravirt_clobber(CLBR_EAX) - : "memory", "cc" PV_VEXTRA_CLOBBERS); + : "memory", "cc"); return f; } static inline void raw_local_irq_restore(unsigned long f) { - asm volatile(paravirt_alt(PV_SAVE_REGS - PARAVIRT_CALL - PV_RESTORE_REGS) + asm volatile(paravirt_alt(PARAVIRT_CALL) : "=a"(f) : PV_FLAGS_ARG(f), paravirt_type(pv_irq_ops.restore_fl), paravirt_clobber(CLBR_EAX) - : "memory", "cc" PV_EXTRA_CLOBBERS); + : "memory", "cc"); } static inline void raw_local_irq_disable(void) { - asm volatile(paravirt_alt(PV_SAVE_REGS - PARAVIRT_CALL - PV_RESTORE_REGS) + asm volatile(paravirt_alt(PARAVIRT_CALL) : : paravirt_type(pv_irq_ops.irq_disable), paravirt_clobber(CLBR_EAX) - : "memory", "eax", "cc" PV_EXTRA_CLOBBERS); + : "memory", "eax", "cc"); } static inline void raw_local_irq_enable(void) { - asm volatile(paravirt_alt(PV_SAVE_REGS - PARAVIRT_CALL - PV_RESTORE_REGS) + asm volatile(paravirt_alt(PARAVIRT_CALL) : : paravirt_type(pv_irq_ops.irq_enable), paravirt_clobber(CLBR_EAX) - : "memory", "eax", "cc" PV_EXTRA_CLOBBERS); + : "memory", "eax", "cc"); } static inline unsigned long __raw_local_irq_save(void) @@ -1529,33 +1667,49 @@ static inline unsigned long __raw_local_ .popsection +#define COND_PUSH(set, mask, reg) \ + .if ((~(set)) & mask); push %reg; .endif +#define COND_POP(set, mask, reg) \ + .if ((~(set)) & mask); pop %reg; .endif + #ifdef CONFIG_X86_64 -#define PV_SAVE_REGS \ - push %rax; \ - push %rcx; \ - push %rdx; \ - push %rsi; \ - push %rdi; \ - push %r8; \ - push %r9; \ - push %r10; \ - push %r11 -#define PV_RESTORE_REGS \ - pop %r11; \ - pop %r10; \ - pop %r9; \ - pop %r8; \ - pop %rdi; \ - pop %rsi; \ - pop %rdx; \ - pop %rcx; \ - pop %rax + +#define PV_SAVE_REGS(set) \ + COND_PUSH(set, CLBR_RAX, rax); \ + COND_PUSH(set, CLBR_RCX, rcx); \ + COND_PUSH(set, CLBR_RDX, rdx); \ + COND_PUSH(set, CLBR_RSI, rsi); \ + COND_PUSH(set, CLBR_RDI, rdi); \ + COND_PUSH(set, CLBR_R8, r8); \ + COND_PUSH(set, CLBR_R9, r9); \ + COND_PUSH(set, CLBR_R10, r10); \ + COND_PUSH(set, CLBR_R11, r11) +#define PV_RESTORE_REGS(set) \ + COND_POP(set, CLBR_R11, r11); \ + COND_POP(set, CLBR_R10, r10); \ + COND_POP(set, CLBR_R9, r9); \ + COND_POP(set, CLBR_R8, r8); \ + COND_POP(set, CLBR_RDI, rdi); \ + COND_POP(set, CLBR_RSI, rsi); \ + COND_POP(set, CLBR_RDX, rdx); \ + COND_POP(set, CLBR_RCX, rcx); \ + COND_POP(set, CLBR_RAX, rax) + #define PARA_PATCH(struct, off) ((PARAVIRT_PATCH_##struct + (off)) / 8) #define PARA_SITE(ptype, clobbers, ops) _PVSITE(ptype, clobbers, ops, .quad, 8) #define PARA_INDIRECT(addr) *addr(%rip) #else -#define PV_SAVE_REGS pushl %eax; pushl %edi; pushl %ecx; pushl %edx -#define PV_RESTORE_REGS popl %edx; popl %ecx; popl %edi; popl %eax +#define PV_SAVE_REGS(set) \ + COND_PUSH(set, CLBR_EAX, eax); \ + COND_PUSH(set, CLBR_EDI, edi); \ + COND_PUSH(set, CLBR_ECX, ecx); \ + COND_PUSH(set, CLBR_EDX, edx) +#define PV_RESTORE_REGS(set) \ + COND_POP(set, CLBR_EDX, edx); \ + COND_POP(set, CLBR_ECX, ecx); \ + COND_POP(set, CLBR_EDI, edi); \ + COND_POP(set, CLBR_EAX, eax) + #define PARA_PATCH(struct, off) ((PARAVIRT_PATCH_##struct + (off)) / 4) #define PARA_SITE(ptype, clobbers, ops) _PVSITE(ptype, clobbers, ops, .long, 4) #define PARA_INDIRECT(addr) *%cs:addr @@ -1567,15 +1721,15 @@ static inline unsigned long __raw_local_ #define DISABLE_INTERRUPTS(clobbers) \ PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_disable), clobbers, \ - PV_SAVE_REGS; \ + PV_SAVE_REGS(clobbers | CLBR_CALLEE_SAVE); \ call PARA_INDIRECT(pv_irq_ops+PV_IRQ_irq_disable); \ - PV_RESTORE_REGS;) \ + PV_RESTORE_REGS(clobbers | CLBR_CALLEE_SAVE);) #define ENABLE_INTERRUPTS(clobbers) \ PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_enable), clobbers, \ - PV_SAVE_REGS; \ + PV_SAVE_REGS(clobbers | CLBR_CALLEE_SAVE); \ call PARA_INDIRECT(pv_irq_ops+PV_IRQ_irq_enable); \ - PV_RESTORE_REGS;) + PV_RESTORE_REGS(clobbers | CLBR_CALLEE_SAVE);) #define USERGS_SYSRET32 \ PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_usergs_sysret32), \ @@ -1605,11 +1759,15 @@ static inline unsigned long __raw_local_ PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_swapgs), CLBR_NONE, \ swapgs) +/* + * Note: swapgs is very special, and in practise is either going to be + * implemented with a single "swapgs" instruction or something very + * special. Either way, we don't need to save any registers for + * it. + */ #define SWAPGS \ PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_swapgs), CLBR_NONE, \ - PV_SAVE_REGS; \ - call PARA_INDIRECT(pv_cpu_ops+PV_CPU_swapgs); \ - PV_RESTORE_REGS \ + call PARA_INDIRECT(pv_cpu_ops+PV_CPU_swapgs) \ ) #define GET_CR2_INTO_RCX \ Index: linux-2.6-tip/arch/x86/include/asm/pat.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/pat.h +++ linux-2.6-tip/arch/x86/include/asm/pat.h @@ -5,10 +5,8 @@ #ifdef CONFIG_X86_PAT extern int pat_enabled; -extern void validate_pat_support(struct cpuinfo_x86 *c); #else static const int pat_enabled; -static inline void validate_pat_support(struct cpuinfo_x86 *c) { } #endif extern void pat_init(void); @@ -17,6 +15,4 @@ extern int reserve_memtype(u64 start, u6 unsigned long req_type, unsigned long *ret_type); extern int free_memtype(u64 start, u64 end); -extern void pat_disable(char *reason); - #endif /* _ASM_X86_PAT_H */ Index: linux-2.6-tip/arch/x86/include/asm/pci-functions.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/pci-functions.h @@ -0,0 +1,19 @@ +/* + * PCI BIOS function numbering for conventional PCI BIOS + * systems + */ + +#define PCIBIOS_PCI_FUNCTION_ID 0xb1XX +#define PCIBIOS_PCI_BIOS_PRESENT 0xb101 +#define PCIBIOS_FIND_PCI_DEVICE 0xb102 +#define PCIBIOS_FIND_PCI_CLASS_CODE 0xb103 +#define PCIBIOS_GENERATE_SPECIAL_CYCLE 0xb106 +#define PCIBIOS_READ_CONFIG_BYTE 0xb108 +#define PCIBIOS_READ_CONFIG_WORD 0xb109 +#define PCIBIOS_READ_CONFIG_DWORD 0xb10a +#define PCIBIOS_WRITE_CONFIG_BYTE 0xb10b +#define PCIBIOS_WRITE_CONFIG_WORD 0xb10c +#define PCIBIOS_WRITE_CONFIG_DWORD 0xb10d +#define PCIBIOS_GET_ROUTING_OPTIONS 0xb10e +#define PCIBIOS_SET_PCI_HW_INT 0xb10f + Index: linux-2.6-tip/arch/x86/include/asm/pda.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/pda.h +++ /dev/null @@ -1,137 +0,0 @@ -#ifndef _ASM_X86_PDA_H -#define _ASM_X86_PDA_H - -#ifndef __ASSEMBLY__ -#include -#include -#include -#include - -/* Per processor datastructure. %gs points to it while the kernel runs */ -struct x8664_pda { - struct task_struct *pcurrent; /* 0 Current process */ - unsigned long data_offset; /* 8 Per cpu data offset from linker - address */ - unsigned long kernelstack; /* 16 top of kernel stack for current */ - unsigned long oldrsp; /* 24 user rsp for system call */ - int irqcount; /* 32 Irq nesting counter. Starts -1 */ - unsigned int cpunumber; /* 36 Logical CPU number */ -#ifdef CONFIG_CC_STACKPROTECTOR - unsigned long stack_canary; /* 40 stack canary value */ - /* gcc-ABI: this canary MUST be at - offset 40!!! */ -#endif - char *irqstackptr; - short nodenumber; /* number of current node (32k max) */ - short in_bootmem; /* pda lives in bootmem */ - unsigned int __softirq_pending; - unsigned int __nmi_count; /* number of NMI on this CPUs */ - short mmu_state; - short isidle; - struct mm_struct *active_mm; - unsigned apic_timer_irqs; - unsigned irq0_irqs; - unsigned irq_resched_count; - unsigned irq_call_count; - unsigned irq_tlb_count; - unsigned irq_thermal_count; - unsigned irq_threshold_count; - unsigned irq_spurious_count; -} ____cacheline_aligned_in_smp; - -extern struct x8664_pda **_cpu_pda; -extern void pda_init(int); - -#define cpu_pda(i) (_cpu_pda[i]) - -/* - * There is no fast way to get the base address of the PDA, all the accesses - * have to mention %fs/%gs. So it needs to be done this Torvaldian way. - */ -extern void __bad_pda_field(void) __attribute__((noreturn)); - -/* - * proxy_pda doesn't actually exist, but tell gcc it is accessed for - * all PDA accesses so it gets read/write dependencies right. - */ -extern struct x8664_pda _proxy_pda; - -#define pda_offset(field) offsetof(struct x8664_pda, field) - -#define pda_to_op(op, field, val) \ -do { \ - typedef typeof(_proxy_pda.field) T__; \ - if (0) { T__ tmp__; tmp__ = (val); } /* type checking */ \ - switch (sizeof(_proxy_pda.field)) { \ - case 2: \ - asm(op "w %1,%%gs:%c2" : \ - "+m" (_proxy_pda.field) : \ - "ri" ((T__)val), \ - "i"(pda_offset(field))); \ - break; \ - case 4: \ - asm(op "l %1,%%gs:%c2" : \ - "+m" (_proxy_pda.field) : \ - "ri" ((T__)val), \ - "i" (pda_offset(field))); \ - break; \ - case 8: \ - asm(op "q %1,%%gs:%c2": \ - "+m" (_proxy_pda.field) : \ - "ri" ((T__)val), \ - "i"(pda_offset(field))); \ - break; \ - default: \ - __bad_pda_field(); \ - } \ -} while (0) - -#define pda_from_op(op, field) \ -({ \ - typeof(_proxy_pda.field) ret__; \ - switch (sizeof(_proxy_pda.field)) { \ - case 2: \ - asm(op "w %%gs:%c1,%0" : \ - "=r" (ret__) : \ - "i" (pda_offset(field)), \ - "m" (_proxy_pda.field)); \ - break; \ - case 4: \ - asm(op "l %%gs:%c1,%0": \ - "=r" (ret__): \ - "i" (pda_offset(field)), \ - "m" (_proxy_pda.field)); \ - break; \ - case 8: \ - asm(op "q %%gs:%c1,%0": \ - "=r" (ret__) : \ - "i" (pda_offset(field)), \ - "m" (_proxy_pda.field)); \ - break; \ - default: \ - __bad_pda_field(); \ - } \ - ret__; \ -}) - -#define read_pda(field) pda_from_op("mov", field) -#define write_pda(field, val) pda_to_op("mov", field, val) -#define add_pda(field, val) pda_to_op("add", field, val) -#define sub_pda(field, val) pda_to_op("sub", field, val) -#define or_pda(field, val) pda_to_op("or", field, val) - -/* This is not atomic against other CPUs -- CPU preemption needs to be off */ -#define test_and_clear_bit_pda(bit, field) \ -({ \ - int old__; \ - asm volatile("btr %2,%%gs:%c3\n\tsbbl %0,%0" \ - : "=r" (old__), "+m" (_proxy_pda.field) \ - : "dIr" (bit), "i" (pda_offset(field)) : "memory");\ - old__; \ -}) - -#endif - -#define PDA_STACKOFFSET (5*8) - -#endif /* _ASM_X86_PDA_H */ Index: linux-2.6-tip/arch/x86/include/asm/percpu.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/percpu.h +++ linux-2.6-tip/arch/x86/include/asm/percpu.h @@ -2,53 +2,12 @@ #define _ASM_X86_PERCPU_H #ifdef CONFIG_X86_64 -#include - -/* Same as asm-generic/percpu.h, except that we store the per cpu offset - in the PDA. Longer term the PDA and every per cpu variable - should be just put into a single section and referenced directly - from %gs */ - -#ifdef CONFIG_SMP -#include - -#define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset) -#define __my_cpu_offset read_pda(data_offset) - -#define per_cpu_offset(x) (__per_cpu_offset(x)) - +#define __percpu_seg gs +#define __percpu_mov_op movq +#else +#define __percpu_seg fs +#define __percpu_mov_op movl #endif -#include - -DECLARE_PER_CPU(struct x8664_pda, pda); - -/* - * These are supposed to be implemented as a single instruction which - * operates on the per-cpu data base segment. x86-64 doesn't have - * that yet, so this is a fairly inefficient workaround for the - * meantime. The single instruction is atomic with respect to - * preemption and interrupts, so we need to explicitly disable - * interrupts here to achieve the same effect. However, because it - * can be used from within interrupt-disable/enable, we can't actually - * disable interrupts; disabling preemption is enough. - */ -#define x86_read_percpu(var) \ - ({ \ - typeof(per_cpu_var(var)) __tmp; \ - preempt_disable(); \ - __tmp = __get_cpu_var(var); \ - preempt_enable(); \ - __tmp; \ - }) - -#define x86_write_percpu(var, val) \ - do { \ - preempt_disable(); \ - __get_cpu_var(var) = (val); \ - preempt_enable(); \ - } while(0) - -#else /* CONFIG_X86_64 */ #ifdef __ASSEMBLY__ @@ -65,47 +24,48 @@ DECLARE_PER_CPU(struct x8664_pda, pda); * PER_CPU(cpu_gdt_descr, %ebx) */ #ifdef CONFIG_SMP -#define PER_CPU(var, reg) \ - movl %fs:per_cpu__##this_cpu_off, reg; \ +#define PER_CPU(var, reg) \ + __percpu_mov_op %__percpu_seg:per_cpu__this_cpu_off, reg; \ lea per_cpu__##var(reg), reg -#define PER_CPU_VAR(var) %fs:per_cpu__##var +#define PER_CPU_VAR(var) %__percpu_seg:per_cpu__##var #else /* ! SMP */ -#define PER_CPU(var, reg) \ - movl $per_cpu__##var, reg +#define PER_CPU(var, reg) \ + __percpu_mov_op $per_cpu__##var, reg #define PER_CPU_VAR(var) per_cpu__##var #endif /* SMP */ +#ifdef CONFIG_X86_64_SMP +#define INIT_PER_CPU_VAR(var) init_per_cpu__##var +#else +#define INIT_PER_CPU_VAR(var) per_cpu__##var +#endif + #else /* ...!ASSEMBLY */ +#include + +#ifdef CONFIG_SMP +#define __percpu_arg(x) "%%"__stringify(__percpu_seg)":%P" #x +#define __my_cpu_offset percpu_read(this_cpu_off) +#else +#define __percpu_arg(x) "%" #x +#endif + /* - * PER_CPU finds an address of a per-cpu variable. + * Initialized pointers to per-cpu variables needed for the boot + * processor need to use these macros to get the proper address + * offset from __per_cpu_load on SMP. * - * Args: - * var - variable name - * cpu - 32bit register containing the current CPU number - * - * The resulting address is stored in the "cpu" argument. - * - * Example: - * PER_CPU(cpu_gdt_descr, %ebx) + * There also must be an entry in vmlinux_64.lds.S */ -#ifdef CONFIG_SMP - -#define __my_cpu_offset x86_read_percpu(this_cpu_off) - -/* fs segment starts at (positive) offset == __per_cpu_offset[cpu] */ -#define __percpu_seg "%%fs:" - -#else /* !SMP */ - -#define __percpu_seg "" - -#endif /* SMP */ - -#include +#define DECLARE_INIT_PER_CPU(var) \ + extern typeof(per_cpu_var(var)) init_per_cpu_var(var) -/* We can use this directly for local CPU (faster). */ -DECLARE_PER_CPU(unsigned long, this_cpu_off); +#ifdef CONFIG_X86_64_SMP +#define init_per_cpu_var(var) init_per_cpu__##var +#else +#define init_per_cpu_var(var) per_cpu_var(var) +#endif /* For arch-specific code, we can use direct single-insn ops (they * don't give an lvalue though). */ @@ -120,20 +80,25 @@ do { \ } \ switch (sizeof(var)) { \ case 1: \ - asm(op "b %1,"__percpu_seg"%0" \ + asm(op "b %1,"__percpu_arg(0) \ : "+m" (var) \ : "ri" ((T__)val)); \ break; \ case 2: \ - asm(op "w %1,"__percpu_seg"%0" \ + asm(op "w %1,"__percpu_arg(0) \ : "+m" (var) \ : "ri" ((T__)val)); \ break; \ case 4: \ - asm(op "l %1,"__percpu_seg"%0" \ + asm(op "l %1,"__percpu_arg(0) \ : "+m" (var) \ : "ri" ((T__)val)); \ break; \ + case 8: \ + asm(op "q %1,"__percpu_arg(0) \ + : "+m" (var) \ + : "re" ((T__)val)); \ + break; \ default: __bad_percpu_size(); \ } \ } while (0) @@ -143,17 +108,22 @@ do { \ typeof(var) ret__; \ switch (sizeof(var)) { \ case 1: \ - asm(op "b "__percpu_seg"%1,%0" \ + asm(op "b "__percpu_arg(1)",%0" \ : "=r" (ret__) \ : "m" (var)); \ break; \ case 2: \ - asm(op "w "__percpu_seg"%1,%0" \ + asm(op "w "__percpu_arg(1)",%0" \ : "=r" (ret__) \ : "m" (var)); \ break; \ case 4: \ - asm(op "l "__percpu_seg"%1,%0" \ + asm(op "l "__percpu_arg(1)",%0" \ + : "=r" (ret__) \ + : "m" (var)); \ + break; \ + case 8: \ + asm(op "q "__percpu_arg(1)",%0" \ : "=r" (ret__) \ : "m" (var)); \ break; \ @@ -162,13 +132,30 @@ do { \ ret__; \ }) -#define x86_read_percpu(var) percpu_from_op("mov", per_cpu__##var) -#define x86_write_percpu(var, val) percpu_to_op("mov", per_cpu__##var, val) -#define x86_add_percpu(var, val) percpu_to_op("add", per_cpu__##var, val) -#define x86_sub_percpu(var, val) percpu_to_op("sub", per_cpu__##var, val) -#define x86_or_percpu(var, val) percpu_to_op("or", per_cpu__##var, val) +#define percpu_read(var) percpu_from_op("mov", per_cpu__##var) +#define percpu_write(var, val) percpu_to_op("mov", per_cpu__##var, val) +#define percpu_add(var, val) percpu_to_op("add", per_cpu__##var, val) +#define percpu_sub(var, val) percpu_to_op("sub", per_cpu__##var, val) +#define percpu_and(var, val) percpu_to_op("and", per_cpu__##var, val) +#define percpu_or(var, val) percpu_to_op("or", per_cpu__##var, val) +#define percpu_xor(var, val) percpu_to_op("xor", per_cpu__##var, val) + +/* This is not atomic against other CPUs -- CPU preemption needs to be off */ +#define x86_test_and_clear_bit_percpu(bit, var) \ +({ \ + int old__; \ + asm volatile("btr %2,"__percpu_arg(1)"\n\tsbbl %0,%0" \ + : "=r" (old__), "+m" (per_cpu__##var) \ + : "dIr" (bit)); \ + old__; \ +}) + +#include + +/* We can use this directly for local CPU (faster). */ +DECLARE_PER_CPU(unsigned long, this_cpu_off); + #endif /* !__ASSEMBLY__ */ -#endif /* !CONFIG_X86_64 */ #ifdef CONFIG_SMP @@ -195,9 +182,9 @@ do { \ #define early_per_cpu_ptr(_name) (_name##_early_ptr) #define early_per_cpu_map(_name, _idx) (_name##_early_map[_idx]) #define early_per_cpu(_name, _cpu) \ - (early_per_cpu_ptr(_name) ? \ - early_per_cpu_ptr(_name)[_cpu] : \ - per_cpu(_name, _cpu)) + *(early_per_cpu_ptr(_name) ? \ + &early_per_cpu_ptr(_name)[_cpu] : \ + &per_cpu(_name, _cpu)) #else /* !CONFIG_SMP */ #define DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) \ Index: linux-2.6-tip/arch/x86/include/asm/perf_counter.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/perf_counter.h @@ -0,0 +1,95 @@ +#ifndef _ASM_X86_PERF_COUNTER_H +#define _ASM_X86_PERF_COUNTER_H + +/* + * Performance counter hw details: + */ + +#define X86_PMC_MAX_GENERIC 8 +#define X86_PMC_MAX_FIXED 3 + +#define X86_PMC_IDX_GENERIC 0 +#define X86_PMC_IDX_FIXED 32 +#define X86_PMC_IDX_MAX 64 + +#define MSR_ARCH_PERFMON_PERFCTR0 0xc1 +#define MSR_ARCH_PERFMON_PERFCTR1 0xc2 + +#define MSR_ARCH_PERFMON_EVENTSEL0 0x186 +#define MSR_ARCH_PERFMON_EVENTSEL1 0x187 + +#define ARCH_PERFMON_EVENTSEL0_ENABLE (1 << 22) +#define ARCH_PERFMON_EVENTSEL_INT (1 << 20) +#define ARCH_PERFMON_EVENTSEL_OS (1 << 17) +#define ARCH_PERFMON_EVENTSEL_USR (1 << 16) + +/* + * Includes eventsel and unit mask as well: + */ +#define ARCH_PERFMON_EVENT_MASK 0xffff + +#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_SEL 0x3c +#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_UMASK (0x00 << 8) +#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX 0 +#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_PRESENT \ + (1 << (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX)) + +#define ARCH_PERFMON_BRANCH_MISSES_RETIRED 6 + +/* + * Intel "Architectural Performance Monitoring" CPUID + * detection/enumeration details: + */ +union cpuid10_eax { + struct { + unsigned int version_id:8; + unsigned int num_counters:8; + unsigned int bit_width:8; + unsigned int mask_length:8; + } split; + unsigned int full; +}; + +union cpuid10_edx { + struct { + unsigned int num_counters_fixed:4; + unsigned int reserved:28; + } split; + unsigned int full; +}; + + +/* + * Fixed-purpose performance counters: + */ + +/* + * All 3 fixed-mode PMCs are configured via this single MSR: + */ +#define MSR_ARCH_PERFMON_FIXED_CTR_CTRL 0x38d + +/* + * The counts are available in three separate MSRs: + */ + +/* Instr_Retired.Any: */ +#define MSR_ARCH_PERFMON_FIXED_CTR0 0x309 +#define X86_PMC_IDX_FIXED_INSTRUCTIONS (X86_PMC_IDX_FIXED + 0) + +/* CPU_CLK_Unhalted.Core: */ +#define MSR_ARCH_PERFMON_FIXED_CTR1 0x30a +#define X86_PMC_IDX_FIXED_CPU_CYCLES (X86_PMC_IDX_FIXED + 1) + +/* CPU_CLK_Unhalted.Ref: */ +#define MSR_ARCH_PERFMON_FIXED_CTR2 0x30b +#define X86_PMC_IDX_FIXED_BUS_CYCLES (X86_PMC_IDX_FIXED + 2) + +#ifdef CONFIG_PERF_COUNTERS +extern void init_hw_perf_counters(void); +extern void perf_counters_lapic_init(int nmi); +#else +static inline void init_hw_perf_counters(void) { } +static inline void perf_counters_lapic_init(int nmi) { } +#endif + +#endif /* _ASM_X86_PERF_COUNTER_H */ Index: linux-2.6-tip/arch/x86/include/asm/pgtable-2level-defs.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/pgtable-2level-defs.h +++ /dev/null @@ -1,20 +0,0 @@ -#ifndef _ASM_X86_PGTABLE_2LEVEL_DEFS_H -#define _ASM_X86_PGTABLE_2LEVEL_DEFS_H - -#define SHARED_KERNEL_PMD 0 - -/* - * traditional i386 two-level paging structure: - */ - -#define PGDIR_SHIFT 22 -#define PTRS_PER_PGD 1024 - -/* - * the i386 is two-level, so we don't really have any - * PMD directory physically. - */ - -#define PTRS_PER_PTE 1024 - -#endif /* _ASM_X86_PGTABLE_2LEVEL_DEFS_H */ Index: linux-2.6-tip/arch/x86/include/asm/pgtable-2level.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/pgtable-2level.h +++ linux-2.6-tip/arch/x86/include/asm/pgtable-2level.h @@ -53,8 +53,6 @@ static inline pte_t native_ptep_get_and_ #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp) #endif -#define pte_none(x) (!(x).pte_low) - /* * Bits _PAGE_BIT_PRESENT, _PAGE_BIT_FILE and _PAGE_BIT_PROTNONE are taken, * split up the 29 bits of offset into this range: Index: linux-2.6-tip/arch/x86/include/asm/pgtable-2level_types.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/pgtable-2level_types.h @@ -0,0 +1,37 @@ +#ifndef _ASM_X86_PGTABLE_2LEVEL_DEFS_H +#define _ASM_X86_PGTABLE_2LEVEL_DEFS_H + +#ifndef __ASSEMBLY__ +#include + +typedef unsigned long pteval_t; +typedef unsigned long pmdval_t; +typedef unsigned long pudval_t; +typedef unsigned long pgdval_t; +typedef unsigned long pgprotval_t; + +typedef union { + pteval_t pte; + pteval_t pte_low; +} pte_t; +#endif /* !__ASSEMBLY__ */ + +#define SHARED_KERNEL_PMD 0 +#define PAGETABLE_LEVELS 2 + +/* + * traditional i386 two-level paging structure: + */ + +#define PGDIR_SHIFT 22 +#define PTRS_PER_PGD 1024 + + +/* + * the i386 is two-level, so we don't really have any + * PMD directory physically. + */ + +#define PTRS_PER_PTE 1024 + +#endif /* _ASM_X86_PGTABLE_2LEVEL_DEFS_H */ Index: linux-2.6-tip/arch/x86/include/asm/pgtable-3level-defs.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/pgtable-3level-defs.h +++ /dev/null @@ -1,28 +0,0 @@ -#ifndef _ASM_X86_PGTABLE_3LEVEL_DEFS_H -#define _ASM_X86_PGTABLE_3LEVEL_DEFS_H - -#ifdef CONFIG_PARAVIRT -#define SHARED_KERNEL_PMD (pv_info.shared_kernel_pmd) -#else -#define SHARED_KERNEL_PMD 1 -#endif - -/* - * PGDIR_SHIFT determines what a top-level page table entry can map - */ -#define PGDIR_SHIFT 30 -#define PTRS_PER_PGD 4 - -/* - * PMD_SHIFT determines the size of the area a middle-level - * page table can map - */ -#define PMD_SHIFT 21 -#define PTRS_PER_PMD 512 - -/* - * entries per page directory level - */ -#define PTRS_PER_PTE 512 - -#endif /* _ASM_X86_PGTABLE_3LEVEL_DEFS_H */ Index: linux-2.6-tip/arch/x86/include/asm/pgtable-3level.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/pgtable-3level.h +++ linux-2.6-tip/arch/x86/include/asm/pgtable-3level.h @@ -18,21 +18,6 @@ printk("%s:%d: bad pgd %p(%016Lx).\n", \ __FILE__, __LINE__, &(e), pgd_val(e)) -static inline int pud_none(pud_t pud) -{ - return pud_val(pud) == 0; -} - -static inline int pud_bad(pud_t pud) -{ - return (pud_val(pud) & ~(PTE_PFN_MASK | _KERNPG_TABLE | _PAGE_USER)) != 0; -} - -static inline int pud_present(pud_t pud) -{ - return pud_val(pud) & _PAGE_PRESENT; -} - /* Rules for using set_pte: the pte being assigned *must* be * either not present or in a state where the hardware will * not attempt to update the pte. In places where this is @@ -120,15 +105,6 @@ static inline void pud_clear(pud_t *pudp write_cr3(pgd); } -#define pud_page(pud) pfn_to_page(pud_val(pud) >> PAGE_SHIFT) - -#define pud_page_vaddr(pud) ((unsigned long) __va(pud_val(pud) & PTE_PFN_MASK)) - - -/* Find an entry in the second-level page table.. */ -#define pmd_offset(pud, address) ((pmd_t *)pud_page_vaddr(*(pud)) + \ - pmd_index(address)) - #ifdef CONFIG_SMP static inline pte_t native_ptep_get_and_clear(pte_t *ptep) { @@ -145,17 +121,6 @@ static inline pte_t native_ptep_get_and_ #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp) #endif -#define __HAVE_ARCH_PTE_SAME -static inline int pte_same(pte_t a, pte_t b) -{ - return a.pte_low == b.pte_low && a.pte_high == b.pte_high; -} - -static inline int pte_none(pte_t pte) -{ - return !pte.pte_low && !pte.pte_high; -} - /* * Bits 0, 6 and 7 are taken in the low part of the pte, * put the 32 bits of offset into the high part. Index: linux-2.6-tip/arch/x86/include/asm/pgtable-3level_types.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/pgtable-3level_types.h @@ -0,0 +1,48 @@ +#ifndef _ASM_X86_PGTABLE_3LEVEL_DEFS_H +#define _ASM_X86_PGTABLE_3LEVEL_DEFS_H + +#ifndef __ASSEMBLY__ +#include + +typedef u64 pteval_t; +typedef u64 pmdval_t; +typedef u64 pudval_t; +typedef u64 pgdval_t; +typedef u64 pgprotval_t; + +typedef union { + struct { + unsigned long pte_low, pte_high; + }; + pteval_t pte; +} pte_t; +#endif /* !__ASSEMBLY__ */ + +#ifdef CONFIG_PARAVIRT +#define SHARED_KERNEL_PMD (pv_info.shared_kernel_pmd) +#else +#define SHARED_KERNEL_PMD 1 +#endif + +#define PAGETABLE_LEVELS 3 + +/* + * PGDIR_SHIFT determines what a top-level page table entry can map + */ +#define PGDIR_SHIFT 30 +#define PTRS_PER_PGD 4 + +/* + * PMD_SHIFT determines the size of the area a middle-level + * page table can map + */ +#define PMD_SHIFT 21 +#define PTRS_PER_PMD 512 + +/* + * entries per page directory level + */ +#define PTRS_PER_PTE 512 + + +#endif /* _ASM_X86_PGTABLE_3LEVEL_DEFS_H */ Index: linux-2.6-tip/arch/x86/include/asm/pgtable.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/pgtable.h +++ linux-2.6-tip/arch/x86/include/asm/pgtable.h @@ -1,164 +1,9 @@ #ifndef _ASM_X86_PGTABLE_H #define _ASM_X86_PGTABLE_H -#define FIRST_USER_ADDRESS 0 +#include -#define _PAGE_BIT_PRESENT 0 /* is present */ -#define _PAGE_BIT_RW 1 /* writeable */ -#define _PAGE_BIT_USER 2 /* userspace addressable */ -#define _PAGE_BIT_PWT 3 /* page write through */ -#define _PAGE_BIT_PCD 4 /* page cache disabled */ -#define _PAGE_BIT_ACCESSED 5 /* was accessed (raised by CPU) */ -#define _PAGE_BIT_DIRTY 6 /* was written to (raised by CPU) */ -#define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page */ -#define _PAGE_BIT_PAT 7 /* on 4KB pages */ -#define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */ -#define _PAGE_BIT_UNUSED1 9 /* available for programmer */ -#define _PAGE_BIT_IOMAP 10 /* flag used to indicate IO mapping */ -#define _PAGE_BIT_UNUSED3 11 -#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ -#define _PAGE_BIT_SPECIAL _PAGE_BIT_UNUSED1 -#define _PAGE_BIT_CPA_TEST _PAGE_BIT_UNUSED1 -#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */ - -/* If _PAGE_BIT_PRESENT is clear, we use these: */ -/* - if the user mapped it with PROT_NONE; pte_present gives true */ -#define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL -/* - set: nonlinear file mapping, saved PTE; unset:swap */ -#define _PAGE_BIT_FILE _PAGE_BIT_DIRTY - -#define _PAGE_PRESENT (_AT(pteval_t, 1) << _PAGE_BIT_PRESENT) -#define _PAGE_RW (_AT(pteval_t, 1) << _PAGE_BIT_RW) -#define _PAGE_USER (_AT(pteval_t, 1) << _PAGE_BIT_USER) -#define _PAGE_PWT (_AT(pteval_t, 1) << _PAGE_BIT_PWT) -#define _PAGE_PCD (_AT(pteval_t, 1) << _PAGE_BIT_PCD) -#define _PAGE_ACCESSED (_AT(pteval_t, 1) << _PAGE_BIT_ACCESSED) -#define _PAGE_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_DIRTY) -#define _PAGE_PSE (_AT(pteval_t, 1) << _PAGE_BIT_PSE) -#define _PAGE_GLOBAL (_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL) -#define _PAGE_UNUSED1 (_AT(pteval_t, 1) << _PAGE_BIT_UNUSED1) -#define _PAGE_IOMAP (_AT(pteval_t, 1) << _PAGE_BIT_IOMAP) -#define _PAGE_UNUSED3 (_AT(pteval_t, 1) << _PAGE_BIT_UNUSED3) -#define _PAGE_PAT (_AT(pteval_t, 1) << _PAGE_BIT_PAT) -#define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE) -#define _PAGE_SPECIAL (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL) -#define _PAGE_CPA_TEST (_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST) -#define __HAVE_ARCH_PTE_SPECIAL - -#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) -#define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX) -#else -#define _PAGE_NX (_AT(pteval_t, 0)) -#endif - -#define _PAGE_FILE (_AT(pteval_t, 1) << _PAGE_BIT_FILE) -#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE) - -#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \ - _PAGE_ACCESSED | _PAGE_DIRTY) -#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \ - _PAGE_DIRTY) - -/* Set of bits not changed in pte_modify */ -#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \ - _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY) - -#define _PAGE_CACHE_MASK (_PAGE_PCD | _PAGE_PWT) -#define _PAGE_CACHE_WB (0) -#define _PAGE_CACHE_WC (_PAGE_PWT) -#define _PAGE_CACHE_UC_MINUS (_PAGE_PCD) -#define _PAGE_CACHE_UC (_PAGE_PCD | _PAGE_PWT) - -#define PAGE_NONE __pgprot(_PAGE_PROTNONE | _PAGE_ACCESSED) -#define PAGE_SHARED __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \ - _PAGE_ACCESSED | _PAGE_NX) - -#define PAGE_SHARED_EXEC __pgprot(_PAGE_PRESENT | _PAGE_RW | \ - _PAGE_USER | _PAGE_ACCESSED) -#define PAGE_COPY_NOEXEC __pgprot(_PAGE_PRESENT | _PAGE_USER | \ - _PAGE_ACCESSED | _PAGE_NX) -#define PAGE_COPY_EXEC __pgprot(_PAGE_PRESENT | _PAGE_USER | \ - _PAGE_ACCESSED) -#define PAGE_COPY PAGE_COPY_NOEXEC -#define PAGE_READONLY __pgprot(_PAGE_PRESENT | _PAGE_USER | \ - _PAGE_ACCESSED | _PAGE_NX) -#define PAGE_READONLY_EXEC __pgprot(_PAGE_PRESENT | _PAGE_USER | \ - _PAGE_ACCESSED) - -#define __PAGE_KERNEL_EXEC \ - (_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_GLOBAL) -#define __PAGE_KERNEL (__PAGE_KERNEL_EXEC | _PAGE_NX) - -#define __PAGE_KERNEL_RO (__PAGE_KERNEL & ~_PAGE_RW) -#define __PAGE_KERNEL_RX (__PAGE_KERNEL_EXEC & ~_PAGE_RW) -#define __PAGE_KERNEL_EXEC_NOCACHE (__PAGE_KERNEL_EXEC | _PAGE_PCD | _PAGE_PWT) -#define __PAGE_KERNEL_WC (__PAGE_KERNEL | _PAGE_CACHE_WC) -#define __PAGE_KERNEL_NOCACHE (__PAGE_KERNEL | _PAGE_PCD | _PAGE_PWT) -#define __PAGE_KERNEL_UC_MINUS (__PAGE_KERNEL | _PAGE_PCD) -#define __PAGE_KERNEL_VSYSCALL (__PAGE_KERNEL_RX | _PAGE_USER) -#define __PAGE_KERNEL_VSYSCALL_NOCACHE (__PAGE_KERNEL_VSYSCALL | _PAGE_PCD | _PAGE_PWT) -#define __PAGE_KERNEL_LARGE (__PAGE_KERNEL | _PAGE_PSE) -#define __PAGE_KERNEL_LARGE_NOCACHE (__PAGE_KERNEL | _PAGE_CACHE_UC | _PAGE_PSE) -#define __PAGE_KERNEL_LARGE_EXEC (__PAGE_KERNEL_EXEC | _PAGE_PSE) - -#define __PAGE_KERNEL_IO (__PAGE_KERNEL | _PAGE_IOMAP) -#define __PAGE_KERNEL_IO_NOCACHE (__PAGE_KERNEL_NOCACHE | _PAGE_IOMAP) -#define __PAGE_KERNEL_IO_UC_MINUS (__PAGE_KERNEL_UC_MINUS | _PAGE_IOMAP) -#define __PAGE_KERNEL_IO_WC (__PAGE_KERNEL_WC | _PAGE_IOMAP) - -#define PAGE_KERNEL __pgprot(__PAGE_KERNEL) -#define PAGE_KERNEL_RO __pgprot(__PAGE_KERNEL_RO) -#define PAGE_KERNEL_EXEC __pgprot(__PAGE_KERNEL_EXEC) -#define PAGE_KERNEL_RX __pgprot(__PAGE_KERNEL_RX) -#define PAGE_KERNEL_WC __pgprot(__PAGE_KERNEL_WC) -#define PAGE_KERNEL_NOCACHE __pgprot(__PAGE_KERNEL_NOCACHE) -#define PAGE_KERNEL_UC_MINUS __pgprot(__PAGE_KERNEL_UC_MINUS) -#define PAGE_KERNEL_EXEC_NOCACHE __pgprot(__PAGE_KERNEL_EXEC_NOCACHE) -#define PAGE_KERNEL_LARGE __pgprot(__PAGE_KERNEL_LARGE) -#define PAGE_KERNEL_LARGE_NOCACHE __pgprot(__PAGE_KERNEL_LARGE_NOCACHE) -#define PAGE_KERNEL_LARGE_EXEC __pgprot(__PAGE_KERNEL_LARGE_EXEC) -#define PAGE_KERNEL_VSYSCALL __pgprot(__PAGE_KERNEL_VSYSCALL) -#define PAGE_KERNEL_VSYSCALL_NOCACHE __pgprot(__PAGE_KERNEL_VSYSCALL_NOCACHE) - -#define PAGE_KERNEL_IO __pgprot(__PAGE_KERNEL_IO) -#define PAGE_KERNEL_IO_NOCACHE __pgprot(__PAGE_KERNEL_IO_NOCACHE) -#define PAGE_KERNEL_IO_UC_MINUS __pgprot(__PAGE_KERNEL_IO_UC_MINUS) -#define PAGE_KERNEL_IO_WC __pgprot(__PAGE_KERNEL_IO_WC) - -/* xwr */ -#define __P000 PAGE_NONE -#define __P001 PAGE_READONLY -#define __P010 PAGE_COPY -#define __P011 PAGE_COPY -#define __P100 PAGE_READONLY_EXEC -#define __P101 PAGE_READONLY_EXEC -#define __P110 PAGE_COPY_EXEC -#define __P111 PAGE_COPY_EXEC - -#define __S000 PAGE_NONE -#define __S001 PAGE_READONLY -#define __S010 PAGE_SHARED -#define __S011 PAGE_SHARED -#define __S100 PAGE_READONLY_EXEC -#define __S101 PAGE_READONLY_EXEC -#define __S110 PAGE_SHARED_EXEC -#define __S111 PAGE_SHARED_EXEC - -/* - * early identity mapping pte attrib macros. - */ -#ifdef CONFIG_X86_64 -#define __PAGE_KERNEL_IDENT_LARGE_EXEC __PAGE_KERNEL_LARGE_EXEC -#else -/* - * For PDE_IDENT_ATTR include USER bit. As the PDE and PTE protection - * bits are combined, this will alow user to access the high address mapped - * VDSO in the presence of CONFIG_COMPAT_VDSO - */ -#define PTE_IDENT_ATTR 0x003 /* PRESENT+RW */ -#define PDE_IDENT_ATTR 0x067 /* PRESENT+RW+USER+DIRTY+ACCESSED */ -#define PGD_IDENT_ATTR 0x001 /* PRESENT (no other attributes) */ -#endif +#include /* * Macro to mark a page protection value as UC- @@ -170,9 +15,6 @@ #ifndef __ASSEMBLY__ -#define pgprot_writecombine pgprot_writecombine -extern pgprot_t pgprot_writecombine(pgprot_t prot); - /* * ZERO_PAGE is a global shared page that is always zero: used * for zero-mapped memory areas etc.. @@ -183,6 +25,66 @@ extern unsigned long empty_zero_page[PAG extern spinlock_t pgd_lock; extern struct list_head pgd_list; +#ifdef CONFIG_PARAVIRT +#include +#else /* !CONFIG_PARAVIRT */ +#define set_pte(ptep, pte) native_set_pte(ptep, pte) +#define set_pte_at(mm, addr, ptep, pte) native_set_pte_at(mm, addr, ptep, pte) + +#define set_pte_present(mm, addr, ptep, pte) \ + native_set_pte_present(mm, addr, ptep, pte) +#define set_pte_atomic(ptep, pte) \ + native_set_pte_atomic(ptep, pte) + +#define set_pmd(pmdp, pmd) native_set_pmd(pmdp, pmd) + +#ifndef __PAGETABLE_PUD_FOLDED +#define set_pgd(pgdp, pgd) native_set_pgd(pgdp, pgd) +#define pgd_clear(pgd) native_pgd_clear(pgd) +#endif + +#ifndef set_pud +# define set_pud(pudp, pud) native_set_pud(pudp, pud) +#endif + +#ifndef __PAGETABLE_PMD_FOLDED +#define pud_clear(pud) native_pud_clear(pud) +#endif + +#define pte_clear(mm, addr, ptep) native_pte_clear(mm, addr, ptep) +#define pmd_clear(pmd) native_pmd_clear(pmd) + +#define pte_update(mm, addr, ptep) do { } while (0) +#define pte_update_defer(mm, addr, ptep) do { } while (0) + +static inline void __init paravirt_pagetable_setup_start(pgd_t *base) +{ + native_pagetable_setup_start(base); +} + +static inline void __init paravirt_pagetable_setup_done(pgd_t *base) +{ + native_pagetable_setup_done(base); +} + +#define pgd_val(x) native_pgd_val(x) +#define __pgd(x) native_make_pgd(x) + +#ifndef __PAGETABLE_PUD_FOLDED +#define pud_val(x) native_pud_val(x) +#define __pud(x) native_make_pud(x) +#endif + +#ifndef __PAGETABLE_PMD_FOLDED +#define pmd_val(x) native_pmd_val(x) +#define __pmd(x) native_make_pmd(x) +#endif + +#define pte_val(x) native_pte_val(x) +#define __pte(x) native_make_pte(x) + +#endif /* CONFIG_PARAVIRT */ + /* * The following only work if pte_present() is true. * Undefined behaviour if not.. @@ -236,72 +138,84 @@ static inline unsigned long pte_pfn(pte_ static inline int pmd_large(pmd_t pte) { - return (pmd_val(pte) & (_PAGE_PSE | _PAGE_PRESENT)) == + return (pmd_flags(pte) & (_PAGE_PSE | _PAGE_PRESENT)) == (_PAGE_PSE | _PAGE_PRESENT); } +static inline pte_t pte_set_flags(pte_t pte, pteval_t set) +{ + pteval_t v = native_pte_val(pte); + + return native_make_pte(v | set); +} + +static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear) +{ + pteval_t v = native_pte_val(pte); + + return native_make_pte(v & ~clear); +} + static inline pte_t pte_mkclean(pte_t pte) { - return __pte(pte_val(pte) & ~_PAGE_DIRTY); + return pte_clear_flags(pte, _PAGE_DIRTY); } static inline pte_t pte_mkold(pte_t pte) { - return __pte(pte_val(pte) & ~_PAGE_ACCESSED); + return pte_clear_flags(pte, _PAGE_ACCESSED); } static inline pte_t pte_wrprotect(pte_t pte) { - return __pte(pte_val(pte) & ~_PAGE_RW); + return pte_clear_flags(pte, _PAGE_RW); } static inline pte_t pte_mkexec(pte_t pte) { - return __pte(pte_val(pte) & ~_PAGE_NX); + return pte_clear_flags(pte, _PAGE_NX); } static inline pte_t pte_mkdirty(pte_t pte) { - return __pte(pte_val(pte) | _PAGE_DIRTY); + return pte_set_flags(pte, _PAGE_DIRTY); } static inline pte_t pte_mkyoung(pte_t pte) { - return __pte(pte_val(pte) | _PAGE_ACCESSED); + return pte_set_flags(pte, _PAGE_ACCESSED); } static inline pte_t pte_mkwrite(pte_t pte) { - return __pte(pte_val(pte) | _PAGE_RW); + return pte_set_flags(pte, _PAGE_RW); } static inline pte_t pte_mkhuge(pte_t pte) { - return __pte(pte_val(pte) | _PAGE_PSE); + return pte_set_flags(pte, _PAGE_PSE); } static inline pte_t pte_clrhuge(pte_t pte) { - return __pte(pte_val(pte) & ~_PAGE_PSE); + return pte_clear_flags(pte, _PAGE_PSE); } static inline pte_t pte_mkglobal(pte_t pte) { - return __pte(pte_val(pte) | _PAGE_GLOBAL); + return pte_set_flags(pte, _PAGE_GLOBAL); } static inline pte_t pte_clrglobal(pte_t pte) { - return __pte(pte_val(pte) & ~_PAGE_GLOBAL); + return pte_clear_flags(pte, _PAGE_GLOBAL); } static inline pte_t pte_mkspecial(pte_t pte) { - return __pte(pte_val(pte) | _PAGE_SPECIAL); + return pte_set_flags(pte, _PAGE_SPECIAL); } -extern pteval_t __supported_pte_mask; - /* * Mask out unsupported bits in a present pgprot. Non-present pgprots * can use those bits for other purposes, so leave them be. @@ -374,82 +288,200 @@ static inline int is_new_memtype_allowed return 1; } -#ifndef __ASSEMBLY__ -/* Indicate that x86 has its own track and untrack pfn vma functions */ -#define __HAVE_PFNMAP_TRACKING - -#define __HAVE_PHYS_MEM_ACCESS_PROT -struct file; -pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn, - unsigned long size, pgprot_t vma_prot); -int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn, - unsigned long size, pgprot_t *vma_prot); -#endif - -/* Install a pte for a particular vaddr in kernel space. */ -void set_pte_vaddr(unsigned long vaddr, pte_t pte); +#endif /* __ASSEMBLY__ */ #ifdef CONFIG_X86_32 -extern void native_pagetable_setup_start(pgd_t *base); -extern void native_pagetable_setup_done(pgd_t *base); +# include "pgtable_32.h" #else -static inline void native_pagetable_setup_start(pgd_t *base) {} -static inline void native_pagetable_setup_done(pgd_t *base) {} +# include "pgtable_64.h" #endif -struct seq_file; -extern void arch_report_meminfo(struct seq_file *m); +#ifndef __ASSEMBLY__ +#include -#ifdef CONFIG_PARAVIRT -#include -#else /* !CONFIG_PARAVIRT */ -#define set_pte(ptep, pte) native_set_pte(ptep, pte) -#define set_pte_at(mm, addr, ptep, pte) native_set_pte_at(mm, addr, ptep, pte) +static inline int pte_none(pte_t pte) +{ + return !pte.pte; +} -#define set_pte_present(mm, addr, ptep, pte) \ - native_set_pte_present(mm, addr, ptep, pte) -#define set_pte_atomic(ptep, pte) \ - native_set_pte_atomic(ptep, pte) +#define __HAVE_ARCH_PTE_SAME +static inline int pte_same(pte_t a, pte_t b) +{ + return a.pte == b.pte; +} -#define set_pmd(pmdp, pmd) native_set_pmd(pmdp, pmd) +static inline int pte_present(pte_t a) +{ + return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE); +} -#ifndef __PAGETABLE_PUD_FOLDED -#define set_pgd(pgdp, pgd) native_set_pgd(pgdp, pgd) -#define pgd_clear(pgd) native_pgd_clear(pgd) -#endif +static inline int pmd_present(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_PRESENT; +} -#ifndef set_pud -# define set_pud(pudp, pud) native_set_pud(pudp, pud) -#endif +static inline int pmd_none(pmd_t pmd) +{ + /* Only check low word on 32-bit platforms, since it might be + out of sync with upper half. */ + return (unsigned long)native_pmd_val(pmd) == 0; +} -#ifndef __PAGETABLE_PMD_FOLDED -#define pud_clear(pud) native_pud_clear(pud) -#endif +static inline unsigned long pmd_page_vaddr(pmd_t pmd) +{ + return (unsigned long)__va(pmd_val(pmd) & PTE_PFN_MASK); +} -#define pte_clear(mm, addr, ptep) native_pte_clear(mm, addr, ptep) -#define pmd_clear(pmd) native_pmd_clear(pmd) +/* + * Currently stuck as a macro due to indirect forward reference to + * linux/mmzone.h's __section_mem_map_addr() definition: + */ +#define pmd_page(pmd) pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT) -#define pte_update(mm, addr, ptep) do { } while (0) -#define pte_update_defer(mm, addr, ptep) do { } while (0) +/* + * the pmd page can be thought of an array like this: pmd_t[PTRS_PER_PMD] + * + * this macro returns the index of the entry in the pmd page which would + * control the given virtual address + */ +static inline unsigned pmd_index(unsigned long address) +{ + return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1); +} -static inline void __init paravirt_pagetable_setup_start(pgd_t *base) +/* + * Conversion functions: convert a page and protection to a page entry, + * and a page entry and page directory to the page they refer to. + * + * (Currently stuck as a macro because of indirect forward reference + * to linux/mm.h:page_to_nid()) + */ +#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) + +/* + * the pte page can be thought of an array like this: pte_t[PTRS_PER_PTE] + * + * this function returns the index of the entry in the pte page which would + * control the given virtual address + */ +static inline unsigned pte_index(unsigned long address) { - native_pagetable_setup_start(base); + return (address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1); } -static inline void __init paravirt_pagetable_setup_done(pgd_t *base) +static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address) { - native_pagetable_setup_done(base); + return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address); } -#endif /* CONFIG_PARAVIRT */ -#endif /* __ASSEMBLY__ */ +static inline int pmd_bad(pmd_t pmd) +{ + return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE; +} -#ifdef CONFIG_X86_32 -# include "pgtable_32.h" +static inline unsigned long pages_to_mb(unsigned long npg) +{ + return npg >> (20 - PAGE_SHIFT); +} + +#define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \ + remap_pfn_range(vma, vaddr, pfn, size, prot) + +#if PAGETABLE_LEVELS > 2 +static inline int pud_none(pud_t pud) +{ + return native_pud_val(pud) == 0; +} + +static inline int pud_present(pud_t pud) +{ + return pud_flags(pud) & _PAGE_PRESENT; +} + +static inline unsigned long pud_page_vaddr(pud_t pud) +{ + return (unsigned long)__va((unsigned long)pud_val(pud) & PTE_PFN_MASK); +} + +/* + * Currently stuck as a macro due to indirect forward reference to + * linux/mmzone.h's __section_mem_map_addr() definition: + */ +#define pud_page(pud) pfn_to_page(pud_val(pud) >> PAGE_SHIFT) + +/* Find an entry in the second-level page table.. */ +static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address) +{ + return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address); +} + +static inline unsigned long pmd_pfn(pmd_t pmd) +{ + return (pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT; +} + +static inline int pud_large(pud_t pud) +{ + return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) == + (_PAGE_PSE | _PAGE_PRESENT); +} + +static inline int pud_bad(pud_t pud) +{ + return (pud_flags(pud) & ~(_KERNPG_TABLE | _PAGE_USER)) != 0; +} #else -# include "pgtable_64.h" -#endif +static inline int pud_large(pud_t pud) +{ + return 0; +} +#endif /* PAGETABLE_LEVELS > 2 */ + +#if PAGETABLE_LEVELS > 3 +static inline int pgd_present(pgd_t pgd) +{ + return pgd_flags(pgd) & _PAGE_PRESENT; +} + +static inline unsigned long pgd_page_vaddr(pgd_t pgd) +{ + return (unsigned long)__va((unsigned long)pgd_val(pgd) & PTE_PFN_MASK); +} + +/* + * Currently stuck as a macro due to indirect forward reference to + * linux/mmzone.h's __section_mem_map_addr() definition: + */ +#define pgd_page(pgd) pfn_to_page(pgd_val(pgd) >> PAGE_SHIFT) + +/* to find an entry in a page-table-directory. */ +static inline unsigned pud_index(unsigned long address) +{ + return (address >> PUD_SHIFT) & (PTRS_PER_PUD - 1); +} + +static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address) +{ + return (pud_t *)pgd_page_vaddr(*pgd) + pud_index(address); +} + +static inline int pgd_bad(pgd_t pgd) +{ + return (pgd_flags(pgd) & ~_PAGE_USER) != _KERNPG_TABLE; +} + +static inline int pgd_none(pgd_t pgd) +{ + return !native_pgd_val(pgd); +} +#endif /* PAGETABLE_LEVELS > 3 */ + +static inline int pte_hidden(pte_t pte) +{ + return pte_flags(pte) & _PAGE_HIDDEN; +} + +#endif /* __ASSEMBLY__ */ /* * the pgd page can be thought of an array like this: pgd_t[PTRS_PER_PGD] @@ -476,28 +508,6 @@ static inline void __init paravirt_paget #ifndef __ASSEMBLY__ -enum { - PG_LEVEL_NONE, - PG_LEVEL_4K, - PG_LEVEL_2M, - PG_LEVEL_1G, - PG_LEVEL_NUM -}; - -#ifdef CONFIG_PROC_FS -extern void update_page_count(int level, unsigned long pages); -#else -static inline void update_page_count(int level, unsigned long pages) { } -#endif - -/* - * Helper function that returns the kernel pagetable entry controlling - * the virtual address 'address'. NULL means no pagetable entry present. - * NOTE: the return type is pte_t but if the pmd is PSE then we return it - * as a pte too. - */ -extern pte_t *lookup_address(unsigned long address, unsigned int *level); - /* local pte updates need not use xchg for locking */ static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep) { Index: linux-2.6-tip/arch/x86/include/asm/pgtable_32.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/pgtable_32.h +++ linux-2.6-tip/arch/x86/include/asm/pgtable_32.h @@ -1,6 +1,7 @@ #ifndef _ASM_X86_PGTABLE_32_H #define _ASM_X86_PGTABLE_32_H +#include /* * The Linux memory management assumes a three-level page table setup. On @@ -33,47 +34,6 @@ void paging_init(void); extern void set_pmd_pfn(unsigned long, unsigned long, pgprot_t); -/* - * The Linux x86 paging architecture is 'compile-time dual-mode', it - * implements both the traditional 2-level x86 page tables and the - * newer 3-level PAE-mode page tables. - */ -#ifdef CONFIG_X86_PAE -# include -# define PMD_SIZE (1UL << PMD_SHIFT) -# define PMD_MASK (~(PMD_SIZE - 1)) -#else -# include -#endif - -#define PGDIR_SIZE (1UL << PGDIR_SHIFT) -#define PGDIR_MASK (~(PGDIR_SIZE - 1)) - -/* Just any arbitrary offset to the start of the vmalloc VM area: the - * current 8MB value just means that there will be a 8MB "hole" after the - * physical memory until the kernel virtual memory starts. That means that - * any out-of-bounds memory accesses will hopefully be caught. - * The vmalloc() routines leaves a hole of 4kB between each vmalloced - * area for the same reason. ;) - */ -#define VMALLOC_OFFSET (8 * 1024 * 1024) -#define VMALLOC_START ((unsigned long)high_memory + VMALLOC_OFFSET) -#ifdef CONFIG_X86_PAE -#define LAST_PKMAP 512 -#else -#define LAST_PKMAP 1024 -#endif - -#define PKMAP_BASE ((FIXADDR_BOOT_START - PAGE_SIZE * (LAST_PKMAP + 1)) \ - & PMD_MASK) - -#ifdef CONFIG_HIGHMEM -# define VMALLOC_END (PKMAP_BASE - 2 * PAGE_SIZE) -#else -# define VMALLOC_END (FIXADDR_START - 2 * PAGE_SIZE) -#endif - -#define MAXMEM (VMALLOC_END - PAGE_OFFSET - __VMALLOC_RESERVE) /* * Define this if things work differently on an i386 and an i486: @@ -85,55 +45,12 @@ extern void set_pmd_pfn(unsigned long, u /* The boot page tables (all created as a single array) */ extern unsigned long pg0[]; -#define pte_present(x) ((x).pte_low & (_PAGE_PRESENT | _PAGE_PROTNONE)) - -/* To avoid harmful races, pmd_none(x) should check only the lower when PAE */ -#define pmd_none(x) (!(unsigned long)pmd_val((x))) -#define pmd_present(x) (pmd_val((x)) & _PAGE_PRESENT) -#define pmd_bad(x) ((pmd_val(x) & (PTE_FLAGS_MASK & ~_PAGE_USER)) != _KERNPG_TABLE) - -#define pages_to_mb(x) ((x) >> (20-PAGE_SHIFT)) - #ifdef CONFIG_X86_PAE # include #else # include #endif -/* - * Conversion functions: convert a page and protection to a page entry, - * and a page entry and page directory to the page they refer to. - */ -#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) - - -static inline int pud_large(pud_t pud) { return 0; } - -/* - * the pmd page can be thought of an array like this: pmd_t[PTRS_PER_PMD] - * - * this macro returns the index of the entry in the pmd page which would - * control the given virtual address - */ -#define pmd_index(address) \ - (((address) >> PMD_SHIFT) & (PTRS_PER_PMD - 1)) - -/* - * the pte page can be thought of an array like this: pte_t[PTRS_PER_PTE] - * - * this macro returns the index of the entry in the pte page which would - * control the given virtual address - */ -#define pte_index(address) \ - (((address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) -#define pte_offset_kernel(dir, address) \ - ((pte_t *)pmd_page_vaddr(*(dir)) + pte_index((address))) - -#define pmd_page(pmd) (pfn_to_page(pmd_val((pmd)) >> PAGE_SHIFT)) - -#define pmd_page_vaddr(pmd) \ - ((unsigned long)__va(pmd_val((pmd)) & PTE_PFN_MASK)) - #if defined(CONFIG_HIGHPTE) #define pte_offset_map(dir, address) \ ((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE0) + \ @@ -141,14 +58,20 @@ static inline int pud_large(pud_t pud) { #define pte_offset_map_nested(dir, address) \ ((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE1) + \ pte_index((address))) +#define pte_offset_map_direct(dir, address) \ + ((pte_t *)kmap_atomic_pte_direct(pmd_page(*(dir)), KM_PTE0) + \ + pte_index((address))) #define pte_unmap(pte) kunmap_atomic((pte), KM_PTE0) #define pte_unmap_nested(pte) kunmap_atomic((pte), KM_PTE1) +#define pte_unmap_direct(pte) kunmap_atomic_direct((pte), KM_PTE0) #else #define pte_offset_map(dir, address) \ ((pte_t *)page_address(pmd_page(*(dir))) + pte_index((address))) #define pte_offset_map_nested(dir, address) pte_offset_map((dir), (address)) +#define pte_offset_map_direct(dir, address) pte_offset_map((dir), (address)) #define pte_unmap(pte) do { } while (0) #define pte_unmap_nested(pte) do { } while (0) +#define pte_unmap_direct(pte) do { } while (0) #endif /* Clear a kernel PTE and flush it from the TLB */ @@ -176,7 +99,4 @@ do { \ #define kern_addr_valid(kaddr) (0) #endif -#define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \ - remap_pfn_range(vma, vaddr, pfn, size, prot) - #endif /* _ASM_X86_PGTABLE_32_H */ Index: linux-2.6-tip/arch/x86/include/asm/pgtable_32_types.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/pgtable_32_types.h @@ -0,0 +1,46 @@ +#ifndef _ASM_X86_PGTABLE_32_DEFS_H +#define _ASM_X86_PGTABLE_32_DEFS_H + +/* + * The Linux x86 paging architecture is 'compile-time dual-mode', it + * implements both the traditional 2-level x86 page tables and the + * newer 3-level PAE-mode page tables. + */ +#ifdef CONFIG_X86_PAE +# include +# define PMD_SIZE (1UL << PMD_SHIFT) +# define PMD_MASK (~(PMD_SIZE - 1)) +#else +# include +#endif + +#define PGDIR_SIZE (1UL << PGDIR_SHIFT) +#define PGDIR_MASK (~(PGDIR_SIZE - 1)) + +/* Just any arbitrary offset to the start of the vmalloc VM area: the + * current 8MB value just means that there will be a 8MB "hole" after the + * physical memory until the kernel virtual memory starts. That means that + * any out-of-bounds memory accesses will hopefully be caught. + * The vmalloc() routines leaves a hole of 4kB between each vmalloced + * area for the same reason. ;) + */ +#define VMALLOC_OFFSET (8 * 1024 * 1024) +#define VMALLOC_START ((unsigned long)high_memory + VMALLOC_OFFSET) +#ifdef CONFIG_X86_PAE +#define LAST_PKMAP 512 +#else +#define LAST_PKMAP 1024 +#endif + +#define PKMAP_BASE ((FIXADDR_BOOT_START - PAGE_SIZE * (LAST_PKMAP + 1)) \ + & PMD_MASK) + +#ifdef CONFIG_HIGHMEM +# define VMALLOC_END (PKMAP_BASE - 2 * PAGE_SIZE) +#else +# define VMALLOC_END (FIXADDR_START - 2 * PAGE_SIZE) +#endif + +#define MAXMEM (VMALLOC_END - PAGE_OFFSET - __VMALLOC_RESERVE) + +#endif /* _ASM_X86_PGTABLE_32_DEFS_H */ Index: linux-2.6-tip/arch/x86/include/asm/pgtable_64.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/pgtable_64.h +++ linux-2.6-tip/arch/x86/include/asm/pgtable_64.h @@ -2,6 +2,8 @@ #define _ASM_X86_PGTABLE_64_H #include +#include + #ifndef __ASSEMBLY__ /* @@ -11,7 +13,6 @@ #include #include #include -#include extern pud_t level3_kernel_pgt[512]; extern pud_t level3_ident_pgt[512]; @@ -26,32 +27,6 @@ extern void paging_init(void); #endif /* !__ASSEMBLY__ */ -#define SHARED_KERNEL_PMD 0 - -/* - * PGDIR_SHIFT determines what a top-level page table entry can map - */ -#define PGDIR_SHIFT 39 -#define PTRS_PER_PGD 512 - -/* - * 3rd level page - */ -#define PUD_SHIFT 30 -#define PTRS_PER_PUD 512 - -/* - * PMD_SHIFT determines the size of the area a middle-level - * page table can map - */ -#define PMD_SHIFT 21 -#define PTRS_PER_PMD 512 - -/* - * entries per page directory level - */ -#define PTRS_PER_PTE 512 - #ifndef __ASSEMBLY__ #define pte_ERROR(e) \ @@ -67,9 +42,6 @@ extern void paging_init(void); printk("%s:%d: bad pgd %p(%016lx).\n", \ __FILE__, __LINE__, &(e), pgd_val(e)) -#define pgd_none(x) (!pgd_val(x)) -#define pud_none(x) (!pud_val(x)) - struct mm_struct; void set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte); @@ -134,48 +106,6 @@ static inline void native_pgd_clear(pgd_ native_set_pgd(pgd, native_make_pgd(0)); } -#define pte_same(a, b) ((a).pte == (b).pte) - -#endif /* !__ASSEMBLY__ */ - -#define PMD_SIZE (_AC(1, UL) << PMD_SHIFT) -#define PMD_MASK (~(PMD_SIZE - 1)) -#define PUD_SIZE (_AC(1, UL) << PUD_SHIFT) -#define PUD_MASK (~(PUD_SIZE - 1)) -#define PGDIR_SIZE (_AC(1, UL) << PGDIR_SHIFT) -#define PGDIR_MASK (~(PGDIR_SIZE - 1)) - - -#define MAXMEM _AC(__AC(1, UL) << MAX_PHYSMEM_BITS, UL) -#define VMALLOC_START _AC(0xffffc20000000000, UL) -#define VMALLOC_END _AC(0xffffe1ffffffffff, UL) -#define VMEMMAP_START _AC(0xffffe20000000000, UL) -#define MODULES_VADDR _AC(0xffffffffa0000000, UL) -#define MODULES_END _AC(0xffffffffff000000, UL) -#define MODULES_LEN (MODULES_END - MODULES_VADDR) - -#ifndef __ASSEMBLY__ - -static inline int pgd_bad(pgd_t pgd) -{ - return (pgd_val(pgd) & ~(PTE_PFN_MASK | _PAGE_USER)) != _KERNPG_TABLE; -} - -static inline int pud_bad(pud_t pud) -{ - return (pud_val(pud) & ~(PTE_PFN_MASK | _PAGE_USER)) != _KERNPG_TABLE; -} - -static inline int pmd_bad(pmd_t pmd) -{ - return (pmd_val(pmd) & ~(PTE_PFN_MASK | _PAGE_USER)) != _KERNPG_TABLE; -} - -#define pte_none(x) (!pte_val((x))) -#define pte_present(x) (pte_val((x)) & (_PAGE_PRESENT | _PAGE_PROTNONE)) - -#define pages_to_mb(x) ((x) >> (20 - PAGE_SHIFT)) /* FIXME: is this right? */ - /* * Conversion functions: convert a page and protection to a page entry, * and a page entry and page directory to the page they refer to. @@ -184,41 +114,12 @@ static inline int pmd_bad(pmd_t pmd) /* * Level 4 access. */ -#define pgd_page_vaddr(pgd) \ - ((unsigned long)__va((unsigned long)pgd_val((pgd)) & PTE_PFN_MASK)) -#define pgd_page(pgd) (pfn_to_page(pgd_val((pgd)) >> PAGE_SHIFT)) -#define pgd_present(pgd) (pgd_val(pgd) & _PAGE_PRESENT) static inline int pgd_large(pgd_t pgd) { return 0; } #define mk_kernel_pgd(address) __pgd((address) | _KERNPG_TABLE) /* PUD - Level3 access */ -/* to find an entry in a page-table-directory. */ -#define pud_page_vaddr(pud) \ - ((unsigned long)__va(pud_val((pud)) & PHYSICAL_PAGE_MASK)) -#define pud_page(pud) (pfn_to_page(pud_val((pud)) >> PAGE_SHIFT)) -#define pud_index(address) (((address) >> PUD_SHIFT) & (PTRS_PER_PUD - 1)) -#define pud_offset(pgd, address) \ - ((pud_t *)pgd_page_vaddr(*(pgd)) + pud_index((address))) -#define pud_present(pud) (pud_val((pud)) & _PAGE_PRESENT) - -static inline int pud_large(pud_t pte) -{ - return (pud_val(pte) & (_PAGE_PSE | _PAGE_PRESENT)) == - (_PAGE_PSE | _PAGE_PRESENT); -} /* PMD - Level 2 access */ -#define pmd_page_vaddr(pmd) ((unsigned long) __va(pmd_val((pmd)) & PTE_PFN_MASK)) -#define pmd_page(pmd) (pfn_to_page(pmd_val((pmd)) >> PAGE_SHIFT)) - -#define pmd_index(address) (((address) >> PMD_SHIFT) & (PTRS_PER_PMD - 1)) -#define pmd_offset(dir, address) ((pmd_t *)pud_page_vaddr(*(dir)) + \ - pmd_index(address)) -#define pmd_none(x) (!pmd_val((x))) -#define pmd_present(x) (pmd_val((x)) & _PAGE_PRESENT) -#define pfn_pmd(nr, prot) (__pmd(((nr) << PAGE_SHIFT) | pgprot_val((prot)))) -#define pmd_pfn(x) ((pmd_val((x)) & __PHYSICAL_MASK) >> PAGE_SHIFT) - #define pte_to_pgoff(pte) ((pte_val((pte)) & PHYSICAL_PAGE_MASK) >> PAGE_SHIFT) #define pgoff_to_pte(off) ((pte_t) { .pte = ((off) << PAGE_SHIFT) | \ _PAGE_FILE }) @@ -226,18 +127,13 @@ static inline int pud_large(pud_t pte) /* PTE - Level 1 access. */ -/* page, protection -> pte */ -#define mk_pte(page, pgprot) pfn_pte(page_to_pfn((page)), (pgprot)) - -#define pte_index(address) (((address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) -#define pte_offset_kernel(dir, address) ((pte_t *) pmd_page_vaddr(*(dir)) + \ - pte_index((address))) - /* x86-64 always has all page tables mapped. */ #define pte_offset_map(dir, address) pte_offset_kernel((dir), (address)) #define pte_offset_map_nested(dir, address) pte_offset_kernel((dir), (address)) -#define pte_unmap(pte) /* NOP */ -#define pte_unmap_nested(pte) /* NOP */ +#define pte_offset_map_direct(dir, address) pte_offset_kernel((dir), (address)) +#define pte_unmap(pte) do { } while (0) +#define pte_unmap_nested(pte) do { } while (0) +#define pte_unmap_direct(pte) do { } while (0) #define update_mmu_cache(vma, address, pte) do { } while (0) @@ -266,9 +162,6 @@ extern int direct_gbpages; extern int kern_addr_valid(unsigned long addr); extern void cleanup_highmap(void); -#define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \ - remap_pfn_range(vma, vaddr, pfn, size, prot) - #define HAVE_ARCH_UNMAPPED_AREA #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN Index: linux-2.6-tip/arch/x86/include/asm/pgtable_64_types.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/pgtable_64_types.h @@ -0,0 +1,63 @@ +#ifndef _ASM_X86_PGTABLE_64_DEFS_H +#define _ASM_X86_PGTABLE_64_DEFS_H + +#ifndef __ASSEMBLY__ +#include + +/* + * These are used to make use of C type-checking.. + */ +typedef unsigned long pteval_t; +typedef unsigned long pmdval_t; +typedef unsigned long pudval_t; +typedef unsigned long pgdval_t; +typedef unsigned long pgprotval_t; + +typedef struct { pteval_t pte; } pte_t; + +#endif /* !__ASSEMBLY__ */ + +#define SHARED_KERNEL_PMD 0 +#define PAGETABLE_LEVELS 4 + +/* + * PGDIR_SHIFT determines what a top-level page table entry can map + */ +#define PGDIR_SHIFT 39 +#define PTRS_PER_PGD 512 + +/* + * 3rd level page + */ +#define PUD_SHIFT 30 +#define PTRS_PER_PUD 512 + +/* + * PMD_SHIFT determines the size of the area a middle-level + * page table can map + */ +#define PMD_SHIFT 21 +#define PTRS_PER_PMD 512 + +/* + * entries per page directory level + */ +#define PTRS_PER_PTE 512 + +#define PMD_SIZE (_AC(1, UL) << PMD_SHIFT) +#define PMD_MASK (~(PMD_SIZE - 1)) +#define PUD_SIZE (_AC(1, UL) << PUD_SHIFT) +#define PUD_MASK (~(PUD_SIZE - 1)) +#define PGDIR_SIZE (_AC(1, UL) << PGDIR_SHIFT) +#define PGDIR_MASK (~(PGDIR_SIZE - 1)) + + +#define MAXMEM _AC(__AC(1, UL) << MAX_PHYSMEM_BITS, UL) +#define VMALLOC_START _AC(0xffffc20000000000, UL) +#define VMALLOC_END _AC(0xffffe1ffffffffff, UL) +#define VMEMMAP_START _AC(0xffffe20000000000, UL) +#define MODULES_VADDR _AC(0xffffffffa0000000, UL) +#define MODULES_END _AC(0xffffffffff000000, UL) +#define MODULES_LEN (MODULES_END - MODULES_VADDR) + +#endif /* _ASM_X86_PGTABLE_64_DEFS_H */ Index: linux-2.6-tip/arch/x86/include/asm/pgtable_types.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/pgtable_types.h @@ -0,0 +1,333 @@ +#ifndef _ASM_X86_PGTABLE_DEFS_H +#define _ASM_X86_PGTABLE_DEFS_H + +#include +#include + +#define FIRST_USER_ADDRESS 0 + +#define _PAGE_BIT_PRESENT 0 /* is present */ +#define _PAGE_BIT_RW 1 /* writeable */ +#define _PAGE_BIT_USER 2 /* userspace addressable */ +#define _PAGE_BIT_PWT 3 /* page write through */ +#define _PAGE_BIT_PCD 4 /* page cache disabled */ +#define _PAGE_BIT_ACCESSED 5 /* was accessed (raised by CPU) */ +#define _PAGE_BIT_DIRTY 6 /* was written to (raised by CPU) */ +#define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page */ +#define _PAGE_BIT_PAT 7 /* on 4KB pages */ +#define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */ +#define _PAGE_BIT_UNUSED1 9 /* available for programmer */ +#define _PAGE_BIT_IOMAP 10 /* flag used to indicate IO mapping */ +#define _PAGE_BIT_HIDDEN 11 /* hidden by kmemcheck */ +#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ +#define _PAGE_BIT_SPECIAL _PAGE_BIT_UNUSED1 +#define _PAGE_BIT_CPA_TEST _PAGE_BIT_UNUSED1 +#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */ + +/* If _PAGE_BIT_PRESENT is clear, we use these: */ +/* - if the user mapped it with PROT_NONE; pte_present gives true */ +#define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL +/* - set: nonlinear file mapping, saved PTE; unset:swap */ +#define _PAGE_BIT_FILE _PAGE_BIT_DIRTY + +#define _PAGE_PRESENT (_AT(pteval_t, 1) << _PAGE_BIT_PRESENT) +#define _PAGE_RW (_AT(pteval_t, 1) << _PAGE_BIT_RW) +#define _PAGE_USER (_AT(pteval_t, 1) << _PAGE_BIT_USER) +#define _PAGE_PWT (_AT(pteval_t, 1) << _PAGE_BIT_PWT) +#define _PAGE_PCD (_AT(pteval_t, 1) << _PAGE_BIT_PCD) +#define _PAGE_ACCESSED (_AT(pteval_t, 1) << _PAGE_BIT_ACCESSED) +#define _PAGE_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_DIRTY) +#define _PAGE_PSE (_AT(pteval_t, 1) << _PAGE_BIT_PSE) +#define _PAGE_GLOBAL (_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL) +#define _PAGE_UNUSED1 (_AT(pteval_t, 1) << _PAGE_BIT_UNUSED1) +#define _PAGE_IOMAP (_AT(pteval_t, 1) << _PAGE_BIT_IOMAP) +#define _PAGE_PAT (_AT(pteval_t, 1) << _PAGE_BIT_PAT) +#define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE) +#define _PAGE_SPECIAL (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL) +#define _PAGE_CPA_TEST (_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST) +#define __HAVE_ARCH_PTE_SPECIAL + +#ifdef CONFIG_KMEMCHECK +#define _PAGE_HIDDEN (_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN) +#else +#define _PAGE_HIDDEN (_AT(pteval_t, 0)) +#endif + +#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) +#define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX) +#else +#define _PAGE_NX (_AT(pteval_t, 0)) +#endif + +#define _PAGE_FILE (_AT(pteval_t, 1) << _PAGE_BIT_FILE) +#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE) + +#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \ + _PAGE_ACCESSED | _PAGE_DIRTY) +#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \ + _PAGE_DIRTY) + +/* Set of bits not changed in pte_modify */ +#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \ + _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY) + +#define _PAGE_CACHE_MASK (_PAGE_PCD | _PAGE_PWT) +#define _PAGE_CACHE_WB (0) +#define _PAGE_CACHE_WC (_PAGE_PWT) +#define _PAGE_CACHE_UC_MINUS (_PAGE_PCD) +#define _PAGE_CACHE_UC (_PAGE_PCD | _PAGE_PWT) + +#define PAGE_NONE __pgprot(_PAGE_PROTNONE | _PAGE_ACCESSED) +#define PAGE_SHARED __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \ + _PAGE_ACCESSED | _PAGE_NX) + +#define PAGE_SHARED_EXEC __pgprot(_PAGE_PRESENT | _PAGE_RW | \ + _PAGE_USER | _PAGE_ACCESSED) +#define PAGE_COPY_NOEXEC __pgprot(_PAGE_PRESENT | _PAGE_USER | \ + _PAGE_ACCESSED | _PAGE_NX) +#define PAGE_COPY_EXEC __pgprot(_PAGE_PRESENT | _PAGE_USER | \ + _PAGE_ACCESSED) +#define PAGE_COPY PAGE_COPY_NOEXEC +#define PAGE_READONLY __pgprot(_PAGE_PRESENT | _PAGE_USER | \ + _PAGE_ACCESSED | _PAGE_NX) +#define PAGE_READONLY_EXEC __pgprot(_PAGE_PRESENT | _PAGE_USER | \ + _PAGE_ACCESSED) + +#define __PAGE_KERNEL_EXEC \ + (_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_GLOBAL) +#define __PAGE_KERNEL (__PAGE_KERNEL_EXEC | _PAGE_NX) + +#define __PAGE_KERNEL_RO (__PAGE_KERNEL & ~_PAGE_RW) +#define __PAGE_KERNEL_RX (__PAGE_KERNEL_EXEC & ~_PAGE_RW) +#define __PAGE_KERNEL_EXEC_NOCACHE (__PAGE_KERNEL_EXEC | _PAGE_PCD | _PAGE_PWT) +#define __PAGE_KERNEL_WC (__PAGE_KERNEL | _PAGE_CACHE_WC) +#define __PAGE_KERNEL_NOCACHE (__PAGE_KERNEL | _PAGE_PCD | _PAGE_PWT) +#define __PAGE_KERNEL_UC_MINUS (__PAGE_KERNEL | _PAGE_PCD) +#define __PAGE_KERNEL_VSYSCALL (__PAGE_KERNEL_RX | _PAGE_USER) +#define __PAGE_KERNEL_VSYSCALL_NOCACHE (__PAGE_KERNEL_VSYSCALL | _PAGE_PCD | _PAGE_PWT) +#define __PAGE_KERNEL_LARGE (__PAGE_KERNEL | _PAGE_PSE) +#define __PAGE_KERNEL_LARGE_NOCACHE (__PAGE_KERNEL | _PAGE_CACHE_UC | _PAGE_PSE) +#define __PAGE_KERNEL_LARGE_EXEC (__PAGE_KERNEL_EXEC | _PAGE_PSE) + +#define __PAGE_KERNEL_IO (__PAGE_KERNEL | _PAGE_IOMAP) +#define __PAGE_KERNEL_IO_NOCACHE (__PAGE_KERNEL_NOCACHE | _PAGE_IOMAP) +#define __PAGE_KERNEL_IO_UC_MINUS (__PAGE_KERNEL_UC_MINUS | _PAGE_IOMAP) +#define __PAGE_KERNEL_IO_WC (__PAGE_KERNEL_WC | _PAGE_IOMAP) + +#define PAGE_KERNEL __pgprot(__PAGE_KERNEL) +#define PAGE_KERNEL_RO __pgprot(__PAGE_KERNEL_RO) +#define PAGE_KERNEL_EXEC __pgprot(__PAGE_KERNEL_EXEC) +#define PAGE_KERNEL_RX __pgprot(__PAGE_KERNEL_RX) +#define PAGE_KERNEL_WC __pgprot(__PAGE_KERNEL_WC) +#define PAGE_KERNEL_NOCACHE __pgprot(__PAGE_KERNEL_NOCACHE) +#define PAGE_KERNEL_UC_MINUS __pgprot(__PAGE_KERNEL_UC_MINUS) +#define PAGE_KERNEL_EXEC_NOCACHE __pgprot(__PAGE_KERNEL_EXEC_NOCACHE) +#define PAGE_KERNEL_LARGE __pgprot(__PAGE_KERNEL_LARGE) +#define PAGE_KERNEL_LARGE_NOCACHE __pgprot(__PAGE_KERNEL_LARGE_NOCACHE) +#define PAGE_KERNEL_LARGE_EXEC __pgprot(__PAGE_KERNEL_LARGE_EXEC) +#define PAGE_KERNEL_VSYSCALL __pgprot(__PAGE_KERNEL_VSYSCALL) +#define PAGE_KERNEL_VSYSCALL_NOCACHE __pgprot(__PAGE_KERNEL_VSYSCALL_NOCACHE) + +#define PAGE_KERNEL_IO __pgprot(__PAGE_KERNEL_IO) +#define PAGE_KERNEL_IO_NOCACHE __pgprot(__PAGE_KERNEL_IO_NOCACHE) +#define PAGE_KERNEL_IO_UC_MINUS __pgprot(__PAGE_KERNEL_IO_UC_MINUS) +#define PAGE_KERNEL_IO_WC __pgprot(__PAGE_KERNEL_IO_WC) + +/* xwr */ +#define __P000 PAGE_NONE +#define __P001 PAGE_READONLY +#define __P010 PAGE_COPY +#define __P011 PAGE_COPY +#define __P100 PAGE_READONLY_EXEC +#define __P101 PAGE_READONLY_EXEC +#define __P110 PAGE_COPY_EXEC +#define __P111 PAGE_COPY_EXEC + +#define __S000 PAGE_NONE +#define __S001 PAGE_READONLY +#define __S010 PAGE_SHARED +#define __S011 PAGE_SHARED +#define __S100 PAGE_READONLY_EXEC +#define __S101 PAGE_READONLY_EXEC +#define __S110 PAGE_SHARED_EXEC +#define __S111 PAGE_SHARED_EXEC + +/* + * early identity mapping pte attrib macros. + */ +#ifdef CONFIG_X86_64 +#define __PAGE_KERNEL_IDENT_LARGE_EXEC __PAGE_KERNEL_LARGE_EXEC +#else +/* + * For PDE_IDENT_ATTR include USER bit. As the PDE and PTE protection + * bits are combined, this will alow user to access the high address mapped + * VDSO in the presence of CONFIG_COMPAT_VDSO + */ +#define PTE_IDENT_ATTR 0x003 /* PRESENT+RW */ +#define PDE_IDENT_ATTR 0x067 /* PRESENT+RW+USER+DIRTY+ACCESSED */ +#define PGD_IDENT_ATTR 0x001 /* PRESENT (no other attributes) */ +#endif + +#ifdef CONFIG_X86_32 +# include "pgtable_32_types.h" +#else +# include "pgtable_64_types.h" +#endif + +#ifndef __ASSEMBLY__ + +#include + +/* PTE_PFN_MASK extracts the PFN from a (pte|pmd|pud|pgd)val_t */ +#define PTE_PFN_MASK ((pteval_t)PHYSICAL_PAGE_MASK) + +/* PTE_FLAGS_MASK extracts the flags from a (pte|pmd|pud|pgd)val_t */ +#define PTE_FLAGS_MASK (~PTE_PFN_MASK) + +typedef struct pgprot { pgprotval_t pgprot; } pgprot_t; + +typedef struct { pgdval_t pgd; } pgd_t; + +static inline pgd_t native_make_pgd(pgdval_t val) +{ + return (pgd_t) { val }; +} + +static inline pgdval_t native_pgd_val(pgd_t pgd) +{ + return pgd.pgd; +} + +static inline pgdval_t pgd_flags(pgd_t pgd) +{ + return native_pgd_val(pgd) & PTE_FLAGS_MASK; +} + +#if PAGETABLE_LEVELS > 3 +typedef struct { pudval_t pud; } pud_t; + +static inline pud_t native_make_pud(pmdval_t val) +{ + return (pud_t) { val }; +} + +static inline pudval_t native_pud_val(pud_t pud) +{ + return pud.pud; +} +#else +#include + +static inline pudval_t native_pud_val(pud_t pud) +{ + return native_pgd_val(pud.pgd); +} +#endif + +#if PAGETABLE_LEVELS > 2 +typedef struct { pmdval_t pmd; } pmd_t; + +static inline pmd_t native_make_pmd(pmdval_t val) +{ + return (pmd_t) { val }; +} + +static inline pmdval_t native_pmd_val(pmd_t pmd) +{ + return pmd.pmd; +} +#else +#include + +static inline pmdval_t native_pmd_val(pmd_t pmd) +{ + return native_pgd_val(pmd.pud.pgd); +} +#endif + +static inline pudval_t pud_flags(pud_t pud) +{ + return native_pud_val(pud) & PTE_FLAGS_MASK; +} + +static inline pmdval_t pmd_flags(pmd_t pmd) +{ + return native_pmd_val(pmd) & PTE_FLAGS_MASK; +} + +static inline pte_t native_make_pte(pteval_t val) +{ + return (pte_t) { .pte = val }; +} + +static inline pteval_t native_pte_val(pte_t pte) +{ + return pte.pte; +} + +static inline pteval_t pte_flags(pte_t pte) +{ + return native_pte_val(pte) & PTE_FLAGS_MASK; +} + +#define pgprot_val(x) ((x).pgprot) +#define __pgprot(x) ((pgprot_t) { (x) } ) + + +typedef struct page *pgtable_t; + +extern pteval_t __supported_pte_mask; +extern int nx_enabled; + +#define pgprot_writecombine pgprot_writecombine +extern pgprot_t pgprot_writecombine(pgprot_t prot); + +/* Indicate that x86 has its own track and untrack pfn vma functions */ +#define __HAVE_PFNMAP_TRACKING + +#define __HAVE_PHYS_MEM_ACCESS_PROT +struct file; +pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn, + unsigned long size, pgprot_t vma_prot); +int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn, + unsigned long size, pgprot_t *vma_prot); + +/* Install a pte for a particular vaddr in kernel space. */ +void set_pte_vaddr(unsigned long vaddr, pte_t pte); + +#ifdef CONFIG_X86_32 +extern void native_pagetable_setup_start(pgd_t *base); +extern void native_pagetable_setup_done(pgd_t *base); +#else +static inline void native_pagetable_setup_start(pgd_t *base) {} +static inline void native_pagetable_setup_done(pgd_t *base) {} +#endif + +struct seq_file; +extern void arch_report_meminfo(struct seq_file *m); + +enum { + PG_LEVEL_NONE, + PG_LEVEL_4K, + PG_LEVEL_2M, + PG_LEVEL_1G, + PG_LEVEL_NUM +}; + +#ifdef CONFIG_PROC_FS +extern void update_page_count(int level, unsigned long pages); +#else +static inline void update_page_count(int level, unsigned long pages) { } +#endif + +/* + * Helper function that returns the kernel pagetable entry controlling + * the virtual address 'address'. NULL means no pagetable entry present. + * NOTE: the return type is pte_t but if the pmd is PSE then we return it + * as a pte too. + */ +extern pte_t *lookup_address(unsigned long address, unsigned int *level); + +#endif /* !__ASSEMBLY__ */ + +#endif /* _ASM_X86_PGTABLE_DEFS_H */ Index: linux-2.6-tip/arch/x86/include/asm/prctl.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/prctl.h +++ linux-2.6-tip/arch/x86/include/asm/prctl.h @@ -6,8 +6,4 @@ #define ARCH_GET_FS 0x1003 #define ARCH_GET_GS 0x1004 -#ifdef CONFIG_X86_64 -extern long sys_arch_prctl(int, unsigned long); -#endif /* CONFIG_X86_64 */ - #endif /* _ASM_X86_PRCTL_H */ Index: linux-2.6-tip/arch/x86/include/asm/processor.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/processor.h +++ linux-2.6-tip/arch/x86/include/asm/processor.h @@ -16,6 +16,7 @@ struct mm_struct; #include #include #include +#include #include #include #include @@ -73,7 +74,7 @@ struct cpuinfo_x86 { char pad0; #else /* Number of 4K pages in DTLB/ITLB combined(in pages): */ - int x86_tlbsize; + int x86_tlbsize; __u8 x86_virt_bits; __u8 x86_phys_bits; #endif @@ -378,9 +379,30 @@ union thread_xstate { #ifdef CONFIG_X86_64 DECLARE_PER_CPU(struct orig_ist, orig_ist); + +union irq_stack_union { + char irq_stack[IRQ_STACK_SIZE]; + /* + * GCC hardcodes the stack canary as %gs:40. Since the + * irq_stack is the object at %gs:0, we reserve the bottom + * 48 bytes of the irq stack for the canary. + */ + struct { + char gs_base[40]; + unsigned long stack_canary; + }; +}; + +DECLARE_PER_CPU(union irq_stack_union, irq_stack_union); +DECLARE_INIT_PER_CPU(irq_stack_union); + +DECLARE_PER_CPU(char *, irq_stack_ptr); +#else /* X86_64 */ +#ifdef CONFIG_CC_STACKPROTECTOR +DECLARE_PER_CPU(unsigned long, stack_canary); #endif +#endif /* X86_64 */ -extern void print_cpu_info(struct cpuinfo_x86 *); extern unsigned int xstate_size; extern void free_thread_xstate(struct task_struct *); extern struct kmem_cache *task_xstate_cachep; @@ -752,9 +774,9 @@ extern int sysenter_setup(void); extern struct desc_ptr early_gdt_descr; extern void cpu_set_gdt(int); -extern void switch_to_new_gdt(void); +extern void switch_to_new_gdt(int); +extern void load_percpu_segment(int); extern void cpu_init(void); -extern void init_gdt(int cpu); static inline unsigned long get_debugctlmsr(void) { @@ -839,6 +861,7 @@ static inline void spin_lock_prefetch(co * User space process size: 3GB (default). */ #define TASK_SIZE PAGE_OFFSET +#define TASK_SIZE_MAX TASK_SIZE #define STACK_TOP TASK_SIZE #define STACK_TOP_MAX STACK_TOP @@ -898,7 +921,7 @@ extern unsigned long thread_saved_pc(str /* * User space process size. 47bits minus one guard page. */ -#define TASK_SIZE64 ((1UL << 47) - PAGE_SIZE) +#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE) /* This decides where the kernel will search for a free chunk of vm * space during mmap's. @@ -907,12 +930,12 @@ extern unsigned long thread_saved_pc(str 0xc0000000 : 0xFFFFe000) #define TASK_SIZE (test_thread_flag(TIF_IA32) ? \ - IA32_PAGE_OFFSET : TASK_SIZE64) + IA32_PAGE_OFFSET : TASK_SIZE_MAX) #define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_IA32)) ? \ - IA32_PAGE_OFFSET : TASK_SIZE64) + IA32_PAGE_OFFSET : TASK_SIZE_MAX) #define STACK_TOP TASK_SIZE -#define STACK_TOP_MAX TASK_SIZE64 +#define STACK_TOP_MAX TASK_SIZE_MAX #define INIT_THREAD { \ .sp0 = (unsigned long)&init_stack + sizeof(init_stack) \ Index: linux-2.6-tip/arch/x86/include/asm/proto.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/proto.h +++ linux-2.6-tip/arch/x86/include/asm/proto.h @@ -18,11 +18,7 @@ extern void syscall32_cpu_init(void); extern void check_efer(void); -#ifdef CONFIG_X86_BIOS_REBOOT extern int reboot_force; -#else -static const int reboot_force = 0; -#endif long do_arch_prctl(struct task_struct *task, int code, unsigned long addr); Index: linux-2.6-tip/arch/x86/include/asm/ptrace.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/ptrace.h +++ linux-2.6-tip/arch/x86/include/asm/ptrace.h @@ -28,7 +28,7 @@ struct pt_regs { int xds; int xes; int xfs; - /* int gs; */ + int xgs; long orig_eax; long eip; int xcs; @@ -50,7 +50,7 @@ struct pt_regs { unsigned long ds; unsigned long es; unsigned long fs; - /* int gs; */ + unsigned long gs; unsigned long orig_ax; unsigned long ip; unsigned long cs; Index: linux-2.6-tip/arch/x86/include/asm/rdc321x_defs.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/rdc321x_defs.h @@ -0,0 +1,12 @@ +#define PFX "rdc321x: " + +/* General purpose configuration and data registers */ +#define RDC3210_CFGREG_ADDR 0x0CF8 +#define RDC3210_CFGREG_DATA 0x0CFC + +#define RDC321X_GPIO_CTRL_REG1 0x48 +#define RDC321X_GPIO_CTRL_REG2 0x84 +#define RDC321X_GPIO_DATA_REG1 0x4c +#define RDC321X_GPIO_DATA_REG2 0x88 + +#define RDC321X_MAX_GPIO 58 Index: linux-2.6-tip/arch/x86/include/asm/segment.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/segment.h +++ linux-2.6-tip/arch/x86/include/asm/segment.h @@ -61,7 +61,7 @@ * * 26 - ESPFIX small SS * 27 - per-cpu [ offset to per-cpu data area ] - * 28 - unused + * 28 - stack_canary-20 [ for stack protector ] * 29 - unused * 30 - unused * 31 - TSS for double fault handler @@ -95,6 +95,13 @@ #define __KERNEL_PERCPU 0 #endif +#define GDT_ENTRY_STACK_CANARY (GDT_ENTRY_KERNEL_BASE + 16) +#ifdef CONFIG_CC_STACKPROTECTOR +#define __KERNEL_STACK_CANARY (GDT_ENTRY_STACK_CANARY * 8) +#else +#define __KERNEL_STACK_CANARY 0 +#endif + #define GDT_ENTRY_DOUBLEFAULT_TSS 31 /* Index: linux-2.6-tip/arch/x86/include/asm/setup.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/setup.h +++ linux-2.6-tip/arch/x86/include/asm/setup.h @@ -1,33 +1,19 @@ #ifndef _ASM_X86_SETUP_H #define _ASM_X86_SETUP_H +#ifdef __KERNEL__ + #define COMMAND_LINE_SIZE 2048 #ifndef __ASSEMBLY__ -/* Interrupt control for vSMPowered x86_64 systems */ -void vsmp_init(void); - - -void setup_bios_corruption_check(void); - - -#ifdef CONFIG_X86_VISWS -extern void visws_early_detect(void); -extern int is_visws_box(void); -#else -static inline void visws_early_detect(void) { } -static inline int is_visws_box(void) { return 0; } -#endif - -extern int wakeup_secondary_cpu_via_nmi(int apicid, unsigned long start_eip); -extern int wakeup_secondary_cpu_via_init(int apicid, unsigned long start_eip); /* * Any setup quirks to be performed? */ struct mpc_cpu; struct mpc_bus; struct mpc_oemtable; + struct x86_quirks { int (*arch_pre_time_init)(void); int (*arch_time_init)(void); @@ -43,20 +29,20 @@ struct x86_quirks { void (*mpc_oem_bus_info)(struct mpc_bus *m, char *name); void (*mpc_oem_pci_bus)(struct mpc_bus *m); void (*smp_read_mpc_oem)(struct mpc_oemtable *oemtable, - unsigned short oemsize); + unsigned short oemsize); int (*setup_ioapic_ids)(void); - int (*update_genapic)(void); + int (*update_apic)(void); }; -extern struct x86_quirks *x86_quirks; -extern unsigned long saved_video_mode; +extern void x86_quirk_pre_intr_init(void); +extern void x86_quirk_intr_init(void); -#ifndef CONFIG_PARAVIRT -#define paravirt_post_allocator_init() do {} while (0) -#endif -#endif /* __ASSEMBLY__ */ +extern void x86_quirk_trap_init(void); -#ifdef __KERNEL__ +extern void x86_quirk_pre_time_init(void); +extern void x86_quirk_time_init(void); + +#endif /* __ASSEMBLY__ */ #ifdef __i386__ @@ -78,6 +64,28 @@ extern unsigned long saved_video_mode; #ifndef __ASSEMBLY__ #include +/* Interrupt control for vSMPowered x86_64 systems */ +void vsmp_init(void); + +void setup_bios_corruption_check(void); + +#ifdef CONFIG_X86_VISWS +extern void visws_early_detect(void); +extern int is_visws_box(void); +#else +static inline void visws_early_detect(void) { } +static inline int is_visws_box(void) { return 0; } +#endif + +extern int wakeup_secondary_cpu_via_nmi(int apicid, unsigned long start_eip); +extern int wakeup_secondary_cpu_via_init(int apicid, unsigned long start_eip); +extern struct x86_quirks *x86_quirks; +extern unsigned long saved_video_mode; + +#ifndef CONFIG_PARAVIRT +#define paravirt_post_allocator_init() do {} while (0) +#endif + #ifndef _SETUP /* @@ -100,7 +108,6 @@ extern unsigned long init_pg_tables_star extern unsigned long init_pg_tables_end; #else -void __init x86_64_init_pda(void); void __init x86_64_start_kernel(char *real_mode); void __init x86_64_start_reservations(char *real_mode_data); Index: linux-2.6-tip/arch/x86/include/asm/setup_arch.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/setup_arch.h @@ -0,0 +1,3 @@ +/* Hook to call BIOS initialisation function */ + +/* no action for generic */ Index: linux-2.6-tip/arch/x86/include/asm/smp.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/smp.h +++ linux-2.6-tip/arch/x86/include/asm/smp.h @@ -15,34 +15,8 @@ # include # endif #endif -#include #include - -#ifdef CONFIG_X86_64 - -extern cpumask_var_t cpu_callin_mask; -extern cpumask_var_t cpu_callout_mask; -extern cpumask_var_t cpu_initialized_mask; -extern cpumask_var_t cpu_sibling_setup_mask; - -#else /* CONFIG_X86_32 */ - -extern cpumask_t cpu_callin_map; -extern cpumask_t cpu_callout_map; -extern cpumask_t cpu_initialized; -extern cpumask_t cpu_sibling_setup_map; - -#define cpu_callin_mask ((struct cpumask *)&cpu_callin_map) -#define cpu_callout_mask ((struct cpumask *)&cpu_callout_map) -#define cpu_initialized_mask ((struct cpumask *)&cpu_initialized) -#define cpu_sibling_setup_mask ((struct cpumask *)&cpu_sibling_setup_map) - -#endif /* CONFIG_X86_32 */ - -extern void (*mtrr_hook)(void); -extern void zap_low_mappings(void); - -extern int __cpuinit get_local_pda(int cpu); +#include extern int smp_num_siblings; extern unsigned int num_processors; @@ -50,9 +24,7 @@ extern unsigned int num_processors; DECLARE_PER_CPU(cpumask_t, cpu_sibling_map); DECLARE_PER_CPU(cpumask_t, cpu_core_map); DECLARE_PER_CPU(u16, cpu_llc_id); -#ifdef CONFIG_X86_32 DECLARE_PER_CPU(int, cpu_number); -#endif static inline struct cpumask *cpu_sibling_mask(int cpu) { @@ -167,8 +139,6 @@ void play_dead_common(void); void native_send_call_func_ipi(const struct cpumask *mask); void native_send_call_func_single_ipi(int cpu); -extern void prefill_possible_map(void); - void smp_store_cpu_info(int id); #define cpu_physical_id(cpu) per_cpu(x86_cpu_to_apicid, cpu) @@ -177,10 +147,6 @@ static inline int num_booting_cpus(void) { return cpumask_weight(cpu_callout_mask); } -#else -static inline void prefill_possible_map(void) -{ -} #endif /* CONFIG_SMP */ extern unsigned disabled_cpus __cpuinitdata; @@ -191,11 +157,11 @@ extern unsigned disabled_cpus __cpuinitd * from the initial startup. We map APIC_BASE very early in page_setup(), * so this is correct in the x86 case. */ -#define raw_smp_processor_id() (x86_read_percpu(cpu_number)) +#define raw_smp_processor_id() (percpu_read(cpu_number)) extern int safe_smp_processor_id(void); #elif defined(CONFIG_X86_64_SMP) -#define raw_smp_processor_id() read_pda(cpunumber) +#define raw_smp_processor_id() (percpu_read(cpu_number)) #define stack_smp_processor_id() \ ({ \ @@ -205,10 +171,6 @@ extern int safe_smp_processor_id(void); }) #define safe_smp_processor_id() smp_processor_id() -#else /* !CONFIG_X86_32_SMP && !CONFIG_X86_64_SMP */ -#define cpu_physical_id(cpu) boot_cpu_physical_apicid -#define safe_smp_processor_id() 0 -#define stack_smp_processor_id() 0 #endif #ifdef CONFIG_X86_LOCAL_APIC @@ -220,28 +182,9 @@ static inline int logical_smp_processor_ return GET_APIC_LOGICAL_ID(*(u32 *)(APIC_BASE + APIC_LDR)); } -#include -static inline unsigned int read_apic_id(void) -{ - unsigned int reg; - - reg = *(u32 *)(APIC_BASE + APIC_ID); - - return GET_APIC_ID(reg); -} #endif - -# if defined(APIC_DEFINITION) || defined(CONFIG_X86_64) extern int hard_smp_processor_id(void); -# else -#include -static inline int hard_smp_processor_id(void) -{ - /* we don't want to mark this access volatile - bad code generation */ - return read_apic_id(); -} -# endif /* APIC_DEFINITION */ #else /* CONFIG_X86_LOCAL_APIC */ @@ -251,11 +194,5 @@ static inline int hard_smp_processor_id( #endif /* CONFIG_X86_LOCAL_APIC */ -#ifdef CONFIG_X86_HAS_BOOT_CPU_ID -extern unsigned char boot_cpu_id; -#else -#define boot_cpu_id 0 -#endif - #endif /* __ASSEMBLY__ */ #endif /* _ASM_X86_SMP_H */ Index: linux-2.6-tip/arch/x86/include/asm/smpboot_hooks.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/smpboot_hooks.h @@ -0,0 +1,61 @@ +/* two abstractions specific to kernel/smpboot.c, mainly to cater to visws + * which needs to alter them. */ + +static inline void smpboot_clear_io_apic_irqs(void) +{ +#ifdef CONFIG_X86_IO_APIC + io_apic_irqs = 0; +#endif +} + +static inline void smpboot_setup_warm_reset_vector(unsigned long start_eip) +{ + CMOS_WRITE(0xa, 0xf); + local_flush_tlb(); + pr_debug("1.\n"); + *((volatile unsigned short *)phys_to_virt(apic->trampoline_phys_high)) = + start_eip >> 4; + pr_debug("2.\n"); + *((volatile unsigned short *)phys_to_virt(apic->trampoline_phys_low)) = + start_eip & 0xf; + pr_debug("3.\n"); +} + +static inline void smpboot_restore_warm_reset_vector(void) +{ + /* + * Install writable page 0 entry to set BIOS data area. + */ + local_flush_tlb(); + + /* + * Paranoid: Set warm reset code and vector here back + * to default values. + */ + CMOS_WRITE(0, 0xf); + + *((volatile long *)phys_to_virt(apic->trampoline_phys_low)) = 0; +} + +static inline void __init smpboot_setup_io_apic(void) +{ +#ifdef CONFIG_X86_IO_APIC + /* + * Here we can be sure that there is an IO-APIC in the system. Let's + * go and set it up: + */ + if (!skip_ioapic_setup && nr_ioapics) + setup_IO_APIC(); + else { + nr_ioapics = 0; + localise_nmi_watchdog(); + } +#endif +} + +static inline void smpboot_clear_io_apic(void) +{ +#ifdef CONFIG_X86_IO_APIC + nr_ioapics = 0; +#endif +} Index: linux-2.6-tip/arch/x86/include/asm/spinlock.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/spinlock.h +++ linux-2.6-tip/arch/x86/include/asm/spinlock.h @@ -58,7 +58,7 @@ #if (NR_CPUS < 256) #define TICKET_SHIFT 8 -static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock) +static __always_inline void __ticket_spin_lock(__raw_spinlock_t *lock) { short inc = 0x0100; @@ -77,7 +77,7 @@ static __always_inline void __ticket_spi : "memory", "cc"); } -static __always_inline int __ticket_spin_trylock(raw_spinlock_t *lock) +static __always_inline int __ticket_spin_trylock(__raw_spinlock_t *lock) { int tmp, new; @@ -96,7 +96,7 @@ static __always_inline int __ticket_spin return tmp; } -static __always_inline void __ticket_spin_unlock(raw_spinlock_t *lock) +static __always_inline void __ticket_spin_unlock(__raw_spinlock_t *lock) { asm volatile(UNLOCK_LOCK_PREFIX "incb %0" : "+m" (lock->slock) @@ -106,7 +106,7 @@ static __always_inline void __ticket_spi #else #define TICKET_SHIFT 16 -static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock) +static __always_inline void __ticket_spin_lock(__raw_spinlock_t *lock) { int inc = 0x00010000; int tmp; @@ -127,7 +127,7 @@ static __always_inline void __ticket_spi : "memory", "cc"); } -static __always_inline int __ticket_spin_trylock(raw_spinlock_t *lock) +static __always_inline int __ticket_spin_trylock(__raw_spinlock_t *lock) { int tmp; int new; @@ -149,7 +149,7 @@ static __always_inline int __ticket_spin return tmp; } -static __always_inline void __ticket_spin_unlock(raw_spinlock_t *lock) +static __always_inline void __ticket_spin_unlock(__raw_spinlock_t *lock) { asm volatile(UNLOCK_LOCK_PREFIX "incw %0" : "+m" (lock->slock) @@ -158,119 +158,57 @@ static __always_inline void __ticket_spi } #endif -static inline int __ticket_spin_is_locked(raw_spinlock_t *lock) +static inline int __ticket_spin_is_locked(__raw_spinlock_t *lock) { int tmp = ACCESS_ONCE(lock->slock); return !!(((tmp >> TICKET_SHIFT) ^ tmp) & ((1 << TICKET_SHIFT) - 1)); } -static inline int __ticket_spin_is_contended(raw_spinlock_t *lock) +static inline int __ticket_spin_is_contended(__raw_spinlock_t *lock) { int tmp = ACCESS_ONCE(lock->slock); return (((tmp >> TICKET_SHIFT) - tmp) & ((1 << TICKET_SHIFT) - 1)) > 1; } -#ifdef CONFIG_PARAVIRT -/* - * Define virtualization-friendly old-style lock byte lock, for use in - * pv_lock_ops if desired. - * - * This differs from the pre-2.6.24 spinlock by always using xchgb - * rather than decb to take the lock; this allows it to use a - * zero-initialized lock structure. It also maintains a 1-byte - * contention counter, so that we can implement - * __byte_spin_is_contended. - */ -struct __byte_spinlock { - s8 lock; - s8 spinners; -}; - -static inline int __byte_spin_is_locked(raw_spinlock_t *lock) -{ - struct __byte_spinlock *bl = (struct __byte_spinlock *)lock; - return bl->lock != 0; -} +#ifndef CONFIG_PARAVIRT -static inline int __byte_spin_is_contended(raw_spinlock_t *lock) -{ - struct __byte_spinlock *bl = (struct __byte_spinlock *)lock; - return bl->spinners != 0; -} - -static inline void __byte_spin_lock(raw_spinlock_t *lock) -{ - struct __byte_spinlock *bl = (struct __byte_spinlock *)lock; - s8 val = 1; - - asm("1: xchgb %1, %0\n" - " test %1,%1\n" - " jz 3f\n" - " " LOCK_PREFIX "incb %2\n" - "2: rep;nop\n" - " cmpb $1, %0\n" - " je 2b\n" - " " LOCK_PREFIX "decb %2\n" - " jmp 1b\n" - "3:" - : "+m" (bl->lock), "+q" (val), "+m" (bl->spinners): : "memory"); -} - -static inline int __byte_spin_trylock(raw_spinlock_t *lock) -{ - struct __byte_spinlock *bl = (struct __byte_spinlock *)lock; - u8 old = 1; - - asm("xchgb %1,%0" - : "+m" (bl->lock), "+q" (old) : : "memory"); - - return old == 0; -} - -static inline void __byte_spin_unlock(raw_spinlock_t *lock) -{ - struct __byte_spinlock *bl = (struct __byte_spinlock *)lock; - smp_wmb(); - bl->lock = 0; -} -#else /* !CONFIG_PARAVIRT */ -static inline int __raw_spin_is_locked(raw_spinlock_t *lock) +static inline int __raw_spin_is_locked(__raw_spinlock_t *lock) { return __ticket_spin_is_locked(lock); } -static inline int __raw_spin_is_contended(raw_spinlock_t *lock) +static inline int __raw_spin_is_contended(__raw_spinlock_t *lock) { return __ticket_spin_is_contended(lock); } #define __raw_spin_is_contended __raw_spin_is_contended -static __always_inline void __raw_spin_lock(raw_spinlock_t *lock) +static __always_inline void __raw_spin_lock(__raw_spinlock_t *lock) { __ticket_spin_lock(lock); } -static __always_inline int __raw_spin_trylock(raw_spinlock_t *lock) +static __always_inline int __raw_spin_trylock(__raw_spinlock_t *lock) { return __ticket_spin_trylock(lock); } -static __always_inline void __raw_spin_unlock(raw_spinlock_t *lock) +static __always_inline void __raw_spin_unlock(__raw_spinlock_t *lock) { __ticket_spin_unlock(lock); } -static __always_inline void __raw_spin_lock_flags(raw_spinlock_t *lock, +static __always_inline void __raw_spin_lock_flags(__raw_spinlock_t *lock, unsigned long flags) { __raw_spin_lock(lock); } -#endif /* CONFIG_PARAVIRT */ +#endif -static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock) +static inline void __raw_spin_unlock_wait(__raw_spinlock_t *lock) { while (__raw_spin_is_locked(lock)) cpu_relax(); @@ -294,7 +232,7 @@ static inline void __raw_spin_unlock_wai * read_can_lock - would read_trylock() succeed? * @lock: the rwlock in question. */ -static inline int __raw_read_can_lock(raw_rwlock_t *lock) +static inline int __raw_read_can_lock(__raw_rwlock_t *lock) { return (int)(lock)->lock > 0; } @@ -303,12 +241,12 @@ static inline int __raw_read_can_lock(ra * write_can_lock - would write_trylock() succeed? * @lock: the rwlock in question. */ -static inline int __raw_write_can_lock(raw_rwlock_t *lock) +static inline int __raw_write_can_lock(__raw_rwlock_t *lock) { return (lock)->lock == RW_LOCK_BIAS; } -static inline void __raw_read_lock(raw_rwlock_t *rw) +static inline void __raw_read_lock(__raw_rwlock_t *rw) { asm volatile(LOCK_PREFIX " subl $1,(%0)\n\t" "jns 1f\n" @@ -317,7 +255,7 @@ static inline void __raw_read_lock(raw_r ::LOCK_PTR_REG (rw) : "memory"); } -static inline void __raw_write_lock(raw_rwlock_t *rw) +static inline void __raw_write_lock(__raw_rwlock_t *rw) { asm volatile(LOCK_PREFIX " subl %1,(%0)\n\t" "jz 1f\n" @@ -326,18 +264,17 @@ static inline void __raw_write_lock(raw_ ::LOCK_PTR_REG (rw), "i" (RW_LOCK_BIAS) : "memory"); } -static inline int __raw_read_trylock(raw_rwlock_t *lock) +static inline int __raw_read_trylock(__raw_rwlock_t *lock) { atomic_t *count = (atomic_t *)lock; - atomic_dec(count); - if (atomic_read(count) >= 0) + if (atomic_dec_return(count) >= 0) return 1; atomic_inc(count); return 0; } -static inline int __raw_write_trylock(raw_rwlock_t *lock) +static inline int __raw_write_trylock(__raw_rwlock_t *lock) { atomic_t *count = (atomic_t *)lock; @@ -347,19 +284,19 @@ static inline int __raw_write_trylock(ra return 0; } -static inline void __raw_read_unlock(raw_rwlock_t *rw) +static inline void __raw_read_unlock(__raw_rwlock_t *rw) { asm volatile(LOCK_PREFIX "incl %0" :"+m" (rw->lock) : : "memory"); } -static inline void __raw_write_unlock(raw_rwlock_t *rw) +static inline void __raw_write_unlock(__raw_rwlock_t *rw) { asm volatile(LOCK_PREFIX "addl %1, %0" : "+m" (rw->lock) : "i" (RW_LOCK_BIAS) : "memory"); } -#define _raw_spin_relax(lock) cpu_relax() -#define _raw_read_relax(lock) cpu_relax() -#define _raw_write_relax(lock) cpu_relax() +#define __raw_spin_relax(lock) cpu_relax() +#define __raw_read_relax(lock) cpu_relax() +#define __raw_write_relax(lock) cpu_relax() #endif /* _ASM_X86_SPINLOCK_H */ Index: linux-2.6-tip/arch/x86/include/asm/stackprotector.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/stackprotector.h @@ -0,0 +1,124 @@ +/* + * GCC stack protector support. + * + * Stack protector works by putting predefined pattern at the start of + * the stack frame and verifying that it hasn't been overwritten when + * returning from the function. The pattern is called stack canary + * and unfortunately gcc requires it to be at a fixed offset from %gs. + * On x86_64, the offset is 40 bytes and on x86_32 20 bytes. x86_64 + * and x86_32 use segment registers differently and thus handles this + * requirement differently. + * + * On x86_64, %gs is shared by percpu area and stack canary. All + * percpu symbols are zero based and %gs points to the base of percpu + * area. The first occupant of the percpu area is always + * irq_stack_union which contains stack_canary at offset 40. Userland + * %gs is always saved and restored on kernel entry and exit using + * swapgs, so stack protector doesn't add any complexity there. + * + * On x86_32, it's slightly more complicated. As in x86_64, %gs is + * used for userland TLS. Unfortunately, some processors are much + * slower at loading segment registers with different value when + * entering and leaving the kernel, so the kernel uses %fs for percpu + * area and manages %gs lazily so that %gs is switched only when + * necessary, usually during task switch. + * + * As gcc requires the stack canary at %gs:20, %gs can't be managed + * lazily if stack protector is enabled, so the kernel saves and + * restores userland %gs on kernel entry and exit. This behavior is + * controlled by CONFIG_X86_32_LAZY_GS and accessors are defined in + * system.h to hide the details. + */ + +#ifndef _ASM_STACKPROTECTOR_H +#define _ASM_STACKPROTECTOR_H 1 + +#ifdef CONFIG_CC_STACKPROTECTOR + +#include +#include +#include +#include +#include +#include + +/* + * 24 byte read-only segment initializer for stack canary. Linker + * can't handle the address bit shifting. Address will be set in + * head_32 for boot CPU and setup_per_cpu_areas() for others. + */ +#define GDT_STACK_CANARY_INIT \ + [GDT_ENTRY_STACK_CANARY] = { { { 0x00000018, 0x00409000 } } }, + +/* + * Initialize the stackprotector canary value. + * + * NOTE: this must only be called from functions that never return, + * and it must always be inlined. + */ +static __always_inline void boot_init_stack_canary(void) +{ + u64 canary; + u64 tsc; + +#ifdef CONFIG_X86_64 + BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40); +#endif + /* + * We both use the random pool and the current TSC as a source + * of randomness. The TSC only matters for very early init, + * there it already has some randomness on most systems. Later + * on during the bootup the random pool has true entropy too. + */ + get_random_bytes(&canary, sizeof(canary)); + tsc = __native_read_tsc(); + canary += tsc + (tsc << 32UL); + + current->stack_canary = canary; +#ifdef CONFIG_X86_64 + percpu_write(irq_stack_union.stack_canary, canary); +#else + percpu_write(stack_canary, canary); +#endif +} + +static inline void setup_stack_canary_segment(int cpu) +{ +#ifdef CONFIG_X86_32 + unsigned long canary = (unsigned long)&per_cpu(stack_canary, cpu) - 20; + struct desc_struct *gdt_table = get_cpu_gdt_table(cpu); + struct desc_struct desc; + + desc = gdt_table[GDT_ENTRY_STACK_CANARY]; + desc.base0 = canary & 0xffff; + desc.base1 = (canary >> 16) & 0xff; + desc.base2 = (canary >> 24) & 0xff; + write_gdt_entry(gdt_table, GDT_ENTRY_STACK_CANARY, &desc, DESCTYPE_S); +#endif +} + +static inline void load_stack_canary_segment(void) +{ +#ifdef CONFIG_X86_32 + asm("mov %0, %%gs" : : "r" (__KERNEL_STACK_CANARY) : "memory"); +#endif +} + +#else /* CC_STACKPROTECTOR */ + +#define GDT_STACK_CANARY_INIT + +/* dummy boot_init_stack_canary() is defined in linux/stackprotector.h */ + +static inline void setup_stack_canary_segment(int cpu) +{ } + +static inline void load_stack_canary_segment(void) +{ +#ifdef CONFIG_X86_32 + asm volatile ("mov %0, %%gs" : : "r" (0)); +#endif +} + +#endif /* CC_STACKPROTECTOR */ +#endif /* _ASM_STACKPROTECTOR_H */ Index: linux-2.6-tip/arch/x86/include/asm/string_32.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/string_32.h +++ linux-2.6-tip/arch/x86/include/asm/string_32.h @@ -177,10 +177,18 @@ static inline void *__memcpy3d(void *to, * No 3D Now! */ +#ifndef CONFIG_KMEMCHECK #define memcpy(t, f, n) \ (__builtin_constant_p((n)) \ ? __constant_memcpy((t), (f), (n)) \ : __memcpy((t), (f), (n))) +#else +/* + * kmemcheck becomes very happy if we use the REP instructions unconditionally, + * because it means that we know both memory operands in advance. + */ +#define memcpy(t, f, n) __memcpy((t), (f), (n)) +#endif #endif Index: linux-2.6-tip/arch/x86/include/asm/string_64.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/string_64.h +++ linux-2.6-tip/arch/x86/include/asm/string_64.h @@ -27,6 +27,7 @@ static __always_inline void *__inline_me function. */ #define __HAVE_ARCH_MEMCPY 1 +#ifndef CONFIG_KMEMCHECK #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4 extern void *memcpy(void *to, const void *from, size_t len); #else @@ -42,6 +43,13 @@ extern void *__memcpy(void *to, const vo __ret; \ }) #endif +#else +/* + * kmemcheck becomes very happy if we use the REP instructions unconditionally, + * because it means that we know both memory operands in advance. + */ +#define memcpy(dst, src, len) __inline_memcpy((dst), (src), (len)) +#endif #define __HAVE_ARCH_MEMSET void *memset(void *s, int c, size_t n); Index: linux-2.6-tip/arch/x86/include/asm/summit/apic.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/summit/apic.h +++ /dev/null @@ -1,202 +0,0 @@ -#ifndef __ASM_SUMMIT_APIC_H -#define __ASM_SUMMIT_APIC_H - -#include -#include - -#define esr_disable (1) -#define NO_BALANCE_IRQ (0) - -/* In clustered mode, the high nibble of APIC ID is a cluster number. - * The low nibble is a 4-bit bitmap. */ -#define XAPIC_DEST_CPUS_SHIFT 4 -#define XAPIC_DEST_CPUS_MASK ((1u << XAPIC_DEST_CPUS_SHIFT) - 1) -#define XAPIC_DEST_CLUSTER_MASK (XAPIC_DEST_CPUS_MASK << XAPIC_DEST_CPUS_SHIFT) - -#define APIC_DFR_VALUE (APIC_DFR_CLUSTER) - -static inline const cpumask_t *target_cpus(void) -{ - /* CPU_MASK_ALL (0xff) has undefined behaviour with - * dest_LowestPrio mode logical clustered apic interrupt routing - * Just start on cpu 0. IRQ balancing will spread load - */ - return &cpumask_of_cpu(0); -} - -#define INT_DELIVERY_MODE (dest_LowestPrio) -#define INT_DEST_MODE 1 /* logical delivery broadcast to all procs */ - -static inline unsigned long check_apicid_used(physid_mask_t bitmap, int apicid) -{ - return 0; -} - -/* we don't use the phys_cpu_present_map to indicate apicid presence */ -static inline unsigned long check_apicid_present(int bit) -{ - return 1; -} - -#define apicid_cluster(apicid) ((apicid) & XAPIC_DEST_CLUSTER_MASK) - -extern u8 cpu_2_logical_apicid[]; - -static inline void init_apic_ldr(void) -{ - unsigned long val, id; - int count = 0; - u8 my_id = (u8)hard_smp_processor_id(); - u8 my_cluster = (u8)apicid_cluster(my_id); -#ifdef CONFIG_SMP - u8 lid; - int i; - - /* Create logical APIC IDs by counting CPUs already in cluster. */ - for (count = 0, i = nr_cpu_ids; --i >= 0; ) { - lid = cpu_2_logical_apicid[i]; - if (lid != BAD_APICID && apicid_cluster(lid) == my_cluster) - ++count; - } -#endif - /* We only have a 4 wide bitmap in cluster mode. If a deranged - * BIOS puts 5 CPUs in one APIC cluster, we're hosed. */ - BUG_ON(count >= XAPIC_DEST_CPUS_SHIFT); - id = my_cluster | (1UL << count); - apic_write(APIC_DFR, APIC_DFR_VALUE); - val = apic_read(APIC_LDR) & ~APIC_LDR_MASK; - val |= SET_APIC_LOGICAL_ID(id); - apic_write(APIC_LDR, val); -} - -static inline int multi_timer_check(int apic, int irq) -{ - return 0; -} - -static inline int apic_id_registered(void) -{ - return 1; -} - -static inline void setup_apic_routing(void) -{ - printk("Enabling APIC mode: Summit. Using %d I/O APICs\n", - nr_ioapics); -} - -static inline int apicid_to_node(int logical_apicid) -{ -#ifdef CONFIG_SMP - return apicid_2_node[hard_smp_processor_id()]; -#else - return 0; -#endif -} - -/* Mapping from cpu number to logical apicid */ -static inline int cpu_to_logical_apicid(int cpu) -{ -#ifdef CONFIG_SMP - if (cpu >= nr_cpu_ids) - return BAD_APICID; - return (int)cpu_2_logical_apicid[cpu]; -#else - return logical_smp_processor_id(); -#endif -} - -static inline int cpu_present_to_apicid(int mps_cpu) -{ - if (mps_cpu < nr_cpu_ids) - return (int)per_cpu(x86_bios_cpu_apicid, mps_cpu); - else - return BAD_APICID; -} - -static inline physid_mask_t ioapic_phys_id_map(physid_mask_t phys_id_map) -{ - /* For clustered we don't have a good way to do this yet - hack */ - return physids_promote(0x0F); -} - -static inline physid_mask_t apicid_to_cpu_present(int apicid) -{ - return physid_mask_of_physid(0); -} - -static inline void setup_portio_remap(void) -{ -} - -static inline int check_phys_apicid_present(int boot_cpu_physical_apicid) -{ - return 1; -} - -static inline void enable_apic_mode(void) -{ -} - -static inline unsigned int cpu_mask_to_apicid(const cpumask_t *cpumask) -{ - int num_bits_set; - int cpus_found = 0; - int cpu; - int apicid; - - num_bits_set = cpus_weight(*cpumask); - /* Return id to all */ - if (num_bits_set >= nr_cpu_ids) - return (int) 0xFF; - /* - * The cpus in the mask must all be on the apic cluster. If are not - * on the same apicid cluster return default value of TARGET_CPUS. - */ - cpu = first_cpu(*cpumask); - apicid = cpu_to_logical_apicid(cpu); - while (cpus_found < num_bits_set) { - if (cpu_isset(cpu, *cpumask)) { - int new_apicid = cpu_to_logical_apicid(cpu); - if (apicid_cluster(apicid) != - apicid_cluster(new_apicid)){ - printk ("%s: Not a valid mask!\n", __func__); - return 0xFF; - } - apicid = apicid | new_apicid; - cpus_found++; - } - cpu++; - } - return apicid; -} - -static inline unsigned int cpu_mask_to_apicid_and(const struct cpumask *inmask, - const struct cpumask *andmask) -{ - int apicid = cpu_to_logical_apicid(0); - cpumask_var_t cpumask; - - if (!alloc_cpumask_var(&cpumask, GFP_ATOMIC)) - return apicid; - - cpumask_and(cpumask, inmask, andmask); - cpumask_and(cpumask, cpumask, cpu_online_mask); - apicid = cpu_mask_to_apicid(cpumask); - - free_cpumask_var(cpumask); - return apicid; -} - -/* cpuid returns the value latched in the HW at reset, not the APIC ID - * register's value. For any box whose BIOS changes APIC IDs, like - * clustered APIC systems, we must use hard_smp_processor_id. - * - * See Intel's IA-32 SW Dev's Manual Vol2 under CPUID. - */ -static inline u32 phys_pkg_id(u32 cpuid_apic, int index_msb) -{ - return hard_smp_processor_id() >> index_msb; -} - -#endif /* __ASM_SUMMIT_APIC_H */ Index: linux-2.6-tip/arch/x86/include/asm/summit/apicdef.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/summit/apicdef.h +++ /dev/null @@ -1,13 +0,0 @@ -#ifndef __ASM_SUMMIT_APICDEF_H -#define __ASM_SUMMIT_APICDEF_H - -#define APIC_ID_MASK (0xFF<<24) - -static inline unsigned get_apic_id(unsigned long x) -{ - return (x>>24)&0xFF; -} - -#define GET_APIC_ID(x) get_apic_id(x) - -#endif Index: linux-2.6-tip/arch/x86/include/asm/summit/ipi.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/summit/ipi.h +++ /dev/null @@ -1,26 +0,0 @@ -#ifndef __ASM_SUMMIT_IPI_H -#define __ASM_SUMMIT_IPI_H - -void send_IPI_mask_sequence(const cpumask_t *mask, int vector); -void send_IPI_mask_allbutself(const cpumask_t *mask, int vector); - -static inline void send_IPI_mask(const cpumask_t *mask, int vector) -{ - send_IPI_mask_sequence(mask, vector); -} - -static inline void send_IPI_allbutself(int vector) -{ - cpumask_t mask = cpu_online_map; - cpu_clear(smp_processor_id(), mask); - - if (!cpus_empty(mask)) - send_IPI_mask(&mask, vector); -} - -static inline void send_IPI_all(int vector) -{ - send_IPI_mask(&cpu_online_map, vector); -} - -#endif /* __ASM_SUMMIT_IPI_H */ Index: linux-2.6-tip/arch/x86/include/asm/summit/mpparse.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/summit/mpparse.h +++ /dev/null @@ -1,109 +0,0 @@ -#ifndef __ASM_SUMMIT_MPPARSE_H -#define __ASM_SUMMIT_MPPARSE_H - -#include - -extern int use_cyclone; - -#ifdef CONFIG_X86_SUMMIT_NUMA -extern void setup_summit(void); -#else -#define setup_summit() {} -#endif - -static inline int mps_oem_check(struct mpc_table *mpc, char *oem, - char *productid) -{ - if (!strncmp(oem, "IBM ENSW", 8) && - (!strncmp(productid, "VIGIL SMP", 9) - || !strncmp(productid, "EXA", 3) - || !strncmp(productid, "RUTHLESS SMP", 12))){ - mark_tsc_unstable("Summit based system"); - use_cyclone = 1; /*enable cyclone-timer*/ - setup_summit(); - return 1; - } - return 0; -} - -/* Hook from generic ACPI tables.c */ -static inline int acpi_madt_oem_check(char *oem_id, char *oem_table_id) -{ - if (!strncmp(oem_id, "IBM", 3) && - (!strncmp(oem_table_id, "SERVIGIL", 8) - || !strncmp(oem_table_id, "EXA", 3))){ - mark_tsc_unstable("Summit based system"); - use_cyclone = 1; /*enable cyclone-timer*/ - setup_summit(); - return 1; - } - return 0; -} - -struct rio_table_hdr { - unsigned char version; /* Version number of this data structure */ - /* Version 3 adds chassis_num & WP_index */ - unsigned char num_scal_dev; /* # of Scalability devices (Twisters for Vigil) */ - unsigned char num_rio_dev; /* # of RIO I/O devices (Cyclones and Winnipegs) */ -} __attribute__((packed)); - -struct scal_detail { - unsigned char node_id; /* Scalability Node ID */ - unsigned long CBAR; /* Address of 1MB register space */ - unsigned char port0node; /* Node ID port connected to: 0xFF=None */ - unsigned char port0port; /* Port num port connected to: 0,1,2, or 0xFF=None */ - unsigned char port1node; /* Node ID port connected to: 0xFF = None */ - unsigned char port1port; /* Port num port connected to: 0,1,2, or 0xFF=None */ - unsigned char port2node; /* Node ID port connected to: 0xFF = None */ - unsigned char port2port; /* Port num port connected to: 0,1,2, or 0xFF=None */ - unsigned char chassis_num; /* 1 based Chassis number (1 = boot node) */ -} __attribute__((packed)); - -struct rio_detail { - unsigned char node_id; /* RIO Node ID */ - unsigned long BBAR; /* Address of 1MB register space */ - unsigned char type; /* Type of device */ - unsigned char owner_id; /* For WPEG: Node ID of Cyclone that owns this WPEG*/ - /* For CYC: Node ID of Twister that owns this CYC */ - unsigned char port0node; /* Node ID port connected to: 0xFF=None */ - unsigned char port0port; /* Port num port connected to: 0,1,2, or 0xFF=None */ - unsigned char port1node; /* Node ID port connected to: 0xFF=None */ - unsigned char port1port; /* Port num port connected to: 0,1,2, or 0xFF=None */ - unsigned char first_slot; /* For WPEG: Lowest slot number below this WPEG */ - /* For CYC: 0 */ - unsigned char status; /* For WPEG: Bit 0 = 1 : the XAPIC is used */ - /* = 0 : the XAPIC is not used, ie:*/ - /* ints fwded to another XAPIC */ - /* Bits1:7 Reserved */ - /* For CYC: Bits0:7 Reserved */ - unsigned char WP_index; /* For WPEG: WPEG instance index - lower ones have */ - /* lower slot numbers/PCI bus numbers */ - /* For CYC: No meaning */ - unsigned char chassis_num; /* 1 based Chassis number */ - /* For LookOut WPEGs this field indicates the */ - /* Expansion Chassis #, enumerated from Boot */ - /* Node WPEG external port, then Boot Node CYC */ - /* external port, then Next Vigil chassis WPEG */ - /* external port, etc. */ - /* Shared Lookouts have only 1 chassis number (the */ - /* first one assigned) */ -} __attribute__((packed)); - - -typedef enum { - CompatTwister = 0, /* Compatibility Twister */ - AltTwister = 1, /* Alternate Twister of internal 8-way */ - CompatCyclone = 2, /* Compatibility Cyclone */ - AltCyclone = 3, /* Alternate Cyclone of internal 8-way */ - CompatWPEG = 4, /* Compatibility WPEG */ - AltWPEG = 5, /* Second Planar WPEG */ - LookOutAWPEG = 6, /* LookOut WPEG */ - LookOutBWPEG = 7, /* LookOut WPEG */ -} node_type; - -static inline int is_WPEG(struct rio_detail *rio){ - return (rio->type == CompatWPEG || rio->type == AltWPEG || - rio->type == LookOutAWPEG || rio->type == LookOutBWPEG); -} - -#endif /* __ASM_SUMMIT_MPPARSE_H */ Index: linux-2.6-tip/arch/x86/include/asm/syscalls.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/syscalls.h +++ linux-2.6-tip/arch/x86/include/asm/syscalls.h @@ -29,21 +29,21 @@ asmlinkage int sys_get_thread_area(struc /* X86_32 only */ #ifdef CONFIG_X86_32 /* kernel/process_32.c */ -asmlinkage int sys_fork(struct pt_regs); -asmlinkage int sys_clone(struct pt_regs); -asmlinkage int sys_vfork(struct pt_regs); -asmlinkage int sys_execve(struct pt_regs); +int sys_fork(struct pt_regs *); +int sys_clone(struct pt_regs *); +int sys_vfork(struct pt_regs *); +int sys_execve(struct pt_regs *); /* kernel/signal_32.c */ asmlinkage int sys_sigsuspend(int, int, old_sigset_t); asmlinkage int sys_sigaction(int, const struct old_sigaction __user *, struct old_sigaction __user *); -asmlinkage int sys_sigaltstack(unsigned long); -asmlinkage unsigned long sys_sigreturn(unsigned long); -asmlinkage int sys_rt_sigreturn(unsigned long); +int sys_sigaltstack(struct pt_regs *); +unsigned long sys_sigreturn(struct pt_regs *); +long sys_rt_sigreturn(struct pt_regs *); /* kernel/ioport.c */ -asmlinkage long sys_iopl(unsigned long); +long sys_iopl(struct pt_regs *); /* kernel/sys_i386_32.c */ asmlinkage long sys_mmap2(unsigned long, unsigned long, unsigned long, @@ -59,8 +59,8 @@ struct oldold_utsname; asmlinkage int sys_olduname(struct oldold_utsname __user *); /* kernel/vm86_32.c */ -asmlinkage int sys_vm86old(struct pt_regs); -asmlinkage int sys_vm86(struct pt_regs); +int sys_vm86old(struct pt_regs *); +int sys_vm86(struct pt_regs *); #else /* CONFIG_X86_32 */ @@ -74,6 +74,7 @@ asmlinkage long sys_vfork(struct pt_regs asmlinkage long sys_execve(char __user *, char __user * __user *, char __user * __user *, struct pt_regs *); +long sys_arch_prctl(int, unsigned long); /* kernel/ioport.c */ asmlinkage long sys_iopl(unsigned int, struct pt_regs *); @@ -81,7 +82,7 @@ asmlinkage long sys_iopl(unsigned int, s /* kernel/signal_64.c */ asmlinkage long sys_sigaltstack(const stack_t __user *, stack_t __user *, struct pt_regs *); -asmlinkage long sys_rt_sigreturn(struct pt_regs *); +long sys_rt_sigreturn(struct pt_regs *); /* kernel/sys_x86_64.c */ asmlinkage long sys_mmap(unsigned long, unsigned long, unsigned long, Index: linux-2.6-tip/arch/x86/include/asm/system.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/system.h +++ linux-2.6-tip/arch/x86/include/asm/system.h @@ -23,6 +23,20 @@ struct task_struct *__switch_to(struct t #ifdef CONFIG_X86_32 +#ifdef CONFIG_CC_STACKPROTECTOR +#define __switch_canary \ + "movl %P[task_canary](%[next]), %%ebx\n\t" \ + "movl %%ebx, "__percpu_arg([stack_canary])"\n\t" +#define __switch_canary_oparam \ + , [stack_canary] "=m" (per_cpu_var(stack_canary)) +#define __switch_canary_iparam \ + , [task_canary] "i" (offsetof(struct task_struct, stack_canary)) +#else /* CC_STACKPROTECTOR */ +#define __switch_canary +#define __switch_canary_oparam +#define __switch_canary_iparam +#endif /* CC_STACKPROTECTOR */ + /* * Saving eflags is important. It switches not only IOPL between tasks, * it also protects other tasks from NT leaking through sysenter etc. @@ -44,6 +58,7 @@ do { \ "movl %[next_sp],%%esp\n\t" /* restore ESP */ \ "movl $1f,%[prev_ip]\n\t" /* save EIP */ \ "pushl %[next_ip]\n\t" /* restore EIP */ \ + __switch_canary \ "jmp __switch_to\n" /* regparm call */ \ "1:\t" \ "popl %%ebp\n\t" /* restore EBP */ \ @@ -58,6 +73,8 @@ do { \ "=b" (ebx), "=c" (ecx), "=d" (edx), \ "=S" (esi), "=D" (edi) \ \ + __switch_canary_oparam \ + \ /* input parameters: */ \ : [next_sp] "m" (next->thread.sp), \ [next_ip] "m" (next->thread.ip), \ @@ -66,6 +83,8 @@ do { \ [prev] "a" (prev), \ [next] "d" (next) \ \ + __switch_canary_iparam \ + \ : /* reloaded segment registers */ \ "memory"); \ } while (0) @@ -86,27 +105,44 @@ do { \ , "rcx", "rbx", "rdx", "r8", "r9", "r10", "r11", \ "r12", "r13", "r14", "r15" +#ifdef CONFIG_CC_STACKPROTECTOR +#define __switch_canary \ + "movq %P[task_canary](%%rsi),%%r8\n\t" \ + "movq %%r8,"__percpu_arg([gs_canary])"\n\t" +#define __switch_canary_oparam \ + , [gs_canary] "=m" (per_cpu_var(irq_stack_union.stack_canary)) +#define __switch_canary_iparam \ + , [task_canary] "i" (offsetof(struct task_struct, stack_canary)) +#else /* CC_STACKPROTECTOR */ +#define __switch_canary +#define __switch_canary_oparam +#define __switch_canary_iparam +#endif /* CC_STACKPROTECTOR */ + /* Save restore flags to clear handle leaking NT */ #define switch_to(prev, next, last) \ - asm volatile(SAVE_CONTEXT \ + asm volatile(SAVE_CONTEXT \ "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */ \ "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */ \ "call __switch_to\n\t" \ ".globl thread_return\n" \ "thread_return:\n\t" \ - "movq %%gs:%P[pda_pcurrent],%%rsi\n\t" \ + "movq "__percpu_arg([current_task])",%%rsi\n\t" \ + __switch_canary \ "movq %P[thread_info](%%rsi),%%r8\n\t" \ - LOCK_PREFIX "btr %[tif_fork],%P[ti_flags](%%r8)\n\t" \ "movq %%rax,%%rdi\n\t" \ - "jc ret_from_fork\n\t" \ + "testl %[_tif_fork],%P[ti_flags](%%r8)\n\t" \ + "jnz ret_from_fork\n\t" \ RESTORE_CONTEXT \ : "=a" (last) \ + __switch_canary_oparam \ : [next] "S" (next), [prev] "D" (prev), \ [threadrsp] "i" (offsetof(struct task_struct, thread.sp)), \ [ti_flags] "i" (offsetof(struct thread_info, flags)), \ - [tif_fork] "i" (TIF_FORK), \ + [_tif_fork] "i" (_TIF_FORK), \ [thread_info] "i" (offsetof(struct task_struct, stack)), \ - [pda_pcurrent] "i" (offsetof(struct x8664_pda, pcurrent)) \ + [current_task] "m" (per_cpu_var(current_task)) \ + __switch_canary_iparam \ : "memory", "cc" __EXTRA_CLOBBER) #endif @@ -165,6 +201,25 @@ extern void native_load_gs_index(unsigne #define savesegment(seg, value) \ asm("mov %%" #seg ",%0":"=r" (value) : : "memory") +/* + * x86_32 user gs accessors. + */ +#ifdef CONFIG_X86_32 +#ifdef CONFIG_X86_32_LAZY_GS +#define get_user_gs(regs) (u16)({unsigned long v; savesegment(gs, v); v;}) +#define set_user_gs(regs, v) loadsegment(gs, (unsigned long)(v)) +#define task_user_gs(tsk) ((tsk)->thread.gs) +#define lazy_save_gs(v) savesegment(gs, (v)) +#define lazy_load_gs(v) loadsegment(gs, (v)) +#else /* X86_32_LAZY_GS */ +#define get_user_gs(regs) (u16)((regs)->gs) +#define set_user_gs(regs, v) do { (regs)->gs = (v); } while (0) +#define task_user_gs(tsk) (task_pt_regs(tsk)->gs) +#define lazy_save_gs(v) do { } while (0) +#define lazy_load_gs(v) do { } while (0) +#endif /* X86_32_LAZY_GS */ +#endif /* X86_32 */ + static inline unsigned long get_limit(unsigned long segment) { unsigned long __limit; Index: linux-2.6-tip/arch/x86/include/asm/thread_info.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/thread_info.h +++ linux-2.6-tip/arch/x86/include/asm/thread_info.h @@ -40,6 +40,7 @@ struct thread_info { */ __u8 supervisor_stack[0]; #endif + int uaccess_err; }; #define INIT_THREAD_INFO(tsk) \ @@ -82,6 +83,7 @@ struct thread_info { #define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */ #define TIF_SECCOMP 8 /* secure computing */ #define TIF_MCE_NOTIFY 10 /* notify userspace of an MCE */ +#define TIF_PERF_COUNTERS 11 /* notify perf counter work */ #define TIF_NOTSC 16 /* TSC is not accessible in userland */ #define TIF_IA32 17 /* 32bit process */ #define TIF_FORK 18 /* ret_from_fork */ @@ -104,6 +106,7 @@ struct thread_info { #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT) #define _TIF_SECCOMP (1 << TIF_SECCOMP) #define _TIF_MCE_NOTIFY (1 << TIF_MCE_NOTIFY) +#define _TIF_PERF_COUNTERS (1 << TIF_PERF_COUNTERS) #define _TIF_NOTSC (1 << TIF_NOTSC) #define _TIF_IA32 (1 << TIF_IA32) #define _TIF_FORK (1 << TIF_FORK) @@ -135,7 +138,7 @@ struct thread_info { /* Only used for 64 bit */ #define _TIF_DO_NOTIFY_MASK \ - (_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_NOTIFY_RESUME) + (_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_PERF_COUNTERS|_TIF_NOTIFY_RESUME) /* flags to check in __switch_to() */ #define _TIF_WORK_CTXSW \ @@ -148,9 +151,9 @@ struct thread_info { /* thread information allocation */ #ifdef CONFIG_DEBUG_STACK_USAGE -#define THREAD_FLAGS (GFP_KERNEL | __GFP_ZERO) +#define THREAD_FLAGS (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO) #else -#define THREAD_FLAGS GFP_KERNEL +#define THREAD_FLAGS (GFP_KERNEL | __GFP_NOTRACK) #endif #define __HAVE_ARCH_THREAD_INFO_ALLOCATOR @@ -194,25 +197,21 @@ static inline struct thread_info *curren #else /* X86_32 */ -#include +#include +#define KERNEL_STACK_OFFSET (5*8) /* * macros/functions for gaining access to the thread information structure * preempt_count needs to be 1 initially, until the scheduler is functional. */ #ifndef __ASSEMBLY__ -static inline struct thread_info *current_thread_info(void) -{ - struct thread_info *ti; - ti = (void *)(read_pda(kernelstack) + PDA_STACKOFFSET - THREAD_SIZE); - return ti; -} +DECLARE_PER_CPU(unsigned long, kernel_stack); -/* do not use in interrupt context */ -static inline struct thread_info *stack_thread_info(void) +static inline struct thread_info *current_thread_info(void) { struct thread_info *ti; - asm("andq %%rsp,%0; " : "=r" (ti) : "0" (~(THREAD_SIZE - 1))); + ti = (void *)(percpu_read(kernel_stack) + + KERNEL_STACK_OFFSET - THREAD_SIZE); return ti; } @@ -220,8 +219,8 @@ static inline struct thread_info *stack_ /* how to get the thread information struct from ASM */ #define GET_THREAD_INFO(reg) \ - movq %gs:pda_kernelstack,reg ; \ - subq $(THREAD_SIZE-PDA_STACKOFFSET),reg + movq PER_CPU_VAR(kernel_stack),reg ; \ + subq $(THREAD_SIZE-KERNEL_STACK_OFFSET),reg #endif Index: linux-2.6-tip/arch/x86/include/asm/timer.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/timer.h +++ linux-2.6-tip/arch/x86/include/asm/timer.h @@ -3,6 +3,7 @@ #include #include #include +#include #define TICK_SIZE (tick_nsec / 1000) @@ -12,6 +13,7 @@ unsigned long native_calibrate_tsc(void) #ifdef CONFIG_X86_32 extern int timer_ack; extern int recalibrate_cpu_khz(void); +extern irqreturn_t timer_interrupt(int irq, void *dev_id); #endif /* CONFIG_X86_32 */ extern int no_timer_check; @@ -56,9 +58,9 @@ static inline unsigned long long cycles_ unsigned long long ns; unsigned long flags; - local_irq_save(flags); + raw_local_irq_save(flags); ns = __cycles_2_ns(cyc); - local_irq_restore(flags); + raw_local_irq_restore(flags); return ns; } Index: linux-2.6-tip/arch/x86/include/asm/tlbflush.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/tlbflush.h +++ linux-2.6-tip/arch/x86/include/asm/tlbflush.h @@ -7,6 +7,21 @@ #include #include +/* + * TLB-flush needs to be nonpreemptible on PREEMPT_RT due to the + * following complex race scenario: + * + * if the current task is lazy-TLB and does a TLB flush and + * gets preempted after the movl %%r3, %0 but before the + * movl %0, %%cr3 then its ->active_mm might change and it will + * install the wrong cr3 when it switches back. This is not a + * problem for the lazy-TLB task itself, but if the next task it + * switches to has an ->mm that is also the lazy-TLB task's + * new ->active_mm, then the scheduler will assume that cr3 is + * the new one, while we overwrote it with the old one. The result + * is the wrong cr3 in the new (non-lazy-TLB) task, which typically + * causes an infinite pagefault upon the next userspace access. + */ #ifdef CONFIG_PARAVIRT #include #else @@ -17,7 +32,9 @@ static inline void __native_flush_tlb(void) { + preempt_disable(); write_cr3(read_cr3()); + preempt_enable(); } static inline void __native_flush_tlb_global(void) @@ -95,6 +112,13 @@ static inline void __flush_tlb_one(unsig static inline void flush_tlb_mm(struct mm_struct *mm) { + /* + * This is safe on PREEMPT_RT because if we preempt + * right after the check but before the __flush_tlb(), + * and if ->active_mm changes, then we might miss a + * TLB flush, but that TLB flush happened already when + * ->active_mm was changed: + */ if (mm == current->active_mm) __flush_tlb(); } @@ -113,7 +137,7 @@ static inline void flush_tlb_range(struc __flush_tlb(); } -static inline void native_flush_tlb_others(const cpumask_t *cpumask, +static inline void native_flush_tlb_others(const struct cpumask *cpumask, struct mm_struct *mm, unsigned long va) { @@ -142,31 +166,28 @@ static inline void flush_tlb_range(struc flush_tlb_mm(vma->vm_mm); } -void native_flush_tlb_others(const cpumask_t *cpumask, struct mm_struct *mm, - unsigned long va); +void native_flush_tlb_others(const struct cpumask *cpumask, + struct mm_struct *mm, unsigned long va); #define TLBSTATE_OK 1 #define TLBSTATE_LAZY 2 -#ifdef CONFIG_X86_32 struct tlb_state { struct mm_struct *active_mm; int state; - char __cacheline_padding[L1_CACHE_BYTES-8]; }; DECLARE_PER_CPU(struct tlb_state, cpu_tlbstate); -void reset_lazy_tlbstate(void); -#else static inline void reset_lazy_tlbstate(void) { + percpu_write(cpu_tlbstate.state, 0); + percpu_write(cpu_tlbstate.active_mm, &init_mm); } -#endif #endif /* SMP */ #ifndef CONFIG_PARAVIRT -#define flush_tlb_others(mask, mm, va) native_flush_tlb_others(&mask, mm, va) +#define flush_tlb_others(mask, mm, va) native_flush_tlb_others(mask, mm, va) #endif static inline void flush_tlb_kernel_range(unsigned long start, @@ -175,4 +196,6 @@ static inline void flush_tlb_kernel_rang flush_tlb_all(); } +extern void zap_low_mappings(void); + #endif /* _ASM_X86_TLBFLUSH_H */ Index: linux-2.6-tip/arch/x86/include/asm/topology.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/topology.h +++ linux-2.6-tip/arch/x86/include/asm/topology.h @@ -74,6 +74,8 @@ static inline const struct cpumask *cpum return &node_to_cpumask_map[node]; } +static inline void setup_node_to_cpumask_map(void) { } + #else /* CONFIG_X86_64 */ /* Mappings between node number and cpus on that node. */ @@ -83,7 +85,8 @@ extern cpumask_t *node_to_cpumask_map; DECLARE_EARLY_PER_CPU(int, x86_cpu_to_node_map); /* Returns the number of the current Node. */ -#define numa_node_id() read_pda(nodenumber) +DECLARE_PER_CPU(int, node_number); +#define numa_node_id() percpu_read(node_number) #ifdef CONFIG_DEBUG_PER_CPU_MAPS extern int cpu_to_node(int cpu); @@ -102,10 +105,7 @@ static inline int cpu_to_node(int cpu) /* Same function but used if called before per_cpu areas are setup */ static inline int early_cpu_to_node(int cpu) { - if (early_per_cpu_ptr(x86_cpu_to_node_map)) - return early_per_cpu_ptr(x86_cpu_to_node_map)[cpu]; - - return per_cpu(x86_cpu_to_node_map, cpu); + return early_per_cpu(x86_cpu_to_node_map, cpu); } /* Returns a pointer to the cpumask of CPUs on Node 'node'. */ @@ -122,6 +122,8 @@ static inline cpumask_t node_to_cpumask( #endif /* !CONFIG_DEBUG_PER_CPU_MAPS */ +extern void setup_node_to_cpumask_map(void); + /* * Replace default node_to_cpumask_ptr with optimized version * Deprecated: use "const struct cpumask *mask = cpumask_of_node(node)" @@ -192,9 +194,20 @@ extern int __node_distance(int, int); #else /* !CONFIG_NUMA */ -#define numa_node_id() 0 -#define cpu_to_node(cpu) 0 -#define early_cpu_to_node(cpu) 0 +static inline int numa_node_id(void) +{ + return 0; +} + +static inline int cpu_to_node(int cpu) +{ + return 0; +} + +static inline int early_cpu_to_node(int cpu) +{ + return 0; +} static inline const cpumask_t *cpumask_of_node(int node) { @@ -209,6 +222,8 @@ static inline int node_to_first_cpu(int return first_cpu(cpu_online_map); } +static inline void setup_node_to_cpumask_map(void) { } + /* * Replace default node_to_cpumask_ptr with optimized version * Deprecated: use "const struct cpumask *mask = cpumask_of_node(node)" Index: linux-2.6-tip/arch/x86/include/asm/trampoline.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/trampoline.h +++ linux-2.6-tip/arch/x86/include/asm/trampoline.h @@ -13,6 +13,7 @@ extern unsigned char *trampoline_base; extern unsigned long init_rsp; extern unsigned long initial_code; +extern unsigned long initial_gs; #define TRAMPOLINE_SIZE roundup(trampoline_end - trampoline_data, PAGE_SIZE) #define TRAMPOLINE_BASE 0x6000 Index: linux-2.6-tip/arch/x86/include/asm/traps.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/traps.h +++ linux-2.6-tip/arch/x86/include/asm/traps.h @@ -41,7 +41,7 @@ dotraplinkage void do_int3(struct pt_reg dotraplinkage void do_overflow(struct pt_regs *, long); dotraplinkage void do_bounds(struct pt_regs *, long); dotraplinkage void do_invalid_op(struct pt_regs *, long); -dotraplinkage void do_device_not_available(struct pt_regs); +dotraplinkage void do_device_not_available(struct pt_regs *, long); dotraplinkage void do_coprocessor_segment_overrun(struct pt_regs *, long); dotraplinkage void do_invalid_TSS(struct pt_regs *, long); dotraplinkage void do_segment_not_present(struct pt_regs *, long); Index: linux-2.6-tip/arch/x86/include/asm/uaccess.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/uaccess.h +++ linux-2.6-tip/arch/x86/include/asm/uaccess.h @@ -121,7 +121,7 @@ extern int __get_user_bad(void); #define __get_user_x(size, ret, x, ptr) \ asm volatile("call __get_user_" #size \ - : "=a" (ret),"=d" (x) \ + : "=a" (ret), "=d" (x) \ : "0" (ptr)) \ /* Careful: we have to cast the result to the type of the pointer @@ -181,12 +181,12 @@ extern int __get_user_bad(void); #define __put_user_x(size, x, ptr, __ret_pu) \ asm volatile("call __put_user_" #size : "=a" (__ret_pu) \ - :"0" ((typeof(*(ptr)))(x)), "c" (ptr) : "ebx") + : "0" ((typeof(*(ptr)))(x)), "c" (ptr) : "ebx") #ifdef CONFIG_X86_32 -#define __put_user_u64(x, addr, err) \ +#define __put_user_asm_u64(x, addr, err, errret) \ asm volatile("1: movl %%eax,0(%2)\n" \ "2: movl %%edx,4(%2)\n" \ "3:\n" \ @@ -197,14 +197,24 @@ extern int __get_user_bad(void); _ASM_EXTABLE(1b, 4b) \ _ASM_EXTABLE(2b, 4b) \ : "=r" (err) \ - : "A" (x), "r" (addr), "i" (-EFAULT), "0" (err)) + : "A" (x), "r" (addr), "i" (errret), "0" (err)) + +#define __put_user_asm_ex_u64(x, addr) \ + asm volatile("1: movl %%eax,0(%1)\n" \ + "2: movl %%edx,4(%1)\n" \ + "3:\n" \ + _ASM_EXTABLE(1b, 2b - 1b) \ + _ASM_EXTABLE(2b, 3b - 2b) \ + : : "A" (x), "r" (addr)) #define __put_user_x8(x, ptr, __ret_pu) \ asm volatile("call __put_user_8" : "=a" (__ret_pu) \ : "A" ((typeof(*(ptr)))(x)), "c" (ptr) : "ebx") #else -#define __put_user_u64(x, ptr, retval) \ - __put_user_asm(x, ptr, retval, "q", "", "Zr", -EFAULT) +#define __put_user_asm_u64(x, ptr, retval, errret) \ + __put_user_asm(x, ptr, retval, "q", "", "Zr", errret) +#define __put_user_asm_ex_u64(x, addr) \ + __put_user_asm_ex(x, addr, "q", "", "Zr") #define __put_user_x8(x, ptr, __ret_pu) __put_user_x(8, x, ptr, __ret_pu) #endif @@ -276,10 +286,32 @@ do { \ __put_user_asm(x, ptr, retval, "w", "w", "ir", errret); \ break; \ case 4: \ - __put_user_asm(x, ptr, retval, "l", "k", "ir", errret);\ + __put_user_asm(x, ptr, retval, "l", "k", "ir", errret); \ break; \ case 8: \ - __put_user_u64((__typeof__(*ptr))(x), ptr, retval); \ + __put_user_asm_u64((__typeof__(*ptr))(x), ptr, retval, \ + errret); \ + break; \ + default: \ + __put_user_bad(); \ + } \ +} while (0) + +#define __put_user_size_ex(x, ptr, size) \ +do { \ + __chk_user_ptr(ptr); \ + switch (size) { \ + case 1: \ + __put_user_asm_ex(x, ptr, "b", "b", "iq"); \ + break; \ + case 2: \ + __put_user_asm_ex(x, ptr, "w", "w", "ir"); \ + break; \ + case 4: \ + __put_user_asm_ex(x, ptr, "l", "k", "ir"); \ + break; \ + case 8: \ + __put_user_asm_ex_u64((__typeof__(*ptr))(x), ptr); \ break; \ default: \ __put_user_bad(); \ @@ -311,9 +343,12 @@ do { \ #ifdef CONFIG_X86_32 #define __get_user_asm_u64(x, ptr, retval, errret) (x) = __get_user_bad() +#define __get_user_asm_ex_u64(x, ptr) (x) = __get_user_bad() #else #define __get_user_asm_u64(x, ptr, retval, errret) \ __get_user_asm(x, ptr, retval, "q", "", "=r", errret) +#define __get_user_asm_ex_u64(x, ptr) \ + __get_user_asm_ex(x, ptr, "q", "", "=r") #endif #define __get_user_size(x, ptr, size, retval, errret) \ @@ -350,6 +385,33 @@ do { \ : "=r" (err), ltype(x) \ : "m" (__m(addr)), "i" (errret), "0" (err)) +#define __get_user_size_ex(x, ptr, size) \ +do { \ + __chk_user_ptr(ptr); \ + switch (size) { \ + case 1: \ + __get_user_asm_ex(x, ptr, "b", "b", "=q"); \ + break; \ + case 2: \ + __get_user_asm_ex(x, ptr, "w", "w", "=r"); \ + break; \ + case 4: \ + __get_user_asm_ex(x, ptr, "l", "k", "=r"); \ + break; \ + case 8: \ + __get_user_asm_ex_u64(x, ptr); \ + break; \ + default: \ + (x) = __get_user_bad(); \ + } \ +} while (0) + +#define __get_user_asm_ex(x, addr, itype, rtype, ltype) \ + asm volatile("1: mov"itype" %1,%"rtype"0\n" \ + "2:\n" \ + _ASM_EXTABLE(1b, 2b - 1b) \ + : ltype(x) : "m" (__m(addr))) + #define __put_user_nocheck(x, ptr, size) \ ({ \ int __pu_err; \ @@ -385,6 +447,26 @@ struct __large_struct { unsigned long bu _ASM_EXTABLE(1b, 3b) \ : "=r"(err) \ : ltype(x), "m" (__m(addr)), "i" (errret), "0" (err)) + +#define __put_user_asm_ex(x, addr, itype, rtype, ltype) \ + asm volatile("1: mov"itype" %"rtype"0,%1\n" \ + "2:\n" \ + _ASM_EXTABLE(1b, 2b - 1b) \ + : : ltype(x), "m" (__m(addr))) + +/* + * uaccess_try and catch + */ +#define uaccess_try do { \ + int prev_err = current_thread_info()->uaccess_err; \ + current_thread_info()->uaccess_err = 0; \ + barrier(); + +#define uaccess_catch(err) \ + (err) |= current_thread_info()->uaccess_err; \ + current_thread_info()->uaccess_err = prev_err; \ +} while (0) + /** * __get_user: - Get a simple variable from user space, with less checking. * @x: Variable to store result. @@ -408,6 +490,7 @@ struct __large_struct { unsigned long bu #define __get_user(x, ptr) \ __get_user_nocheck((x), (ptr), sizeof(*(ptr))) + /** * __put_user: - Write a simple value into user space, with less checking. * @x: Value to copy to user space. @@ -435,6 +518,45 @@ struct __large_struct { unsigned long bu #define __put_user_unaligned __put_user /* + * {get|put}_user_try and catch + * + * get_user_try { + * get_user_ex(...); + * } get_user_catch(err) + */ +#define get_user_try uaccess_try +#define get_user_catch(err) uaccess_catch(err) + +#define get_user_ex(x, ptr) do { \ + unsigned long __gue_val; \ + __get_user_size_ex((__gue_val), (ptr), (sizeof(*(ptr)))); \ + (x) = (__force __typeof__(*(ptr)))__gue_val; \ +} while (0) + +#ifdef CONFIG_X86_WP_WORKS_OK + +#define put_user_try uaccess_try +#define put_user_catch(err) uaccess_catch(err) + +#define put_user_ex(x, ptr) \ + __put_user_size_ex((__typeof__(*(ptr)))(x), (ptr), sizeof(*(ptr))) + +#else /* !CONFIG_X86_WP_WORKS_OK */ + +#define put_user_try do { \ + int __uaccess_err = 0; + +#define put_user_catch(err) \ + (err) |= __uaccess_err; \ +} while (0) + +#define put_user_ex(x, ptr) do { \ + __uaccess_err |= __put_user(x, ptr); \ +} while (0) + +#endif /* CONFIG_X86_WP_WORKS_OK */ + +/* * movsl can be slow when source and dest are not both 8-byte aligned */ #ifdef CONFIG_X86_INTEL_USERCOPY Index: linux-2.6-tip/arch/x86/include/asm/unistd_32.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/unistd_32.h +++ linux-2.6-tip/arch/x86/include/asm/unistd_32.h @@ -338,6 +338,7 @@ #define __NR_dup3 330 #define __NR_pipe2 331 #define __NR_inotify_init1 332 +#define __NR_perf_counter_open 333 #ifdef __KERNEL__ Index: linux-2.6-tip/arch/x86/include/asm/unistd_64.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/unistd_64.h +++ linux-2.6-tip/arch/x86/include/asm/unistd_64.h @@ -653,7 +653,8 @@ __SYSCALL(__NR_dup3, sys_dup3) __SYSCALL(__NR_pipe2, sys_pipe2) #define __NR_inotify_init1 294 __SYSCALL(__NR_inotify_init1, sys_inotify_init1) - +#define __NR_perf_counter_open 295 +__SYSCALL(__NR_perf_counter_open, sys_perf_counter_open) #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR Index: linux-2.6-tip/arch/x86/include/asm/uv/uv.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/include/asm/uv/uv.h @@ -0,0 +1,36 @@ +#ifndef _ASM_X86_UV_UV_H +#define _ASM_X86_UV_UV_H + +enum uv_system_type {UV_NONE, UV_LEGACY_APIC, UV_X2APIC, UV_NON_UNIQUE_APIC}; + +struct cpumask; +struct mm_struct; + +#ifdef CONFIG_X86_UV + +extern enum uv_system_type get_uv_system_type(void); +extern int is_uv_system(void); +extern void uv_cpu_init(void); +extern void uv_system_init(void); +extern int uv_wakeup_secondary(int phys_apicid, unsigned int start_rip); +extern const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask, + struct mm_struct *mm, + unsigned long va, + unsigned int cpu); + +#else /* X86_UV */ + +static inline enum uv_system_type get_uv_system_type(void) { return UV_NONE; } +static inline int is_uv_system(void) { return 0; } +static inline void uv_cpu_init(void) { } +static inline void uv_system_init(void) { } +static inline int uv_wakeup_secondary(int phys_apicid, unsigned int start_rip) +{ return 1; } +static inline const struct cpumask * +uv_flush_tlb_others(const struct cpumask *cpumask, struct mm_struct *mm, + unsigned long va, unsigned int cpu) +{ return cpumask; } + +#endif /* X86_UV */ + +#endif /* _ASM_X86_UV_UV_H */ Index: linux-2.6-tip/arch/x86/include/asm/uv/uv_bau.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/uv/uv_bau.h +++ linux-2.6-tip/arch/x86/include/asm/uv/uv_bau.h @@ -325,7 +325,6 @@ static inline void bau_cpubits_clear(str #define cpubit_isset(cpu, bau_local_cpumask) \ test_bit((cpu), (bau_local_cpumask).bits) -extern int uv_flush_tlb_others(cpumask_t *, struct mm_struct *, unsigned long); extern void uv_bau_message_intr1(void); extern void uv_bau_timeout_intr1(void); Index: linux-2.6-tip/arch/x86/include/asm/vic.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/vic.h +++ /dev/null @@ -1,61 +0,0 @@ -/* Copyright (C) 1999,2001 - * - * Author: J.E.J.Bottomley@HansenPartnership.com - * - * Standard include definitions for the NCR Voyager Interrupt Controller */ - -/* The eight CPI vectors. To activate a CPI, you write a bit mask - * corresponding to the processor set to be interrupted into the - * relevant register. That set of CPUs will then be interrupted with - * the CPI */ -static const int VIC_CPI_Registers[] = - {0xFC00, 0xFC01, 0xFC08, 0xFC09, - 0xFC10, 0xFC11, 0xFC18, 0xFC19 }; - -#define VIC_PROC_WHO_AM_I 0xfc29 -# define QUAD_IDENTIFIER 0xC0 -# define EIGHT_SLOT_IDENTIFIER 0xE0 -#define QIC_EXTENDED_PROCESSOR_SELECT 0xFC72 -#define VIC_CPI_BASE_REGISTER 0xFC41 -#define VIC_PROCESSOR_ID 0xFC21 -# define VIC_CPU_MASQUERADE_ENABLE 0x8 - -#define VIC_CLAIM_REGISTER_0 0xFC38 -#define VIC_CLAIM_REGISTER_1 0xFC39 -#define VIC_REDIRECT_REGISTER_0 0xFC60 -#define VIC_REDIRECT_REGISTER_1 0xFC61 -#define VIC_PRIORITY_REGISTER 0xFC20 - -#define VIC_PRIMARY_MC_BASE 0xFC48 -#define VIC_SECONDARY_MC_BASE 0xFC49 - -#define QIC_PROCESSOR_ID 0xFC71 -# define QIC_CPUID_ENABLE 0x08 - -#define QIC_VIC_CPI_BASE_REGISTER 0xFC79 -#define QIC_CPI_BASE_REGISTER 0xFC7A - -#define QIC_MASK_REGISTER0 0xFC80 -/* NOTE: these are masked high, enabled low */ -# define QIC_PERF_TIMER 0x01 -# define QIC_LPE 0x02 -# define QIC_SYS_INT 0x04 -# define QIC_CMN_INT 0x08 -/* at the moment, just enable CMN_INT, disable SYS_INT */ -# define QIC_DEFAULT_MASK0 (~(QIC_CMN_INT /* | VIC_SYS_INT */)) -#define QIC_MASK_REGISTER1 0xFC81 -# define QIC_BOOT_CPI_MASK 0xFE -/* Enable CPI's 1-6 inclusive */ -# define QIC_CPI_ENABLE 0x81 - -#define QIC_INTERRUPT_CLEAR0 0xFC8A -#define QIC_INTERRUPT_CLEAR1 0xFC8B - -/* this is where we place the CPI vectors */ -#define VIC_DEFAULT_CPI_BASE 0xC0 -/* this is where we place the QIC CPI vectors */ -#define QIC_DEFAULT_CPI_BASE 0xD0 - -#define VIC_BOOT_INTERRUPT_MASK 0xfe - -extern void smp_vic_timer_interrupt(void); Index: linux-2.6-tip/arch/x86/include/asm/voyager.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/voyager.h +++ /dev/null @@ -1,529 +0,0 @@ -/* Copyright (C) 1999,2001 - * - * Author: J.E.J.Bottomley@HansenPartnership.com - * - * Standard include definitions for the NCR Voyager system */ - -#undef VOYAGER_DEBUG -#undef VOYAGER_CAT_DEBUG - -#ifdef VOYAGER_DEBUG -#define VDEBUG(x) printk x -#else -#define VDEBUG(x) -#endif - -/* There are three levels of voyager machine: 3,4 and 5. The rule is - * if it's less than 3435 it's a Level 3 except for a 3360 which is - * a level 4. A 3435 or above is a Level 5 */ -#define VOYAGER_LEVEL5_AND_ABOVE 0x3435 -#define VOYAGER_LEVEL4 0x3360 - -/* The L4 DINO ASIC */ -#define VOYAGER_DINO 0x43 - -/* voyager ports in standard I/O space */ -#define VOYAGER_MC_SETUP 0x96 - - -#define VOYAGER_CAT_CONFIG_PORT 0x97 -# define VOYAGER_CAT_DESELECT 0xff -#define VOYAGER_SSPB_RELOCATION_PORT 0x98 - -/* Valid CAT controller commands */ -/* start instruction register cycle */ -#define VOYAGER_CAT_IRCYC 0x01 -/* start data register cycle */ -#define VOYAGER_CAT_DRCYC 0x02 -/* move to execute state */ -#define VOYAGER_CAT_RUN 0x0F -/* end operation */ -#define VOYAGER_CAT_END 0x80 -/* hold in idle state */ -#define VOYAGER_CAT_HOLD 0x90 -/* single step an "intest" vector */ -#define VOYAGER_CAT_STEP 0xE0 -/* return cat controller to CLEMSON mode */ -#define VOYAGER_CAT_CLEMSON 0xFF - -/* the default cat command header */ -#define VOYAGER_CAT_HEADER 0x7F - -/* the range of possible CAT module ids in the system */ -#define VOYAGER_MIN_MODULE 0x10 -#define VOYAGER_MAX_MODULE 0x1f - -/* The voyager registers per asic */ -#define VOYAGER_ASIC_ID_REG 0x00 -#define VOYAGER_ASIC_TYPE_REG 0x01 -/* the sub address registers can be made auto incrementing on reads */ -#define VOYAGER_AUTO_INC_REG 0x02 -# define VOYAGER_AUTO_INC 0x04 -# define VOYAGER_NO_AUTO_INC 0xfb -#define VOYAGER_SUBADDRDATA 0x03 -#define VOYAGER_SCANPATH 0x05 -# define VOYAGER_CONNECT_ASIC 0x01 -# define VOYAGER_DISCONNECT_ASIC 0xfe -#define VOYAGER_SUBADDRLO 0x06 -#define VOYAGER_SUBADDRHI 0x07 -#define VOYAGER_SUBMODSELECT 0x08 -#define VOYAGER_SUBMODPRESENT 0x09 - -#define VOYAGER_SUBADDR_LO 0xff -#define VOYAGER_SUBADDR_HI 0xffff - -/* the maximum size of a scan path -- used to form instructions */ -#define VOYAGER_MAX_SCAN_PATH 0x100 -/* the biggest possible register size (in bytes) */ -#define VOYAGER_MAX_REG_SIZE 4 - -/* Total number of possible modules (including submodules) */ -#define VOYAGER_MAX_MODULES 16 -/* Largest number of asics per module */ -#define VOYAGER_MAX_ASICS_PER_MODULE 7 - -/* the CAT asic of each module is always the first one */ -#define VOYAGER_CAT_ID 0 -#define VOYAGER_PSI 0x1a - -/* voyager instruction operations and registers */ -#define VOYAGER_READ_CONFIG 0x1 -#define VOYAGER_WRITE_CONFIG 0x2 -#define VOYAGER_BYPASS 0xff - -typedef struct voyager_asic { - __u8 asic_addr; /* ASIC address; Level 4 */ - __u8 asic_type; /* ASIC type */ - __u8 asic_id; /* ASIC id */ - __u8 jtag_id[4]; /* JTAG id */ - __u8 asic_location; /* Location within scan path; start w/ 0 */ - __u8 bit_location; /* Location within bit stream; start w/ 0 */ - __u8 ireg_length; /* Instruction register length */ - __u16 subaddr; /* Amount of sub address space */ - struct voyager_asic *next; /* Next asic in linked list */ -} voyager_asic_t; - -typedef struct voyager_module { - __u8 module_addr; /* Module address */ - __u8 scan_path_connected; /* Scan path connected */ - __u16 ee_size; /* Size of the EEPROM */ - __u16 num_asics; /* Number of Asics */ - __u16 inst_bits; /* Instruction bits in the scan path */ - __u16 largest_reg; /* Largest register in the scan path */ - __u16 smallest_reg; /* Smallest register in the scan path */ - voyager_asic_t *asic; /* First ASIC in scan path (CAT_I) */ - struct voyager_module *submodule; /* Submodule pointer */ - struct voyager_module *next; /* Next module in linked list */ -} voyager_module_t; - -typedef struct voyager_eeprom_hdr { - __u8 module_id[4]; - __u8 version_id; - __u8 config_id; - __u16 boundry_id; /* boundary scan id */ - __u16 ee_size; /* size of EEPROM */ - __u8 assembly[11]; /* assembly # */ - __u8 assembly_rev; /* assembly rev */ - __u8 tracer[4]; /* tracer number */ - __u16 assembly_cksum; /* asm checksum */ - __u16 power_consump; /* pwr requirements */ - __u16 num_asics; /* number of asics */ - __u16 bist_time; /* min. bist time */ - __u16 err_log_offset; /* error log offset */ - __u16 scan_path_offset;/* scan path offset */ - __u16 cct_offset; - __u16 log_length; /* length of err log */ - __u16 xsum_end; /* offset to end of - checksum */ - __u8 reserved[4]; - __u8 sflag; /* starting sentinal */ - __u8 part_number[13]; /* prom part number */ - __u8 version[10]; /* version number */ - __u8 signature[8]; - __u16 eeprom_chksum; - __u32 data_stamp_offset; - __u8 eflag ; /* ending sentinal */ -} __attribute__((packed)) voyager_eprom_hdr_t; - - - -#define VOYAGER_EPROM_SIZE_OFFSET \ - ((__u16)(&(((voyager_eprom_hdr_t *)0)->ee_size))) -#define VOYAGER_XSUM_END_OFFSET 0x2a - -/* the following three definitions are for internal table layouts - * in the module EPROMs. We really only care about the IDs and - * offsets */ -typedef struct voyager_sp_table { - __u8 asic_id; - __u8 bypass_flag; - __u16 asic_data_offset; - __u16 config_data_offset; -} __attribute__((packed)) voyager_sp_table_t; - -typedef struct voyager_jtag_table { - __u8 icode[4]; - __u8 runbist[4]; - __u8 intest[4]; - __u8 samp_preld[4]; - __u8 ireg_len; -} __attribute__((packed)) voyager_jtt_t; - -typedef struct voyager_asic_data_table { - __u8 jtag_id[4]; - __u16 length_bsr; - __u16 length_bist_reg; - __u32 bist_clk; - __u16 subaddr_bits; - __u16 seed_bits; - __u16 sig_bits; - __u16 jtag_offset; -} __attribute__((packed)) voyager_at_t; - -/* Voyager Interrupt Controller (VIC) registers */ - -/* Base to add to Cross Processor Interrupts (CPIs) when triggering - * the CPU IRQ line */ -/* register defines for the WCBICs (one per processor) */ -#define VOYAGER_WCBIC0 0x41 /* bus A node P1 processor 0 */ -#define VOYAGER_WCBIC1 0x49 /* bus A node P1 processor 1 */ -#define VOYAGER_WCBIC2 0x51 /* bus A node P2 processor 0 */ -#define VOYAGER_WCBIC3 0x59 /* bus A node P2 processor 1 */ -#define VOYAGER_WCBIC4 0x61 /* bus B node P1 processor 0 */ -#define VOYAGER_WCBIC5 0x69 /* bus B node P1 processor 1 */ -#define VOYAGER_WCBIC6 0x71 /* bus B node P2 processor 0 */ -#define VOYAGER_WCBIC7 0x79 /* bus B node P2 processor 1 */ - - -/* top of memory registers */ -#define VOYAGER_WCBIC_TOM_L 0x4 -#define VOYAGER_WCBIC_TOM_H 0x5 - -/* register defines for Voyager Memory Contol (VMC) - * these are present on L4 machines only */ -#define VOYAGER_VMC1 0x81 -#define VOYAGER_VMC2 0x91 -#define VOYAGER_VMC3 0xa1 -#define VOYAGER_VMC4 0xb1 - -/* VMC Ports */ -#define VOYAGER_VMC_MEMORY_SETUP 0x9 -# define VMC_Interleaving 0x01 -# define VMC_4Way 0x02 -# define VMC_EvenCacheLines 0x04 -# define VMC_HighLine 0x08 -# define VMC_Start0_Enable 0x20 -# define VMC_Start1_Enable 0x40 -# define VMC_Vremap 0x80 -#define VOYAGER_VMC_BANK_DENSITY 0xa -# define VMC_BANK_EMPTY 0 -# define VMC_BANK_4MB 1 -# define VMC_BANK_16MB 2 -# define VMC_BANK_64MB 3 -# define VMC_BANK0_MASK 0x03 -# define VMC_BANK1_MASK 0x0C -# define VMC_BANK2_MASK 0x30 -# define VMC_BANK3_MASK 0xC0 - -/* Magellan Memory Controller (MMC) defines - present on L5 */ -#define VOYAGER_MMC_ASIC_ID 1 -/* the two memory modules corresponding to memory cards in the system */ -#define VOYAGER_MMC_MEMORY0_MODULE 0x14 -#define VOYAGER_MMC_MEMORY1_MODULE 0x15 -/* the Magellan Memory Address (MMA) defines */ -#define VOYAGER_MMA_ASIC_ID 2 - -/* Submodule number for the Quad Baseboard */ -#define VOYAGER_QUAD_BASEBOARD 1 - -/* ASIC defines for the Quad Baseboard */ -#define VOYAGER_QUAD_QDATA0 1 -#define VOYAGER_QUAD_QDATA1 2 -#define VOYAGER_QUAD_QABC 3 - -/* Useful areas in extended CMOS */ -#define VOYAGER_PROCESSOR_PRESENT_MASK 0x88a -#define VOYAGER_MEMORY_CLICKMAP 0xa23 -#define VOYAGER_DUMP_LOCATION 0xb1a - -/* SUS In Control bit - used to tell SUS that we don't need to be - * babysat anymore */ -#define VOYAGER_SUS_IN_CONTROL_PORT 0x3ff -# define VOYAGER_IN_CONTROL_FLAG 0x80 - -/* Voyager PSI defines */ -#define VOYAGER_PSI_STATUS_REG 0x08 -# define PSI_DC_FAIL 0x01 -# define PSI_MON 0x02 -# define PSI_FAULT 0x04 -# define PSI_ALARM 0x08 -# define PSI_CURRENT 0x10 -# define PSI_DVM 0x20 -# define PSI_PSCFAULT 0x40 -# define PSI_STAT_CHG 0x80 - -#define VOYAGER_PSI_SUPPLY_REG 0x8000 - /* read */ -# define PSI_FAIL_DC 0x01 -# define PSI_FAIL_AC 0x02 -# define PSI_MON_INT 0x04 -# define PSI_SWITCH_OFF 0x08 -# define PSI_HX_OFF 0x10 -# define PSI_SECURITY 0x20 -# define PSI_CMOS_BATT_LOW 0x40 -# define PSI_CMOS_BATT_FAIL 0x80 - /* write */ -# define PSI_CLR_SWITCH_OFF 0x13 -# define PSI_CLR_HX_OFF 0x14 -# define PSI_CLR_CMOS_BATT_FAIL 0x17 - -#define VOYAGER_PSI_MASK 0x8001 -# define PSI_MASK_MASK 0x10 - -#define VOYAGER_PSI_AC_FAIL_REG 0x8004 -#define AC_FAIL_STAT_CHANGE 0x80 - -#define VOYAGER_PSI_GENERAL_REG 0x8007 - /* read */ -# define PSI_SWITCH_ON 0x01 -# define PSI_SWITCH_ENABLED 0x02 -# define PSI_ALARM_ENABLED 0x08 -# define PSI_SECURE_ENABLED 0x10 -# define PSI_COLD_RESET 0x20 -# define PSI_COLD_START 0x80 - /* write */ -# define PSI_POWER_DOWN 0x10 -# define PSI_SWITCH_DISABLE 0x01 -# define PSI_SWITCH_ENABLE 0x11 -# define PSI_CLEAR 0x12 -# define PSI_ALARM_DISABLE 0x03 -# define PSI_ALARM_ENABLE 0x13 -# define PSI_CLEAR_COLD_RESET 0x05 -# define PSI_SET_COLD_RESET 0x15 -# define PSI_CLEAR_COLD_START 0x07 -# define PSI_SET_COLD_START 0x17 - - - -struct voyager_bios_info { - __u8 len; - __u8 major; - __u8 minor; - __u8 debug; - __u8 num_classes; - __u8 class_1; - __u8 class_2; -}; - -/* The following structures and definitions are for the Kernel/SUS - * interface these are needed to find out how SUS initialised any Quad - * boards in the system */ - -#define NUMBER_OF_MC_BUSSES 2 -#define SLOTS_PER_MC_BUS 8 -#define MAX_CPUS 16 /* 16 way CPU system */ -#define MAX_PROCESSOR_BOARDS 4 /* 4 processor slot system */ -#define MAX_CACHE_LEVELS 4 /* # of cache levels supported */ -#define MAX_SHARED_CPUS 4 /* # of CPUs that can share a LARC */ -#define NUMBER_OF_POS_REGS 8 - -typedef struct { - __u8 MC_Slot; - __u8 POS_Values[NUMBER_OF_POS_REGS]; -} __attribute__((packed)) MC_SlotInformation_t; - -struct QuadDescription { - __u8 Type; /* for type 0 (DYADIC or MONADIC) all fields - * will be zero except for slot */ - __u8 StructureVersion; - __u32 CPI_BaseAddress; - __u32 LARC_BankSize; - __u32 LocalMemoryStateBits; - __u8 Slot; /* Processor slots 1 - 4 */ -} __attribute__((packed)); - -struct ProcBoardInfo { - __u8 Type; - __u8 StructureVersion; - __u8 NumberOfBoards; - struct QuadDescription QuadData[MAX_PROCESSOR_BOARDS]; -} __attribute__((packed)); - -struct CacheDescription { - __u8 Level; - __u32 TotalSize; - __u16 LineSize; - __u8 Associativity; - __u8 CacheType; - __u8 WriteType; - __u8 Number_CPUs_SharedBy; - __u8 Shared_CPUs_Hardware_IDs[MAX_SHARED_CPUS]; - -} __attribute__((packed)); - -struct CPU_Description { - __u8 CPU_HardwareId; - char *FRU_String; - __u8 NumberOfCacheLevels; - struct CacheDescription CacheLevelData[MAX_CACHE_LEVELS]; -} __attribute__((packed)); - -struct CPU_Info { - __u8 Type; - __u8 StructureVersion; - __u8 NumberOf_CPUs; - struct CPU_Description CPU_Data[MAX_CPUS]; -} __attribute__((packed)); - - -/* - * This structure will be used by SUS and the OS. - * The assumption about this structure is that no blank space is - * packed in it by our friend the compiler. - */ -typedef struct { - __u8 Mailbox_SUS; /* Written to by SUS to give - commands/response to the OS */ - __u8 Mailbox_OS; /* Written to by the OS to give - commands/response to SUS */ - __u8 SUS_MailboxVersion; /* Tells the OS which iteration of the - interface SUS supports */ - __u8 OS_MailboxVersion; /* Tells SUS which iteration of the - interface the OS supports */ - __u32 OS_Flags; /* Flags set by the OS as info for - SUS */ - __u32 SUS_Flags; /* Flags set by SUS as info - for the OS */ - __u32 WatchDogPeriod; /* Watchdog period (in seconds) which - the DP uses to see if the OS - is dead */ - __u32 WatchDogCount; /* Updated by the OS on every tic. */ - __u32 MemoryFor_SUS_ErrorLog; /* Flat 32 bit address which tells SUS - where to stuff the SUS error log - on a dump */ - MC_SlotInformation_t MC_SlotInfo[NUMBER_OF_MC_BUSSES*SLOTS_PER_MC_BUS]; - /* Storage for MCA POS data */ - /* All new SECOND_PASS_INTERFACE fields added from this point */ - struct ProcBoardInfo *BoardData; - struct CPU_Info *CPU_Data; - /* All new fields must be added from this point */ -} Voyager_KernelSUS_Mbox_t; - -/* structure for finding the right memory address to send a QIC CPI to */ -struct voyager_qic_cpi { - /* Each cache line (32 bytes) can trigger a cpi. The cpi - * read/write may occur anywhere in the cache line---pick the - * middle to be safe */ - struct { - __u32 pad1[3]; - __u32 cpi; - __u32 pad2[4]; - } qic_cpi[8]; -}; - -struct voyager_status { - __u32 power_fail:1; - __u32 switch_off:1; - __u32 request_from_kernel:1; -}; - -struct voyager_psi_regs { - __u8 cat_id; - __u8 cat_dev; - __u8 cat_control; - __u8 subaddr; - __u8 dummy4; - __u8 checkbit; - __u8 subaddr_low; - __u8 subaddr_high; - __u8 intstatus; - __u8 stat1; - __u8 stat3; - __u8 fault; - __u8 tms; - __u8 gen; - __u8 sysconf; - __u8 dummy15; -}; - -struct voyager_psi_subregs { - __u8 supply; - __u8 mask; - __u8 present; - __u8 DCfail; - __u8 ACfail; - __u8 fail; - __u8 UPSfail; - __u8 genstatus; -}; - -struct voyager_psi { - struct voyager_psi_regs regs; - struct voyager_psi_subregs subregs; -}; - -struct voyager_SUS { -#define VOYAGER_DUMP_BUTTON_NMI 0x1 -#define VOYAGER_SUS_VALID 0x2 -#define VOYAGER_SYSINT_COMPLETE 0x3 - __u8 SUS_mbox; -#define VOYAGER_NO_COMMAND 0x0 -#define VOYAGER_IGNORE_DUMP 0x1 -#define VOYAGER_DO_DUMP 0x2 -#define VOYAGER_SYSINT_HANDSHAKE 0x3 -#define VOYAGER_DO_MEM_DUMP 0x4 -#define VOYAGER_SYSINT_WAS_RECOVERED 0x5 - __u8 kernel_mbox; -#define VOYAGER_MAILBOX_VERSION 0x10 - __u8 SUS_version; - __u8 kernel_version; -#define VOYAGER_OS_HAS_SYSINT 0x1 -#define VOYAGER_OS_IN_PROGRESS 0x2 -#define VOYAGER_UPDATING_WDPERIOD 0x4 - __u32 kernel_flags; -#define VOYAGER_SUS_BOOTING 0x1 -#define VOYAGER_SUS_IN_PROGRESS 0x2 - __u32 SUS_flags; - __u32 watchdog_period; - __u32 watchdog_count; - __u32 SUS_errorlog; - /* lots of system configuration stuff under here */ -}; - -/* Variables exported by voyager_smp */ -extern __u32 voyager_extended_vic_processors; -extern __u32 voyager_allowed_boot_processors; -extern __u32 voyager_quad_processors; -extern struct voyager_qic_cpi *voyager_quad_cpi_addr[NR_CPUS]; -extern struct voyager_SUS *voyager_SUS; - -/* variables exported always */ -extern struct task_struct *voyager_thread; -extern int voyager_level; -extern struct voyager_status voyager_status; - -/* functions exported by the voyager and voyager_smp modules */ -extern int voyager_cat_readb(__u8 module, __u8 asic, int reg); -extern void voyager_cat_init(void); -extern void voyager_detect(struct voyager_bios_info *); -extern void voyager_trap_init(void); -extern void voyager_setup_irqs(void); -extern int voyager_memory_detect(int region, __u32 *addr, __u32 *length); -extern void voyager_smp_intr_init(void); -extern __u8 voyager_extended_cmos_read(__u16 cmos_address); -extern void voyager_smp_dump(void); -extern void voyager_timer_interrupt(void); -extern void smp_local_timer_interrupt(void); -extern void voyager_power_off(void); -extern void smp_voyager_power_off(void *dummy); -extern void voyager_restart(void); -extern void voyager_cat_power_off(void); -extern void voyager_cat_do_common_interrupt(void); -extern void voyager_handle_nmi(void); -extern void voyager_smp_intr_init(void); -/* Commands for the following are */ -#define VOYAGER_PSI_READ 0 -#define VOYAGER_PSI_WRITE 1 -#define VOYAGER_PSI_SUBREAD 2 -#define VOYAGER_PSI_SUBWRITE 3 -extern void voyager_cat_psi(__u8, __u16, __u8 *); Index: linux-2.6-tip/arch/x86/include/asm/xen/events.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/xen/events.h +++ linux-2.6-tip/arch/x86/include/asm/xen/events.h @@ -15,10 +15,4 @@ static inline int xen_irqs_disabled(stru return raw_irqs_disabled_flags(regs->flags); } -static inline void xen_do_IRQ(int irq, struct pt_regs *regs) -{ - regs->orig_ax = ~irq; - do_IRQ(regs); -} - #endif /* _ASM_X86_XEN_EVENTS_H */ Index: linux-2.6-tip/arch/x86/include/asm/xen/hypervisor.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/xen/hypervisor.h +++ linux-2.6-tip/arch/x86/include/asm/xen/hypervisor.h @@ -38,22 +38,30 @@ extern struct shared_info *HYPERVISOR_sh extern struct start_info *xen_start_info; enum xen_domain_type { - XEN_NATIVE, - XEN_PV_DOMAIN, - XEN_HVM_DOMAIN, + XEN_NATIVE, /* running on bare hardware */ + XEN_PV_DOMAIN, /* running in a PV domain */ + XEN_HVM_DOMAIN, /* running in a Xen hvm domain */ }; -extern enum xen_domain_type xen_domain_type; - #ifdef CONFIG_XEN -#define xen_domain() (xen_domain_type != XEN_NATIVE) +extern enum xen_domain_type xen_domain_type; #else -#define xen_domain() (0) +#define xen_domain_type XEN_NATIVE #endif -#define xen_pv_domain() (xen_domain() && xen_domain_type == XEN_PV_DOMAIN) -#define xen_hvm_domain() (xen_domain() && xen_domain_type == XEN_HVM_DOMAIN) - -#define xen_initial_domain() (xen_pv_domain() && xen_start_info->flags & SIF_INITDOMAIN) +#define xen_domain() (xen_domain_type != XEN_NATIVE) +#define xen_pv_domain() (xen_domain() && \ + xen_domain_type == XEN_PV_DOMAIN) +#define xen_hvm_domain() (xen_domain() && \ + xen_domain_type == XEN_HVM_DOMAIN) + +#ifdef CONFIG_XEN_DOM0 +#include + +#define xen_initial_domain() (xen_pv_domain() && \ + xen_start_info->flags & SIF_INITDOMAIN) +#else /* !CONFIG_XEN_DOM0 */ +#define xen_initial_domain() (0) +#endif /* CONFIG_XEN_DOM0 */ #endif /* _ASM_X86_XEN_HYPERVISOR_H */ Index: linux-2.6-tip/arch/x86/kernel/Makefile =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/Makefile +++ linux-2.6-tip/arch/x86/kernel/Makefile @@ -23,11 +23,12 @@ nostackp := $(call cc-option, -fno-stack CFLAGS_vsyscall_64.o := $(PROFILING) -g0 $(nostackp) CFLAGS_hpet.o := $(nostackp) CFLAGS_tsc.o := $(nostackp) +CFLAGS_paravirt.o := $(nostackp) obj-y := process_$(BITS).o signal.o entry_$(BITS).o obj-y += traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o obj-y += time_$(BITS).o ioport.o ldt.o dumpstack.o -obj-y += setup.o i8259.o irqinit_$(BITS).o setup_percpu.o +obj-y += setup.o i8259.o irqinit_$(BITS).o obj-$(CONFIG_X86_VISWS) += visws_quirks.o obj-$(CONFIG_X86_32) += probe_roms_32.o obj-$(CONFIG_X86_32) += sys_i386_32.o i386_ksyms_32.o @@ -49,30 +50,26 @@ obj-y += step.o obj-$(CONFIG_STACKTRACE) += stacktrace.o obj-y += cpu/ obj-y += acpi/ -obj-$(CONFIG_X86_BIOS_REBOOT) += reboot.o +obj-y += reboot.o obj-$(CONFIG_MCA) += mca_32.o obj-$(CONFIG_X86_MSR) += msr.o obj-$(CONFIG_X86_CPUID) += cpuid.o obj-$(CONFIG_PCI) += early-quirks.o apm-y := apm_32.o obj-$(CONFIG_APM) += apm.o -obj-$(CONFIG_X86_SMP) += smp.o -obj-$(CONFIG_X86_SMP) += smpboot.o tsc_sync.o ipi.o tlb_$(BITS).o -obj-$(CONFIG_X86_32_SMP) += smpcommon.o -obj-$(CONFIG_X86_64_SMP) += tsc_sync.o smpcommon.o +obj-$(CONFIG_SMP) += smp.o +obj-$(CONFIG_SMP) += smpboot.o tsc_sync.o +obj-$(CONFIG_SMP) += setup_percpu.o +obj-$(CONFIG_X86_64_SMP) += tsc_sync.o obj-$(CONFIG_X86_TRAMPOLINE) += trampoline_$(BITS).o obj-$(CONFIG_X86_MPPARSE) += mpparse.o -obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o -obj-$(CONFIG_X86_IO_APIC) += io_apic.o +obj-y += apic/ obj-$(CONFIG_X86_REBOOTFIXUPS) += reboot_fixups_32.o obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o obj-$(CONFIG_KEXEC) += machine_kexec_$(BITS).o obj-$(CONFIG_KEXEC) += relocate_kernel_$(BITS).o crash.o obj-$(CONFIG_CRASH_DUMP) += crash_dump_$(BITS).o -obj-$(CONFIG_X86_NUMAQ) += numaq_32.o -obj-$(CONFIG_X86_ES7000) += es7000_32.o -obj-$(CONFIG_X86_SUMMIT_NUMA) += summit_32.o obj-y += vsmp_64.o obj-$(CONFIG_KPROBES) += kprobes.o obj-$(CONFIG_MODULES) += module_$(BITS).o @@ -109,21 +106,18 @@ obj-$(CONFIG_MICROCODE) += microcode.o obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o -obj-$(CONFIG_SWIOTLB) += pci-swiotlb_64.o # NB rename without _64 +obj-$(CONFIG_SWIOTLB) += pci-swiotlb.o ### # 64 bit specific files ifeq ($(CONFIG_X86_64),y) - obj-y += genapic_64.o genapic_flat_64.o genx2apic_uv_x.o tlb_uv.o - obj-y += bios_uv.o uv_irq.o uv_sysfs.o - obj-y += genx2apic_cluster.o - obj-y += genx2apic_phys.o - obj-$(CONFIG_X86_PM_TIMER) += pmtimer_64.o - obj-$(CONFIG_AUDIT) += audit_64.o - - obj-$(CONFIG_GART_IOMMU) += pci-gart_64.o aperture_64.o - obj-$(CONFIG_CALGARY_IOMMU) += pci-calgary_64.o tce_64.o - obj-$(CONFIG_AMD_IOMMU) += amd_iommu_init.o amd_iommu.o + obj-$(CONFIG_X86_UV) += tlb_uv.o bios_uv.o uv_irq.o uv_sysfs.o + obj-$(CONFIG_X86_PM_TIMER) += pmtimer_64.o + obj-$(CONFIG_AUDIT) += audit_64.o + + obj-$(CONFIG_GART_IOMMU) += pci-gart_64.o aperture_64.o + obj-$(CONFIG_CALGARY_IOMMU) += pci-calgary_64.o tce_64.o + obj-$(CONFIG_AMD_IOMMU) += amd_iommu_init.o amd_iommu.o - obj-$(CONFIG_PCI_MMCONFIG) += mmconf-fam10h_64.o + obj-$(CONFIG_PCI_MMCONFIG) += mmconf-fam10h_64.o endif Index: linux-2.6-tip/arch/x86/kernel/acpi/boot.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/acpi/boot.c +++ linux-2.6-tip/arch/x86/kernel/acpi/boot.c @@ -37,15 +37,10 @@ #include #include #include -#include #include #include #include -#ifdef CONFIG_X86_LOCAL_APIC -# include -#endif - static int __initdata acpi_force = 0; u32 acpi_rsdt_forced; #ifdef CONFIG_ACPI @@ -56,16 +51,7 @@ int acpi_disabled = 1; EXPORT_SYMBOL(acpi_disabled); #ifdef CONFIG_X86_64 - -#include - -#else /* X86 */ - -#ifdef CONFIG_X86_LOCAL_APIC -#include -#include -#endif /* CONFIG_X86_LOCAL_APIC */ - +# include #endif /* X86 */ #define BAD_MADT_ENTRY(entry, end) ( \ @@ -121,35 +107,18 @@ enum acpi_irq_model_id acpi_irq_model = */ char *__init __acpi_map_table(unsigned long phys, unsigned long size) { - unsigned long base, offset, mapped_size; - int idx; if (!phys || !size) return NULL; - if (phys+size <= (max_low_pfn_mapped << PAGE_SHIFT)) - return __va(phys); - - offset = phys & (PAGE_SIZE - 1); - mapped_size = PAGE_SIZE - offset; - clear_fixmap(FIX_ACPI_END); - set_fixmap(FIX_ACPI_END, phys); - base = fix_to_virt(FIX_ACPI_END); - - /* - * Most cases can be covered by the below. - */ - idx = FIX_ACPI_END; - while (mapped_size < size) { - if (--idx < FIX_ACPI_BEGIN) - return NULL; /* cannot handle this */ - phys += PAGE_SIZE; - clear_fixmap(idx); - set_fixmap(idx, phys); - mapped_size += PAGE_SIZE; - } + return early_ioremap(phys, size); +} +void __init __acpi_unmap_table(char *map, unsigned long size) +{ + if (!map || !size) + return; - return ((unsigned char *)base + offset); + early_iounmap(map, size); } #ifdef CONFIG_PCI_MMCONFIG @@ -239,7 +208,8 @@ static int __init acpi_parse_madt(struct madt->address); } - acpi_madt_oem_check(madt->header.oem_id, madt->header.oem_table_id); + default_acpi_madt_oem_check(madt->header.oem_id, + madt->header.oem_table_id); return 0; } @@ -884,7 +854,7 @@ static struct { DECLARE_BITMAP(pin_programmed, MP_MAX_IOAPIC_PIN + 1); } mp_ioapic_routing[MAX_IO_APICS]; -static int mp_find_ioapic(int gsi) +int mp_find_ioapic(int gsi) { int i = 0; @@ -899,6 +869,16 @@ static int mp_find_ioapic(int gsi) return -1; } +int mp_find_ioapic_pin(int ioapic, int gsi) +{ + if (WARN_ON(ioapic == -1)) + return -1; + if (WARN_ON(gsi > mp_ioapic_routing[ioapic].gsi_end)) + return -1; + + return gsi - mp_ioapic_routing[ioapic].gsi_base; +} + static u8 __init uniq_ioapic_id(u8 id) { #ifdef CONFIG_X86_32 @@ -912,8 +892,8 @@ static u8 __init uniq_ioapic_id(u8 id) DECLARE_BITMAP(used, 256); bitmap_zero(used, 256); for (i = 0; i < nr_ioapics; i++) { - struct mp_config_ioapic *ia = &mp_ioapics[i]; - __set_bit(ia->mp_apicid, used); + struct mpc_ioapic *ia = &mp_ioapics[i]; + __set_bit(ia->apicid, used); } if (!test_bit(id, used)) return id; @@ -945,29 +925,29 @@ void __init mp_register_ioapic(int id, u idx = nr_ioapics; - mp_ioapics[idx].mp_type = MP_IOAPIC; - mp_ioapics[idx].mp_flags = MPC_APIC_USABLE; - mp_ioapics[idx].mp_apicaddr = address; + mp_ioapics[idx].type = MP_IOAPIC; + mp_ioapics[idx].flags = MPC_APIC_USABLE; + mp_ioapics[idx].apicaddr = address; set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address); - mp_ioapics[idx].mp_apicid = uniq_ioapic_id(id); + mp_ioapics[idx].apicid = uniq_ioapic_id(id); #ifdef CONFIG_X86_32 - mp_ioapics[idx].mp_apicver = io_apic_get_version(idx); + mp_ioapics[idx].apicver = io_apic_get_version(idx); #else - mp_ioapics[idx].mp_apicver = 0; + mp_ioapics[idx].apicver = 0; #endif /* * Build basic GSI lookup table to facilitate gsi->io_apic lookups * and to prevent reprogramming of IOAPIC pins (PCI GSIs). */ - mp_ioapic_routing[idx].apic_id = mp_ioapics[idx].mp_apicid; + mp_ioapic_routing[idx].apic_id = mp_ioapics[idx].apicid; mp_ioapic_routing[idx].gsi_base = gsi_base; mp_ioapic_routing[idx].gsi_end = gsi_base + io_apic_get_redir_entries(idx); - printk(KERN_INFO "IOAPIC[%d]: apic_id %d, version %d, address 0x%lx, " - "GSI %d-%d\n", idx, mp_ioapics[idx].mp_apicid, - mp_ioapics[idx].mp_apicver, mp_ioapics[idx].mp_apicaddr, + printk(KERN_INFO "IOAPIC[%d]: apic_id %d, version %d, address 0x%x, " + "GSI %d-%d\n", idx, mp_ioapics[idx].apicid, + mp_ioapics[idx].apicver, mp_ioapics[idx].apicaddr, mp_ioapic_routing[idx].gsi_base, mp_ioapic_routing[idx].gsi_end); nr_ioapics++; @@ -996,19 +976,19 @@ int __init acpi_probe_gsi(void) return max_gsi + 1; } -static void assign_to_mp_irq(struct mp_config_intsrc *m, - struct mp_config_intsrc *mp_irq) +static void assign_to_mp_irq(struct mpc_intsrc *m, + struct mpc_intsrc *mp_irq) { - memcpy(mp_irq, m, sizeof(struct mp_config_intsrc)); + memcpy(mp_irq, m, sizeof(struct mpc_intsrc)); } -static int mp_irq_cmp(struct mp_config_intsrc *mp_irq, - struct mp_config_intsrc *m) +static int mp_irq_cmp(struct mpc_intsrc *mp_irq, + struct mpc_intsrc *m) { - return memcmp(mp_irq, m, sizeof(struct mp_config_intsrc)); + return memcmp(mp_irq, m, sizeof(struct mpc_intsrc)); } -static void save_mp_irq(struct mp_config_intsrc *m) +static void save_mp_irq(struct mpc_intsrc *m) { int i; @@ -1026,7 +1006,7 @@ void __init mp_override_legacy_irq(u8 bu { int ioapic; int pin; - struct mp_config_intsrc mp_irq; + struct mpc_intsrc mp_irq; /* * Convert 'gsi' to 'ioapic.pin'. @@ -1034,7 +1014,7 @@ void __init mp_override_legacy_irq(u8 bu ioapic = mp_find_ioapic(gsi); if (ioapic < 0) return; - pin = gsi - mp_ioapic_routing[ioapic].gsi_base; + pin = mp_find_ioapic_pin(ioapic, gsi); /* * TBD: This check is for faulty timer entries, where the override @@ -1044,13 +1024,13 @@ void __init mp_override_legacy_irq(u8 bu if ((bus_irq == 0) && (trigger == 3)) trigger = 1; - mp_irq.mp_type = MP_INTSRC; - mp_irq.mp_irqtype = mp_INT; - mp_irq.mp_irqflag = (trigger << 2) | polarity; - mp_irq.mp_srcbus = MP_ISA_BUS; - mp_irq.mp_srcbusirq = bus_irq; /* IRQ */ - mp_irq.mp_dstapic = mp_ioapics[ioapic].mp_apicid; /* APIC ID */ - mp_irq.mp_dstirq = pin; /* INTIN# */ + mp_irq.type = MP_INTSRC; + mp_irq.irqtype = mp_INT; + mp_irq.irqflag = (trigger << 2) | polarity; + mp_irq.srcbus = MP_ISA_BUS; + mp_irq.srcbusirq = bus_irq; /* IRQ */ + mp_irq.dstapic = mp_ioapics[ioapic].apicid; /* APIC ID */ + mp_irq.dstirq = pin; /* INTIN# */ save_mp_irq(&mp_irq); } @@ -1060,7 +1040,7 @@ void __init mp_config_acpi_legacy_irqs(v int i; int ioapic; unsigned int dstapic; - struct mp_config_intsrc mp_irq; + struct mpc_intsrc mp_irq; #if defined (CONFIG_MCA) || defined (CONFIG_EISA) /* @@ -1085,7 +1065,7 @@ void __init mp_config_acpi_legacy_irqs(v ioapic = mp_find_ioapic(0); if (ioapic < 0) return; - dstapic = mp_ioapics[ioapic].mp_apicid; + dstapic = mp_ioapics[ioapic].apicid; /* * Use the default configuration for the IRQs 0-15. Unless @@ -1095,16 +1075,14 @@ void __init mp_config_acpi_legacy_irqs(v int idx; for (idx = 0; idx < mp_irq_entries; idx++) { - struct mp_config_intsrc *irq = mp_irqs + idx; + struct mpc_intsrc *irq = mp_irqs + idx; /* Do we already have a mapping for this ISA IRQ? */ - if (irq->mp_srcbus == MP_ISA_BUS - && irq->mp_srcbusirq == i) + if (irq->srcbus == MP_ISA_BUS && irq->srcbusirq == i) break; /* Do we already have a mapping for this IOAPIC pin */ - if (irq->mp_dstapic == dstapic && - irq->mp_dstirq == i) + if (irq->dstapic == dstapic && irq->dstirq == i) break; } @@ -1113,13 +1091,13 @@ void __init mp_config_acpi_legacy_irqs(v continue; /* IRQ already used */ } - mp_irq.mp_type = MP_INTSRC; - mp_irq.mp_irqflag = 0; /* Conforming */ - mp_irq.mp_srcbus = MP_ISA_BUS; - mp_irq.mp_dstapic = dstapic; - mp_irq.mp_irqtype = mp_INT; - mp_irq.mp_srcbusirq = i; /* Identity mapped */ - mp_irq.mp_dstirq = i; + mp_irq.type = MP_INTSRC; + mp_irq.irqflag = 0; /* Conforming */ + mp_irq.srcbus = MP_ISA_BUS; + mp_irq.dstapic = dstapic; + mp_irq.irqtype = mp_INT; + mp_irq.srcbusirq = i; /* Identity mapped */ + mp_irq.dstirq = i; save_mp_irq(&mp_irq); } @@ -1156,7 +1134,7 @@ int mp_register_gsi(u32 gsi, int trigger return gsi; } - ioapic_pin = gsi - mp_ioapic_routing[ioapic].gsi_base; + ioapic_pin = mp_find_ioapic_pin(ioapic, gsi); #ifdef CONFIG_X86_32 if (ioapic_renumber_irq) @@ -1230,22 +1208,22 @@ int mp_config_acpi_gsi(unsigned char num u32 gsi, int triggering, int polarity) { #ifdef CONFIG_X86_MPPARSE - struct mp_config_intsrc mp_irq; + struct mpc_intsrc mp_irq; int ioapic; if (!acpi_ioapic) return 0; /* print the entry should happen on mptable identically */ - mp_irq.mp_type = MP_INTSRC; - mp_irq.mp_irqtype = mp_INT; - mp_irq.mp_irqflag = (triggering == ACPI_EDGE_SENSITIVE ? 4 : 0x0c) | + mp_irq.type = MP_INTSRC; + mp_irq.irqtype = mp_INT; + mp_irq.irqflag = (triggering == ACPI_EDGE_SENSITIVE ? 4 : 0x0c) | (polarity == ACPI_ACTIVE_HIGH ? 1 : 3); - mp_irq.mp_srcbus = number; - mp_irq.mp_srcbusirq = (((devfn >> 3) & 0x1f) << 2) | ((pin - 1) & 3); + mp_irq.srcbus = number; + mp_irq.srcbusirq = (((devfn >> 3) & 0x1f) << 2) | ((pin - 1) & 3); ioapic = mp_find_ioapic(gsi); - mp_irq.mp_dstapic = mp_ioapic_routing[ioapic].apic_id; - mp_irq.mp_dstirq = gsi - mp_ioapic_routing[ioapic].gsi_base; + mp_irq.dstapic = mp_ioapic_routing[ioapic].apic_id; + mp_irq.dstirq = mp_find_ioapic_pin(ioapic, gsi); save_mp_irq(&mp_irq); #endif @@ -1372,7 +1350,7 @@ static void __init acpi_process_madt(voi if (!error) { acpi_lapic = 1; -#ifdef CONFIG_X86_GENERICARCH +#ifdef CONFIG_X86_BIGSMP generic_bigsmp_probe(); #endif /* @@ -1384,9 +1362,8 @@ static void __init acpi_process_madt(voi acpi_ioapic = 1; smp_found_config = 1; -#ifdef CONFIG_X86_32 - setup_apic_routing(); -#endif + if (apic->setup_apic_routing) + apic->setup_apic_routing(); } } if (error == -EINVAL) { Index: linux-2.6-tip/arch/x86/kernel/acpi/processor.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/acpi/processor.c +++ linux-2.6-tip/arch/x86/kernel/acpi/processor.c @@ -43,6 +43,11 @@ static void init_intel_pdc(struct acpi_p buf[0] = ACPI_PDC_REVISION_ID; buf[1] = 1; buf[2] = ACPI_PDC_C_CAPABILITY_SMP; + /* + * If mwait/monitor is unsupported, C2/C3_FFH will be disabled. + */ + if (!cpu_has(c, X86_FEATURE_MWAIT)) + buf[2] &= ~ACPI_PDC_C_C2C3_FFH; /* * The default of PDC_SMP_T_SWCOORD bit is set for intel x86 cpu so Index: linux-2.6-tip/arch/x86/kernel/acpi/realmode/wakeup.S =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/acpi/realmode/wakeup.S +++ linux-2.6-tip/arch/x86/kernel/acpi/realmode/wakeup.S @@ -3,8 +3,8 @@ */ #include #include -#include -#include +#include +#include #include .code16 Index: linux-2.6-tip/arch/x86/kernel/acpi/sleep.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/acpi/sleep.c +++ linux-2.6-tip/arch/x86/kernel/acpi/sleep.c @@ -101,6 +101,7 @@ int acpi_save_state_mem(void) stack_start.sp = temp_stack + sizeof(temp_stack); early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(smp_processor_id()); + initial_gs = per_cpu_offset(smp_processor_id()); #endif initial_code = (unsigned long)wakeup_long64; saved_magic = 0x123456789abcdef0; Index: linux-2.6-tip/arch/x86/kernel/acpi/wakeup_32.S =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/acpi/wakeup_32.S +++ linux-2.6-tip/arch/x86/kernel/acpi/wakeup_32.S @@ -1,7 +1,7 @@ .section .text.page_aligned #include #include -#include +#include # Copyright 2003, 2008 Pavel Machek , distribute under GPLv2 Index: linux-2.6-tip/arch/x86/kernel/acpi/wakeup_64.S =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/acpi/wakeup_64.S +++ linux-2.6-tip/arch/x86/kernel/acpi/wakeup_64.S @@ -1,8 +1,8 @@ .text #include #include -#include -#include +#include +#include #include #include Index: linux-2.6-tip/arch/x86/kernel/alternative.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/alternative.c +++ linux-2.6-tip/arch/x86/kernel/alternative.c @@ -414,9 +414,17 @@ void __init alternative_instructions(voi that might execute the to be patched code. Other CPUs are not running. */ stop_nmi(); -#ifdef CONFIG_X86_MCE - stop_mce(); -#endif + + /* + * Don't stop machine check exceptions while patching. + * MCEs only happen when something got corrupted and in this + * case we must do something about the corruption. + * Ignoring it is worse than a unlikely patching race. + * Also machine checks tend to be broadcast and if one CPU + * goes into machine check the others follow quickly, so we don't + * expect a machine check to cause undue problems during to code + * patching. + */ apply_alternatives(__alt_instructions, __alt_instructions_end); @@ -456,9 +464,6 @@ void __init alternative_instructions(voi (unsigned long)__smp_locks_end); restart_nmi(); -#ifdef CONFIG_X86_MCE - restart_mce(); -#endif } /** Index: linux-2.6-tip/arch/x86/kernel/amd_iommu.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/amd_iommu.c +++ linux-2.6-tip/arch/x86/kernel/amd_iommu.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #ifdef CONFIG_IOMMU_API #include @@ -1297,8 +1298,10 @@ static void __unmap_single(struct amd_io /* * The exported map_single function for dma_ops. */ -static dma_addr_t map_single(struct device *dev, phys_addr_t paddr, - size_t size, int dir) +static dma_addr_t map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction dir, + struct dma_attrs *attrs) { unsigned long flags; struct amd_iommu *iommu; @@ -1306,6 +1309,7 @@ static dma_addr_t map_single(struct devi u16 devid; dma_addr_t addr; u64 dma_mask; + phys_addr_t paddr = page_to_phys(page) + offset; INC_STATS_COUNTER(cnt_map_single); @@ -1340,8 +1344,8 @@ out: /* * The exported unmap_single function for dma_ops. */ -static void unmap_single(struct device *dev, dma_addr_t dma_addr, - size_t size, int dir) +static void unmap_page(struct device *dev, dma_addr_t dma_addr, size_t size, + enum dma_data_direction dir, struct dma_attrs *attrs) { unsigned long flags; struct amd_iommu *iommu; @@ -1390,7 +1394,8 @@ static int map_sg_no_iommu(struct device * lists). */ static int map_sg(struct device *dev, struct scatterlist *sglist, - int nelems, int dir) + int nelems, enum dma_data_direction dir, + struct dma_attrs *attrs) { unsigned long flags; struct amd_iommu *iommu; @@ -1457,7 +1462,8 @@ unmap: * lists). */ static void unmap_sg(struct device *dev, struct scatterlist *sglist, - int nelems, int dir) + int nelems, enum dma_data_direction dir, + struct dma_attrs *attrs) { unsigned long flags; struct amd_iommu *iommu; @@ -1644,11 +1650,11 @@ static void prealloc_protection_domains( } } -static struct dma_mapping_ops amd_iommu_dma_ops = { +static struct dma_map_ops amd_iommu_dma_ops = { .alloc_coherent = alloc_coherent, .free_coherent = free_coherent, - .map_single = map_single, - .unmap_single = unmap_single, + .map_page = map_page, + .unmap_page = unmap_page, .map_sg = map_sg, .unmap_sg = unmap_sg, .dma_supported = amd_iommu_dma_supported, Index: linux-2.6-tip/arch/x86/kernel/apic.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/apic.c +++ /dev/null @@ -1,2223 +0,0 @@ -/* - * Local APIC handling, local APIC timers - * - * (c) 1999, 2000 Ingo Molnar - * - * Fixes - * Maciej W. Rozycki : Bits for genuine 82489DX APICs; - * thanks to Eric Gilmore - * and Rolf G. Tews - * for testing these extensively. - * Maciej W. Rozycki : Various updates and fixes. - * Mikael Pettersson : Power Management for UP-APIC. - * Pavel Machek and - * Mikael Pettersson : PM converted to driver model. - */ - -#include - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include -#include - -/* - * Sanity check - */ -#if ((SPURIOUS_APIC_VECTOR & 0x0F) != 0x0F) -# error SPURIOUS_APIC_VECTOR definition error -#endif - -#ifdef CONFIG_X86_32 -/* - * Knob to control our willingness to enable the local APIC. - * - * +1=force-enable - */ -static int force_enable_local_apic; -/* - * APIC command line parameters - */ -static int __init parse_lapic(char *arg) -{ - force_enable_local_apic = 1; - return 0; -} -early_param("lapic", parse_lapic); -/* Local APIC was disabled by the BIOS and enabled by the kernel */ -static int enabled_via_apicbase; - -#endif - -#ifdef CONFIG_X86_64 -static int apic_calibrate_pmtmr __initdata; -static __init int setup_apicpmtimer(char *s) -{ - apic_calibrate_pmtmr = 1; - notsc_setup(NULL); - return 0; -} -__setup("apicpmtimer", setup_apicpmtimer); -#endif - -#ifdef CONFIG_X86_64 -#define HAVE_X2APIC -#endif - -#ifdef HAVE_X2APIC -int x2apic; -/* x2apic enabled before OS handover */ -static int x2apic_preenabled; -static int disable_x2apic; -static __init int setup_nox2apic(char *str) -{ - disable_x2apic = 1; - setup_clear_cpu_cap(X86_FEATURE_X2APIC); - return 0; -} -early_param("nox2apic", setup_nox2apic); -#endif - -unsigned long mp_lapic_addr; -int disable_apic; -/* Disable local APIC timer from the kernel commandline or via dmi quirk */ -static int disable_apic_timer __cpuinitdata; -/* Local APIC timer works in C2 */ -int local_apic_timer_c2_ok; -EXPORT_SYMBOL_GPL(local_apic_timer_c2_ok); - -int first_system_vector = 0xfe; - -/* - * Debug level, exported for io_apic.c - */ -unsigned int apic_verbosity; - -int pic_mode; - -/* Have we found an MP table */ -int smp_found_config; - -static struct resource lapic_resource = { - .name = "Local APIC", - .flags = IORESOURCE_MEM | IORESOURCE_BUSY, -}; - -static unsigned int calibration_result; - -static int lapic_next_event(unsigned long delta, - struct clock_event_device *evt); -static void lapic_timer_setup(enum clock_event_mode mode, - struct clock_event_device *evt); -static void lapic_timer_broadcast(const struct cpumask *mask); -static void apic_pm_activate(void); - -/* - * The local apic timer can be used for any function which is CPU local. - */ -static struct clock_event_device lapic_clockevent = { - .name = "lapic", - .features = CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT - | CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_DUMMY, - .shift = 32, - .set_mode = lapic_timer_setup, - .set_next_event = lapic_next_event, - .broadcast = lapic_timer_broadcast, - .rating = 100, - .irq = -1, -}; -static DEFINE_PER_CPU(struct clock_event_device, lapic_events); - -static unsigned long apic_phys; - -/* - * Get the LAPIC version - */ -static inline int lapic_get_version(void) -{ - return GET_APIC_VERSION(apic_read(APIC_LVR)); -} - -/* - * Check, if the APIC is integrated or a separate chip - */ -static inline int lapic_is_integrated(void) -{ -#ifdef CONFIG_X86_64 - return 1; -#else - return APIC_INTEGRATED(lapic_get_version()); -#endif -} - -/* - * Check, whether this is a modern or a first generation APIC - */ -static int modern_apic(void) -{ - /* AMD systems use old APIC versions, so check the CPU */ - if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD && - boot_cpu_data.x86 >= 0xf) - return 1; - return lapic_get_version() >= 0x14; -} - -/* - * Paravirt kernels also might be using these below ops. So we still - * use generic apic_read()/apic_write(), which might be pointing to different - * ops in PARAVIRT case. - */ -void xapic_wait_icr_idle(void) -{ - while (apic_read(APIC_ICR) & APIC_ICR_BUSY) - cpu_relax(); -} - -u32 safe_xapic_wait_icr_idle(void) -{ - u32 send_status; - int timeout; - - timeout = 0; - do { - send_status = apic_read(APIC_ICR) & APIC_ICR_BUSY; - if (!send_status) - break; - udelay(100); - } while (timeout++ < 1000); - - return send_status; -} - -void xapic_icr_write(u32 low, u32 id) -{ - apic_write(APIC_ICR2, SET_APIC_DEST_FIELD(id)); - apic_write(APIC_ICR, low); -} - -static u64 xapic_icr_read(void) -{ - u32 icr1, icr2; - - icr2 = apic_read(APIC_ICR2); - icr1 = apic_read(APIC_ICR); - - return icr1 | ((u64)icr2 << 32); -} - -static struct apic_ops xapic_ops = { - .read = native_apic_mem_read, - .write = native_apic_mem_write, - .icr_read = xapic_icr_read, - .icr_write = xapic_icr_write, - .wait_icr_idle = xapic_wait_icr_idle, - .safe_wait_icr_idle = safe_xapic_wait_icr_idle, -}; - -struct apic_ops __read_mostly *apic_ops = &xapic_ops; -EXPORT_SYMBOL_GPL(apic_ops); - -#ifdef HAVE_X2APIC -static void x2apic_wait_icr_idle(void) -{ - /* no need to wait for icr idle in x2apic */ - return; -} - -static u32 safe_x2apic_wait_icr_idle(void) -{ - /* no need to wait for icr idle in x2apic */ - return 0; -} - -void x2apic_icr_write(u32 low, u32 id) -{ - wrmsrl(APIC_BASE_MSR + (APIC_ICR >> 4), ((__u64) id) << 32 | low); -} - -static u64 x2apic_icr_read(void) -{ - unsigned long val; - - rdmsrl(APIC_BASE_MSR + (APIC_ICR >> 4), val); - return val; -} - -static struct apic_ops x2apic_ops = { - .read = native_apic_msr_read, - .write = native_apic_msr_write, - .icr_read = x2apic_icr_read, - .icr_write = x2apic_icr_write, - .wait_icr_idle = x2apic_wait_icr_idle, - .safe_wait_icr_idle = safe_x2apic_wait_icr_idle, -}; -#endif - -/** - * enable_NMI_through_LVT0 - enable NMI through local vector table 0 - */ -void __cpuinit enable_NMI_through_LVT0(void) -{ - unsigned int v; - - /* unmask and set to NMI */ - v = APIC_DM_NMI; - - /* Level triggered for 82489DX (32bit mode) */ - if (!lapic_is_integrated()) - v |= APIC_LVT_LEVEL_TRIGGER; - - apic_write(APIC_LVT0, v); -} - -#ifdef CONFIG_X86_32 -/** - * get_physical_broadcast - Get number of physical broadcast IDs - */ -int get_physical_broadcast(void) -{ - return modern_apic() ? 0xff : 0xf; -} -#endif - -/** - * lapic_get_maxlvt - get the maximum number of local vector table entries - */ -int lapic_get_maxlvt(void) -{ - unsigned int v; - - v = apic_read(APIC_LVR); - /* - * - we always have APIC integrated on 64bit mode - * - 82489DXs do not report # of LVT entries - */ - return APIC_INTEGRATED(GET_APIC_VERSION(v)) ? GET_APIC_MAXLVT(v) : 2; -} - -/* - * Local APIC timer - */ - -/* Clock divisor */ -#define APIC_DIVISOR 16 - -/* - * This function sets up the local APIC timer, with a timeout of - * 'clocks' APIC bus clock. During calibration we actually call - * this function twice on the boot CPU, once with a bogus timeout - * value, second time for real. The other (noncalibrating) CPUs - * call this function only once, with the real, calibrated value. - * - * We do reads before writes even if unnecessary, to get around the - * P5 APIC double write bug. - */ -static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen) -{ - unsigned int lvtt_value, tmp_value; - - lvtt_value = LOCAL_TIMER_VECTOR; - if (!oneshot) - lvtt_value |= APIC_LVT_TIMER_PERIODIC; - if (!lapic_is_integrated()) - lvtt_value |= SET_APIC_TIMER_BASE(APIC_TIMER_BASE_DIV); - - if (!irqen) - lvtt_value |= APIC_LVT_MASKED; - - apic_write(APIC_LVTT, lvtt_value); - - /* - * Divide PICLK by 16 - */ - tmp_value = apic_read(APIC_TDCR); - apic_write(APIC_TDCR, - (tmp_value & ~(APIC_TDR_DIV_1 | APIC_TDR_DIV_TMBASE)) | - APIC_TDR_DIV_16); - - if (!oneshot) - apic_write(APIC_TMICT, clocks / APIC_DIVISOR); -} - -/* - * Setup extended LVT, AMD specific (K8, family 10h) - * - * Vector mappings are hard coded. On K8 only offset 0 (APIC500) and - * MCE interrupts are supported. Thus MCE offset must be set to 0. - * - * If mask=1, the LVT entry does not generate interrupts while mask=0 - * enables the vector. See also the BKDGs. - */ - -#define APIC_EILVT_LVTOFF_MCE 0 -#define APIC_EILVT_LVTOFF_IBS 1 - -static void setup_APIC_eilvt(u8 lvt_off, u8 vector, u8 msg_type, u8 mask) -{ - unsigned long reg = (lvt_off << 4) + APIC_EILVT0; - unsigned int v = (mask << 16) | (msg_type << 8) | vector; - - apic_write(reg, v); -} - -u8 setup_APIC_eilvt_mce(u8 vector, u8 msg_type, u8 mask) -{ - setup_APIC_eilvt(APIC_EILVT_LVTOFF_MCE, vector, msg_type, mask); - return APIC_EILVT_LVTOFF_MCE; -} - -u8 setup_APIC_eilvt_ibs(u8 vector, u8 msg_type, u8 mask) -{ - setup_APIC_eilvt(APIC_EILVT_LVTOFF_IBS, vector, msg_type, mask); - return APIC_EILVT_LVTOFF_IBS; -} -EXPORT_SYMBOL_GPL(setup_APIC_eilvt_ibs); - -/* - * Program the next event, relative to now - */ -static int lapic_next_event(unsigned long delta, - struct clock_event_device *evt) -{ - apic_write(APIC_TMICT, delta); - return 0; -} - -/* - * Setup the lapic timer in periodic or oneshot mode - */ -static void lapic_timer_setup(enum clock_event_mode mode, - struct clock_event_device *evt) -{ - unsigned long flags; - unsigned int v; - - /* Lapic used as dummy for broadcast ? */ - if (evt->features & CLOCK_EVT_FEAT_DUMMY) - return; - - local_irq_save(flags); - - switch (mode) { - case CLOCK_EVT_MODE_PERIODIC: - case CLOCK_EVT_MODE_ONESHOT: - __setup_APIC_LVTT(calibration_result, - mode != CLOCK_EVT_MODE_PERIODIC, 1); - break; - case CLOCK_EVT_MODE_UNUSED: - case CLOCK_EVT_MODE_SHUTDOWN: - v = apic_read(APIC_LVTT); - v |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR); - apic_write(APIC_LVTT, v); - apic_write(APIC_TMICT, 0xffffffff); - break; - case CLOCK_EVT_MODE_RESUME: - /* Nothing to do here */ - break; - } - - local_irq_restore(flags); -} - -/* - * Local APIC timer broadcast function - */ -static void lapic_timer_broadcast(const struct cpumask *mask) -{ -#ifdef CONFIG_SMP - send_IPI_mask(mask, LOCAL_TIMER_VECTOR); -#endif -} - -/* - * Setup the local APIC timer for this CPU. Copy the initilized values - * of the boot CPU and register the clock event in the framework. - */ -static void __cpuinit setup_APIC_timer(void) -{ - struct clock_event_device *levt = &__get_cpu_var(lapic_events); - - memcpy(levt, &lapic_clockevent, sizeof(*levt)); - levt->cpumask = cpumask_of(smp_processor_id()); - - clockevents_register_device(levt); -} - -/* - * In this functions we calibrate APIC bus clocks to the external timer. - * - * We want to do the calibration only once since we want to have local timer - * irqs syncron. CPUs connected by the same APIC bus have the very same bus - * frequency. - * - * This was previously done by reading the PIT/HPET and waiting for a wrap - * around to find out, that a tick has elapsed. I have a box, where the PIT - * readout is broken, so it never gets out of the wait loop again. This was - * also reported by others. - * - * Monitoring the jiffies value is inaccurate and the clockevents - * infrastructure allows us to do a simple substitution of the interrupt - * handler. - * - * The calibration routine also uses the pm_timer when possible, as the PIT - * happens to run way too slow (factor 2.3 on my VAIO CoreDuo, which goes - * back to normal later in the boot process). - */ - -#define LAPIC_CAL_LOOPS (HZ/10) - -static __initdata int lapic_cal_loops = -1; -static __initdata long lapic_cal_t1, lapic_cal_t2; -static __initdata unsigned long long lapic_cal_tsc1, lapic_cal_tsc2; -static __initdata unsigned long lapic_cal_pm1, lapic_cal_pm2; -static __initdata unsigned long lapic_cal_j1, lapic_cal_j2; - -/* - * Temporary interrupt handler. - */ -static void __init lapic_cal_handler(struct clock_event_device *dev) -{ - unsigned long long tsc = 0; - long tapic = apic_read(APIC_TMCCT); - unsigned long pm = acpi_pm_read_early(); - - if (cpu_has_tsc) - rdtscll(tsc); - - switch (lapic_cal_loops++) { - case 0: - lapic_cal_t1 = tapic; - lapic_cal_tsc1 = tsc; - lapic_cal_pm1 = pm; - lapic_cal_j1 = jiffies; - break; - - case LAPIC_CAL_LOOPS: - lapic_cal_t2 = tapic; - lapic_cal_tsc2 = tsc; - if (pm < lapic_cal_pm1) - pm += ACPI_PM_OVRRUN; - lapic_cal_pm2 = pm; - lapic_cal_j2 = jiffies; - break; - } -} - -static int __init calibrate_by_pmtimer(long deltapm, long *delta) -{ - const long pm_100ms = PMTMR_TICKS_PER_SEC / 10; - const long pm_thresh = pm_100ms / 100; - unsigned long mult; - u64 res; - -#ifndef CONFIG_X86_PM_TIMER - return -1; -#endif - - apic_printk(APIC_VERBOSE, "... PM timer delta = %ld\n", deltapm); - - /* Check, if the PM timer is available */ - if (!deltapm) - return -1; - - mult = clocksource_hz2mult(PMTMR_TICKS_PER_SEC, 22); - - if (deltapm > (pm_100ms - pm_thresh) && - deltapm < (pm_100ms + pm_thresh)) { - apic_printk(APIC_VERBOSE, "... PM timer result ok\n"); - } else { - res = (((u64)deltapm) * mult) >> 22; - do_div(res, 1000000); - pr_warning("APIC calibration not consistent " - "with PM Timer: %ldms instead of 100ms\n", - (long)res); - /* Correct the lapic counter value */ - res = (((u64)(*delta)) * pm_100ms); - do_div(res, deltapm); - pr_info("APIC delta adjusted to PM-Timer: " - "%lu (%ld)\n", (unsigned long)res, *delta); - *delta = (long)res; - } - - return 0; -} - -static int __init calibrate_APIC_clock(void) -{ - struct clock_event_device *levt = &__get_cpu_var(lapic_events); - void (*real_handler)(struct clock_event_device *dev); - unsigned long deltaj; - long delta; - int pm_referenced = 0; - - local_irq_disable(); - - /* Replace the global interrupt handler */ - real_handler = global_clock_event->event_handler; - global_clock_event->event_handler = lapic_cal_handler; - - /* - * Setup the APIC counter to maximum. There is no way the lapic - * can underflow in the 100ms detection time frame - */ - __setup_APIC_LVTT(0xffffffff, 0, 0); - - /* Let the interrupts run */ - local_irq_enable(); - - while (lapic_cal_loops <= LAPIC_CAL_LOOPS) - cpu_relax(); - - local_irq_disable(); - - /* Restore the real event handler */ - global_clock_event->event_handler = real_handler; - - /* Build delta t1-t2 as apic timer counts down */ - delta = lapic_cal_t1 - lapic_cal_t2; - apic_printk(APIC_VERBOSE, "... lapic delta = %ld\n", delta); - - /* we trust the PM based calibration if possible */ - pm_referenced = !calibrate_by_pmtimer(lapic_cal_pm2 - lapic_cal_pm1, - &delta); - - /* Calculate the scaled math multiplication factor */ - lapic_clockevent.mult = div_sc(delta, TICK_NSEC * LAPIC_CAL_LOOPS, - lapic_clockevent.shift); - lapic_clockevent.max_delta_ns = - clockevent_delta2ns(0x7FFFFF, &lapic_clockevent); - lapic_clockevent.min_delta_ns = - clockevent_delta2ns(0xF, &lapic_clockevent); - - calibration_result = (delta * APIC_DIVISOR) / LAPIC_CAL_LOOPS; - - apic_printk(APIC_VERBOSE, "..... delta %ld\n", delta); - apic_printk(APIC_VERBOSE, "..... mult: %ld\n", lapic_clockevent.mult); - apic_printk(APIC_VERBOSE, "..... calibration result: %u\n", - calibration_result); - - if (cpu_has_tsc) { - delta = (long)(lapic_cal_tsc2 - lapic_cal_tsc1); - apic_printk(APIC_VERBOSE, "..... CPU clock speed is " - "%ld.%04ld MHz.\n", - (delta / LAPIC_CAL_LOOPS) / (1000000 / HZ), - (delta / LAPIC_CAL_LOOPS) % (1000000 / HZ)); - } - - apic_printk(APIC_VERBOSE, "..... host bus clock speed is " - "%u.%04u MHz.\n", - calibration_result / (1000000 / HZ), - calibration_result % (1000000 / HZ)); - - /* - * Do a sanity check on the APIC calibration result - */ - if (calibration_result < (1000000 / HZ)) { - local_irq_enable(); - pr_warning("APIC frequency too slow, disabling apic timer\n"); - return -1; - } - - levt->features &= ~CLOCK_EVT_FEAT_DUMMY; - - /* - * PM timer calibration failed or not turned on - * so lets try APIC timer based calibration - */ - if (!pm_referenced) { - apic_printk(APIC_VERBOSE, "... verify APIC timer\n"); - - /* - * Setup the apic timer manually - */ - levt->event_handler = lapic_cal_handler; - lapic_timer_setup(CLOCK_EVT_MODE_PERIODIC, levt); - lapic_cal_loops = -1; - - /* Let the interrupts run */ - local_irq_enable(); - - while (lapic_cal_loops <= LAPIC_CAL_LOOPS) - cpu_relax(); - - /* Stop the lapic timer */ - lapic_timer_setup(CLOCK_EVT_MODE_SHUTDOWN, levt); - - /* Jiffies delta */ - deltaj = lapic_cal_j2 - lapic_cal_j1; - apic_printk(APIC_VERBOSE, "... jiffies delta = %lu\n", deltaj); - - /* Check, if the jiffies result is consistent */ - if (deltaj >= LAPIC_CAL_LOOPS-2 && deltaj <= LAPIC_CAL_LOOPS+2) - apic_printk(APIC_VERBOSE, "... jiffies result ok\n"); - else - levt->features |= CLOCK_EVT_FEAT_DUMMY; - } else - local_irq_enable(); - - if (levt->features & CLOCK_EVT_FEAT_DUMMY) { - pr_warning("APIC timer disabled due to verification failure\n"); - return -1; - } - - return 0; -} - -/* - * Setup the boot APIC - * - * Calibrate and verify the result. - */ -void __init setup_boot_APIC_clock(void) -{ - /* - * The local apic timer can be disabled via the kernel - * commandline or from the CPU detection code. Register the lapic - * timer as a dummy clock event source on SMP systems, so the - * broadcast mechanism is used. On UP systems simply ignore it. - */ - if (disable_apic_timer) { - pr_info("Disabling APIC timer\n"); - /* No broadcast on UP ! */ - if (num_possible_cpus() > 1) { - lapic_clockevent.mult = 1; - setup_APIC_timer(); - } - return; - } - - apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n" - "calibrating APIC timer ...\n"); - - if (calibrate_APIC_clock()) { - /* No broadcast on UP ! */ - if (num_possible_cpus() > 1) - setup_APIC_timer(); - return; - } - - /* - * If nmi_watchdog is set to IO_APIC, we need the - * PIT/HPET going. Otherwise register lapic as a dummy - * device. - */ - if (nmi_watchdog != NMI_IO_APIC) - lapic_clockevent.features &= ~CLOCK_EVT_FEAT_DUMMY; - else - pr_warning("APIC timer registered as dummy," - " due to nmi_watchdog=%d!\n", nmi_watchdog); - - /* Setup the lapic or request the broadcast */ - setup_APIC_timer(); -} - -void __cpuinit setup_secondary_APIC_clock(void) -{ - setup_APIC_timer(); -} - -/* - * The guts of the apic timer interrupt - */ -static void local_apic_timer_interrupt(void) -{ - int cpu = smp_processor_id(); - struct clock_event_device *evt = &per_cpu(lapic_events, cpu); - - /* - * Normally we should not be here till LAPIC has been initialized but - * in some cases like kdump, its possible that there is a pending LAPIC - * timer interrupt from previous kernel's context and is delivered in - * new kernel the moment interrupts are enabled. - * - * Interrupts are enabled early and LAPIC is setup much later, hence - * its possible that when we get here evt->event_handler is NULL. - * Check for event_handler being NULL and discard the interrupt as - * spurious. - */ - if (!evt->event_handler) { - pr_warning("Spurious LAPIC timer interrupt on cpu %d\n", cpu); - /* Switch it off */ - lapic_timer_setup(CLOCK_EVT_MODE_SHUTDOWN, evt); - return; - } - - /* - * the NMI deadlock-detector uses this. - */ - inc_irq_stat(apic_timer_irqs); - - evt->event_handler(evt); -} - -/* - * Local APIC timer interrupt. This is the most natural way for doing - * local interrupts, but local timer interrupts can be emulated by - * broadcast interrupts too. [in case the hw doesn't support APIC timers] - * - * [ if a single-CPU system runs an SMP kernel then we call the local - * interrupt as well. Thus we cannot inline the local irq ... ] - */ -void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs) -{ - struct pt_regs *old_regs = set_irq_regs(regs); - - /* - * NOTE! We'd better ACK the irq immediately, - * because timer handling can be slow. - */ - ack_APIC_irq(); - /* - * update_process_times() expects us to have done irq_enter(). - * Besides, if we don't timer interrupts ignore the global - * interrupt lock, which is the WrongThing (tm) to do. - */ - exit_idle(); - irq_enter(); - local_apic_timer_interrupt(); - irq_exit(); - - set_irq_regs(old_regs); -} - -int setup_profiling_timer(unsigned int multiplier) -{ - return -EINVAL; -} - -/* - * Local APIC start and shutdown - */ - -/** - * clear_local_APIC - shutdown the local APIC - * - * This is called, when a CPU is disabled and before rebooting, so the state of - * the local APIC has no dangling leftovers. Also used to cleanout any BIOS - * leftovers during boot. - */ -void clear_local_APIC(void) -{ - int maxlvt; - u32 v; - - /* APIC hasn't been mapped yet */ - if (!apic_phys) - return; - - maxlvt = lapic_get_maxlvt(); - /* - * Masking an LVT entry can trigger a local APIC error - * if the vector is zero. Mask LVTERR first to prevent this. - */ - if (maxlvt >= 3) { - v = ERROR_APIC_VECTOR; /* any non-zero vector will do */ - apic_write(APIC_LVTERR, v | APIC_LVT_MASKED); - } - /* - * Careful: we have to set masks only first to deassert - * any level-triggered sources. - */ - v = apic_read(APIC_LVTT); - apic_write(APIC_LVTT, v | APIC_LVT_MASKED); - v = apic_read(APIC_LVT0); - apic_write(APIC_LVT0, v | APIC_LVT_MASKED); - v = apic_read(APIC_LVT1); - apic_write(APIC_LVT1, v | APIC_LVT_MASKED); - if (maxlvt >= 4) { - v = apic_read(APIC_LVTPC); - apic_write(APIC_LVTPC, v | APIC_LVT_MASKED); - } - - /* lets not touch this if we didn't frob it */ -#if defined(CONFIG_X86_MCE_P4THERMAL) || defined(CONFIG_X86_MCE_INTEL) - if (maxlvt >= 5) { - v = apic_read(APIC_LVTTHMR); - apic_write(APIC_LVTTHMR, v | APIC_LVT_MASKED); - } -#endif - /* - * Clean APIC state for other OSs: - */ - apic_write(APIC_LVTT, APIC_LVT_MASKED); - apic_write(APIC_LVT0, APIC_LVT_MASKED); - apic_write(APIC_LVT1, APIC_LVT_MASKED); - if (maxlvt >= 3) - apic_write(APIC_LVTERR, APIC_LVT_MASKED); - if (maxlvt >= 4) - apic_write(APIC_LVTPC, APIC_LVT_MASKED); - - /* Integrated APIC (!82489DX) ? */ - if (lapic_is_integrated()) { - if (maxlvt > 3) - /* Clear ESR due to Pentium errata 3AP and 11AP */ - apic_write(APIC_ESR, 0); - apic_read(APIC_ESR); - } -} - -/** - * disable_local_APIC - clear and disable the local APIC - */ -void disable_local_APIC(void) -{ - unsigned int value; - - /* APIC hasn't been mapped yet */ - if (!apic_phys) - return; - - clear_local_APIC(); - - /* - * Disable APIC (implies clearing of registers - * for 82489DX!). - */ - value = apic_read(APIC_SPIV); - value &= ~APIC_SPIV_APIC_ENABLED; - apic_write(APIC_SPIV, value); - -#ifdef CONFIG_X86_32 - /* - * When LAPIC was disabled by the BIOS and enabled by the kernel, - * restore the disabled state. - */ - if (enabled_via_apicbase) { - unsigned int l, h; - - rdmsr(MSR_IA32_APICBASE, l, h); - l &= ~MSR_IA32_APICBASE_ENABLE; - wrmsr(MSR_IA32_APICBASE, l, h); - } -#endif -} - -/* - * If Linux enabled the LAPIC against the BIOS default disable it down before - * re-entering the BIOS on shutdown. Otherwise the BIOS may get confused and - * not power-off. Additionally clear all LVT entries before disable_local_APIC - * for the case where Linux didn't enable the LAPIC. - */ -void lapic_shutdown(void) -{ - unsigned long flags; - - if (!cpu_has_apic) - return; - - local_irq_save(flags); - -#ifdef CONFIG_X86_32 - if (!enabled_via_apicbase) - clear_local_APIC(); - else -#endif - disable_local_APIC(); - - - local_irq_restore(flags); -} - -/* - * This is to verify that we're looking at a real local APIC. - * Check these against your board if the CPUs aren't getting - * started for no apparent reason. - */ -int __init verify_local_APIC(void) -{ - unsigned int reg0, reg1; - - /* - * The version register is read-only in a real APIC. - */ - reg0 = apic_read(APIC_LVR); - apic_printk(APIC_DEBUG, "Getting VERSION: %x\n", reg0); - apic_write(APIC_LVR, reg0 ^ APIC_LVR_MASK); - reg1 = apic_read(APIC_LVR); - apic_printk(APIC_DEBUG, "Getting VERSION: %x\n", reg1); - - /* - * The two version reads above should print the same - * numbers. If the second one is different, then we - * poke at a non-APIC. - */ - if (reg1 != reg0) - return 0; - - /* - * Check if the version looks reasonably. - */ - reg1 = GET_APIC_VERSION(reg0); - if (reg1 == 0x00 || reg1 == 0xff) - return 0; - reg1 = lapic_get_maxlvt(); - if (reg1 < 0x02 || reg1 == 0xff) - return 0; - - /* - * The ID register is read/write in a real APIC. - */ - reg0 = apic_read(APIC_ID); - apic_printk(APIC_DEBUG, "Getting ID: %x\n", reg0); - apic_write(APIC_ID, reg0 ^ APIC_ID_MASK); - reg1 = apic_read(APIC_ID); - apic_printk(APIC_DEBUG, "Getting ID: %x\n", reg1); - apic_write(APIC_ID, reg0); - if (reg1 != (reg0 ^ APIC_ID_MASK)) - return 0; - - /* - * The next two are just to see if we have sane values. - * They're only really relevant if we're in Virtual Wire - * compatibility mode, but most boxes are anymore. - */ - reg0 = apic_read(APIC_LVT0); - apic_printk(APIC_DEBUG, "Getting LVT0: %x\n", reg0); - reg1 = apic_read(APIC_LVT1); - apic_printk(APIC_DEBUG, "Getting LVT1: %x\n", reg1); - - return 1; -} - -/** - * sync_Arb_IDs - synchronize APIC bus arbitration IDs - */ -void __init sync_Arb_IDs(void) -{ - /* - * Unsupported on P4 - see Intel Dev. Manual Vol. 3, Ch. 8.6.1 And not - * needed on AMD. - */ - if (modern_apic() || boot_cpu_data.x86_vendor == X86_VENDOR_AMD) - return; - - /* - * Wait for idle. - */ - apic_wait_icr_idle(); - - apic_printk(APIC_DEBUG, "Synchronizing Arb IDs.\n"); - apic_write(APIC_ICR, APIC_DEST_ALLINC | - APIC_INT_LEVELTRIG | APIC_DM_INIT); -} - -/* - * An initial setup of the virtual wire mode. - */ -void __init init_bsp_APIC(void) -{ - unsigned int value; - - /* - * Don't do the setup now if we have a SMP BIOS as the - * through-I/O-APIC virtual wire mode might be active. - */ - if (smp_found_config || !cpu_has_apic) - return; - - /* - * Do not trust the local APIC being empty at bootup. - */ - clear_local_APIC(); - - /* - * Enable APIC. - */ - value = apic_read(APIC_SPIV); - value &= ~APIC_VECTOR_MASK; - value |= APIC_SPIV_APIC_ENABLED; - -#ifdef CONFIG_X86_32 - /* This bit is reserved on P4/Xeon and should be cleared */ - if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && - (boot_cpu_data.x86 == 15)) - value &= ~APIC_SPIV_FOCUS_DISABLED; - else -#endif - value |= APIC_SPIV_FOCUS_DISABLED; - value |= SPURIOUS_APIC_VECTOR; - apic_write(APIC_SPIV, value); - - /* - * Set up the virtual wire mode. - */ - apic_write(APIC_LVT0, APIC_DM_EXTINT); - value = APIC_DM_NMI; - if (!lapic_is_integrated()) /* 82489DX */ - value |= APIC_LVT_LEVEL_TRIGGER; - apic_write(APIC_LVT1, value); -} - -static void __cpuinit lapic_setup_esr(void) -{ - unsigned int oldvalue, value, maxlvt; - - if (!lapic_is_integrated()) { - pr_info("No ESR for 82489DX.\n"); - return; - } - - if (esr_disable) { - /* - * Something untraceable is creating bad interrupts on - * secondary quads ... for the moment, just leave the - * ESR disabled - we can't do anything useful with the - * errors anyway - mbligh - */ - pr_info("Leaving ESR disabled.\n"); - return; - } - - maxlvt = lapic_get_maxlvt(); - if (maxlvt > 3) /* Due to the Pentium erratum 3AP. */ - apic_write(APIC_ESR, 0); - oldvalue = apic_read(APIC_ESR); - - /* enables sending errors */ - value = ERROR_APIC_VECTOR; - apic_write(APIC_LVTERR, value); - - /* - * spec says clear errors after enabling vector. - */ - if (maxlvt > 3) - apic_write(APIC_ESR, 0); - value = apic_read(APIC_ESR); - if (value != oldvalue) - apic_printk(APIC_VERBOSE, "ESR value before enabling " - "vector: 0x%08x after: 0x%08x\n", - oldvalue, value); -} - - -/** - * setup_local_APIC - setup the local APIC - */ -void __cpuinit setup_local_APIC(void) -{ - unsigned int value; - int i, j; - -#ifdef CONFIG_X86_32 - /* Pound the ESR really hard over the head with a big hammer - mbligh */ - if (lapic_is_integrated() && esr_disable) { - apic_write(APIC_ESR, 0); - apic_write(APIC_ESR, 0); - apic_write(APIC_ESR, 0); - apic_write(APIC_ESR, 0); - } -#endif - - preempt_disable(); - - /* - * Double-check whether this APIC is really registered. - * This is meaningless in clustered apic mode, so we skip it. - */ - if (!apic_id_registered()) - BUG(); - - /* - * Intel recommends to set DFR, LDR and TPR before enabling - * an APIC. See e.g. "AP-388 82489DX User's Manual" (Intel - * document number 292116). So here it goes... - */ - init_apic_ldr(); - - /* - * Set Task Priority to 'accept all'. We never change this - * later on. - */ - value = apic_read(APIC_TASKPRI); - value &= ~APIC_TPRI_MASK; - apic_write(APIC_TASKPRI, value); - - /* - * After a crash, we no longer service the interrupts and a pending - * interrupt from previous kernel might still have ISR bit set. - * - * Most probably by now CPU has serviced that pending interrupt and - * it might not have done the ack_APIC_irq() because it thought, - * interrupt came from i8259 as ExtInt. LAPIC did not get EOI so it - * does not clear the ISR bit and cpu thinks it has already serivced - * the interrupt. Hence a vector might get locked. It was noticed - * for timer irq (vector 0x31). Issue an extra EOI to clear ISR. - */ - for (i = APIC_ISR_NR - 1; i >= 0; i--) { - value = apic_read(APIC_ISR + i*0x10); - for (j = 31; j >= 0; j--) { - if (value & (1< 1) || - (boot_cpu_data.x86 >= 15)) - break; - goto no_apic; - case X86_VENDOR_INTEL: - if (boot_cpu_data.x86 == 6 || boot_cpu_data.x86 == 15 || - (boot_cpu_data.x86 == 5 && cpu_has_apic)) - break; - goto no_apic; - default: - goto no_apic; - } - - if (!cpu_has_apic) { - /* - * Over-ride BIOS and try to enable the local APIC only if - * "lapic" specified. - */ - if (!force_enable_local_apic) { - pr_info("Local APIC disabled by BIOS -- " - "you can enable it with \"lapic\"\n"); - return -1; - } - /* - * Some BIOSes disable the local APIC in the APIC_BASE - * MSR. This can only be done in software for Intel P6 or later - * and AMD K7 (Model > 1) or later. - */ - rdmsr(MSR_IA32_APICBASE, l, h); - if (!(l & MSR_IA32_APICBASE_ENABLE)) { - pr_info("Local APIC disabled by BIOS -- reenabling.\n"); - l &= ~MSR_IA32_APICBASE_BASE; - l |= MSR_IA32_APICBASE_ENABLE | APIC_DEFAULT_PHYS_BASE; - wrmsr(MSR_IA32_APICBASE, l, h); - enabled_via_apicbase = 1; - } - } - /* - * The APIC feature bit should now be enabled - * in `cpuid' - */ - features = cpuid_edx(1); - if (!(features & (1 << X86_FEATURE_APIC))) { - pr_warning("Could not enable APIC!\n"); - return -1; - } - set_cpu_cap(&boot_cpu_data, X86_FEATURE_APIC); - mp_lapic_addr = APIC_DEFAULT_PHYS_BASE; - - /* The BIOS may have set up the APIC at some other address */ - rdmsr(MSR_IA32_APICBASE, l, h); - if (l & MSR_IA32_APICBASE_ENABLE) - mp_lapic_addr = l & MSR_IA32_APICBASE_BASE; - - pr_info("Found and enabled local APIC!\n"); - - apic_pm_activate(); - - return 0; - -no_apic: - pr_info("No local APIC present or hardware disabled\n"); - return -1; -} -#endif - -#ifdef CONFIG_X86_64 -void __init early_init_lapic_mapping(void) -{ - unsigned long phys_addr; - - /* - * If no local APIC can be found then go out - * : it means there is no mpatable and MADT - */ - if (!smp_found_config) - return; - - phys_addr = mp_lapic_addr; - - set_fixmap_nocache(FIX_APIC_BASE, phys_addr); - apic_printk(APIC_VERBOSE, "mapped APIC to %16lx (%16lx)\n", - APIC_BASE, phys_addr); - - /* - * Fetch the APIC ID of the BSP in case we have a - * default configuration (or the MP table is broken). - */ - boot_cpu_physical_apicid = read_apic_id(); -} -#endif - -/** - * init_apic_mappings - initialize APIC mappings - */ -void __init init_apic_mappings(void) -{ -#ifdef HAVE_X2APIC - if (x2apic) { - boot_cpu_physical_apicid = read_apic_id(); - return; - } -#endif - - /* - * If no local APIC can be found then set up a fake all - * zeroes page to simulate the local APIC and another - * one for the IO-APIC. - */ - if (!smp_found_config && detect_init_APIC()) { - apic_phys = (unsigned long) alloc_bootmem_pages(PAGE_SIZE); - apic_phys = __pa(apic_phys); - } else - apic_phys = mp_lapic_addr; - - set_fixmap_nocache(FIX_APIC_BASE, apic_phys); - apic_printk(APIC_VERBOSE, "mapped APIC to %08lx (%08lx)\n", - APIC_BASE, apic_phys); - - /* - * Fetch the APIC ID of the BSP in case we have a - * default configuration (or the MP table is broken). - */ - if (boot_cpu_physical_apicid == -1U) - boot_cpu_physical_apicid = read_apic_id(); -} - -/* - * This initializes the IO-APIC and APIC hardware if this is - * a UP kernel. - */ -int apic_version[MAX_APICS]; - -int __init APIC_init_uniprocessor(void) -{ -#ifdef CONFIG_X86_64 - if (disable_apic) { - pr_info("Apic disabled\n"); - return -1; - } - if (!cpu_has_apic) { - disable_apic = 1; - pr_info("Apic disabled by BIOS\n"); - return -1; - } -#else - if (!smp_found_config && !cpu_has_apic) - return -1; - - /* - * Complain if the BIOS pretends there is one. - */ - if (!cpu_has_apic && - APIC_INTEGRATED(apic_version[boot_cpu_physical_apicid])) { - pr_err("BIOS bug, local APIC 0x%x not detected!...\n", - boot_cpu_physical_apicid); - clear_cpu_cap(&boot_cpu_data, X86_FEATURE_APIC); - return -1; - } -#endif - -#ifdef HAVE_X2APIC - enable_IR_x2apic(); -#endif -#ifdef CONFIG_X86_64 - setup_apic_routing(); -#endif - - verify_local_APIC(); - connect_bsp_APIC(); - -#ifdef CONFIG_X86_64 - apic_write(APIC_ID, SET_APIC_ID(boot_cpu_physical_apicid)); -#else - /* - * Hack: In case of kdump, after a crash, kernel might be booting - * on a cpu with non-zero lapic id. But boot_cpu_physical_apicid - * might be zero if read from MP tables. Get it from LAPIC. - */ -# ifdef CONFIG_CRASH_DUMP - boot_cpu_physical_apicid = read_apic_id(); -# endif -#endif - physid_set_mask_of_physid(boot_cpu_physical_apicid, &phys_cpu_present_map); - setup_local_APIC(); - -#ifdef CONFIG_X86_64 - /* - * Now enable IO-APICs, actually call clear_IO_APIC - * We need clear_IO_APIC before enabling vector on BP - */ - if (!skip_ioapic_setup && nr_ioapics) - enable_IO_APIC(); -#endif - -#ifdef CONFIG_X86_IO_APIC - if (!smp_found_config || skip_ioapic_setup || !nr_ioapics) -#endif - localise_nmi_watchdog(); - end_local_APIC_setup(); - -#ifdef CONFIG_X86_IO_APIC - if (smp_found_config && !skip_ioapic_setup && nr_ioapics) - setup_IO_APIC(); -# ifdef CONFIG_X86_64 - else - nr_ioapics = 0; -# endif -#endif - -#ifdef CONFIG_X86_64 - setup_boot_APIC_clock(); - check_nmi_watchdog(); -#else - setup_boot_clock(); -#endif - - return 0; -} - -/* - * Local APIC interrupts - */ - -/* - * This interrupt should _never_ happen with our APIC/SMP architecture - */ -void smp_spurious_interrupt(struct pt_regs *regs) -{ - u32 v; - - exit_idle(); - irq_enter(); - /* - * Check if this really is a spurious interrupt and ACK it - * if it is a vectored one. Just in case... - * Spurious interrupts should not be ACKed. - */ - v = apic_read(APIC_ISR + ((SPURIOUS_APIC_VECTOR & ~0x1f) >> 1)); - if (v & (1 << (SPURIOUS_APIC_VECTOR & 0x1f))) - ack_APIC_irq(); - - inc_irq_stat(irq_spurious_count); - - /* see sw-dev-man vol 3, chapter 7.4.13.5 */ - pr_info("spurious APIC interrupt on CPU#%d, " - "should never happen.\n", smp_processor_id()); - irq_exit(); -} - -/* - * This interrupt should never happen with our APIC/SMP architecture - */ -void smp_error_interrupt(struct pt_regs *regs) -{ - u32 v, v1; - - exit_idle(); - irq_enter(); - /* First tickle the hardware, only then report what went on. -- REW */ - v = apic_read(APIC_ESR); - apic_write(APIC_ESR, 0); - v1 = apic_read(APIC_ESR); - ack_APIC_irq(); - atomic_inc(&irq_err_count); - - /* - * Here is what the APIC error bits mean: - * 0: Send CS error - * 1: Receive CS error - * 2: Send accept error - * 3: Receive accept error - * 4: Reserved - * 5: Send illegal vector - * 6: Received illegal vector - * 7: Illegal register address - */ - pr_debug("APIC error on CPU%d: %02x(%02x)\n", - smp_processor_id(), v , v1); - irq_exit(); -} - -/** - * connect_bsp_APIC - attach the APIC to the interrupt system - */ -void __init connect_bsp_APIC(void) -{ -#ifdef CONFIG_X86_32 - if (pic_mode) { - /* - * Do not trust the local APIC being empty at bootup. - */ - clear_local_APIC(); - /* - * PIC mode, enable APIC mode in the IMCR, i.e. connect BSP's - * local APIC to INT and NMI lines. - */ - apic_printk(APIC_VERBOSE, "leaving PIC mode, " - "enabling APIC mode.\n"); - outb(0x70, 0x22); - outb(0x01, 0x23); - } -#endif - enable_apic_mode(); -} - -/** - * disconnect_bsp_APIC - detach the APIC from the interrupt system - * @virt_wire_setup: indicates, whether virtual wire mode is selected - * - * Virtual wire mode is necessary to deliver legacy interrupts even when the - * APIC is disabled. - */ -void disconnect_bsp_APIC(int virt_wire_setup) -{ - unsigned int value; - -#ifdef CONFIG_X86_32 - if (pic_mode) { - /* - * Put the board back into PIC mode (has an effect only on - * certain older boards). Note that APIC interrupts, including - * IPIs, won't work beyond this point! The only exception are - * INIT IPIs. - */ - apic_printk(APIC_VERBOSE, "disabling APIC mode, " - "entering PIC mode.\n"); - outb(0x70, 0x22); - outb(0x00, 0x23); - return; - } -#endif - - /* Go back to Virtual Wire compatibility mode */ - - /* For the spurious interrupt use vector F, and enable it */ - value = apic_read(APIC_SPIV); - value &= ~APIC_VECTOR_MASK; - value |= APIC_SPIV_APIC_ENABLED; - value |= 0xf; - apic_write(APIC_SPIV, value); - - if (!virt_wire_setup) { - /* - * For LVT0 make it edge triggered, active high, - * external and enabled - */ - value = apic_read(APIC_LVT0); - value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING | - APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR | - APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED); - value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING; - value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_EXTINT); - apic_write(APIC_LVT0, value); - } else { - /* Disable LVT0 */ - apic_write(APIC_LVT0, APIC_LVT_MASKED); - } - - /* - * For LVT1 make it edge triggered, active high, - * nmi and enabled - */ - value = apic_read(APIC_LVT1); - value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING | - APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR | - APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED); - value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING; - value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_NMI); - apic_write(APIC_LVT1, value); -} - -void __cpuinit generic_processor_info(int apicid, int version) -{ - int cpu; - - /* - * Validate version - */ - if (version == 0x0) { - pr_warning("BIOS bug, APIC version is 0 for CPU#%d! " - "fixing up to 0x10. (tell your hw vendor)\n", - version); - version = 0x10; - } - apic_version[apicid] = version; - - if (num_processors >= nr_cpu_ids) { - int max = nr_cpu_ids; - int thiscpu = max + disabled_cpus; - - pr_warning( - "ACPI: NR_CPUS/possible_cpus limit of %i reached." - " Processor %d/0x%x ignored.\n", max, thiscpu, apicid); - - disabled_cpus++; - return; - } - - num_processors++; - cpu = cpumask_next_zero(-1, cpu_present_mask); - - if (version != apic_version[boot_cpu_physical_apicid]) - WARN_ONCE(1, - "ACPI: apic version mismatch, bootcpu: %x cpu %d: %x\n", - apic_version[boot_cpu_physical_apicid], cpu, version); - - physid_set(apicid, phys_cpu_present_map); - if (apicid == boot_cpu_physical_apicid) { - /* - * x86_bios_cpu_apicid is required to have processors listed - * in same order as logical cpu numbers. Hence the first - * entry is BSP, and so on. - */ - cpu = 0; - } - if (apicid > max_physical_apicid) - max_physical_apicid = apicid; - -#ifdef CONFIG_X86_32 - /* - * Would be preferable to switch to bigsmp when CONFIG_HOTPLUG_CPU=y - * but we need to work other dependencies like SMP_SUSPEND etc - * before this can be done without some confusion. - * if (CPU_HOTPLUG_ENABLED || num_processors > 8) - * - Ashok Raj - */ - if (max_physical_apicid >= 8) { - switch (boot_cpu_data.x86_vendor) { - case X86_VENDOR_INTEL: - if (!APIC_XAPIC(version)) { - def_to_bigsmp = 0; - break; - } - /* If P4 and above fall through */ - case X86_VENDOR_AMD: - def_to_bigsmp = 1; - } - } -#endif - -#if defined(CONFIG_X86_SMP) || defined(CONFIG_X86_64) - /* are we being called early in kernel startup? */ - if (early_per_cpu_ptr(x86_cpu_to_apicid)) { - u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid); - u16 *bios_cpu_apicid = early_per_cpu_ptr(x86_bios_cpu_apicid); - - cpu_to_apicid[cpu] = apicid; - bios_cpu_apicid[cpu] = apicid; - } else { - per_cpu(x86_cpu_to_apicid, cpu) = apicid; - per_cpu(x86_bios_cpu_apicid, cpu) = apicid; - } -#endif - - set_cpu_possible(cpu, true); - set_cpu_present(cpu, true); -} - -#ifdef CONFIG_X86_64 -int hard_smp_processor_id(void) -{ - return read_apic_id(); -} -#endif - -/* - * Power management - */ -#ifdef CONFIG_PM - -static struct { - /* - * 'active' is true if the local APIC was enabled by us and - * not the BIOS; this signifies that we are also responsible - * for disabling it before entering apm/acpi suspend - */ - int active; - /* r/w apic fields */ - unsigned int apic_id; - unsigned int apic_taskpri; - unsigned int apic_ldr; - unsigned int apic_dfr; - unsigned int apic_spiv; - unsigned int apic_lvtt; - unsigned int apic_lvtpc; - unsigned int apic_lvt0; - unsigned int apic_lvt1; - unsigned int apic_lvterr; - unsigned int apic_tmict; - unsigned int apic_tdcr; - unsigned int apic_thmr; -} apic_pm_state; - -static int lapic_suspend(struct sys_device *dev, pm_message_t state) -{ - unsigned long flags; - int maxlvt; - - if (!apic_pm_state.active) - return 0; - - maxlvt = lapic_get_maxlvt(); - - apic_pm_state.apic_id = apic_read(APIC_ID); - apic_pm_state.apic_taskpri = apic_read(APIC_TASKPRI); - apic_pm_state.apic_ldr = apic_read(APIC_LDR); - apic_pm_state.apic_dfr = apic_read(APIC_DFR); - apic_pm_state.apic_spiv = apic_read(APIC_SPIV); - apic_pm_state.apic_lvtt = apic_read(APIC_LVTT); - if (maxlvt >= 4) - apic_pm_state.apic_lvtpc = apic_read(APIC_LVTPC); - apic_pm_state.apic_lvt0 = apic_read(APIC_LVT0); - apic_pm_state.apic_lvt1 = apic_read(APIC_LVT1); - apic_pm_state.apic_lvterr = apic_read(APIC_LVTERR); - apic_pm_state.apic_tmict = apic_read(APIC_TMICT); - apic_pm_state.apic_tdcr = apic_read(APIC_TDCR); -#if defined(CONFIG_X86_MCE_P4THERMAL) || defined(CONFIG_X86_MCE_INTEL) - if (maxlvt >= 5) - apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR); -#endif - - local_irq_save(flags); - disable_local_APIC(); - local_irq_restore(flags); - return 0; -} - -static int lapic_resume(struct sys_device *dev) -{ - unsigned int l, h; - unsigned long flags; - int maxlvt; - - if (!apic_pm_state.active) - return 0; - - maxlvt = lapic_get_maxlvt(); - - local_irq_save(flags); - -#ifdef HAVE_X2APIC - if (x2apic) - enable_x2apic(); - else -#endif - { - /* - * Make sure the APICBASE points to the right address - * - * FIXME! This will be wrong if we ever support suspend on - * SMP! We'll need to do this as part of the CPU restore! - */ - rdmsr(MSR_IA32_APICBASE, l, h); - l &= ~MSR_IA32_APICBASE_BASE; - l |= MSR_IA32_APICBASE_ENABLE | mp_lapic_addr; - wrmsr(MSR_IA32_APICBASE, l, h); - } - - apic_write(APIC_LVTERR, ERROR_APIC_VECTOR | APIC_LVT_MASKED); - apic_write(APIC_ID, apic_pm_state.apic_id); - apic_write(APIC_DFR, apic_pm_state.apic_dfr); - apic_write(APIC_LDR, apic_pm_state.apic_ldr); - apic_write(APIC_TASKPRI, apic_pm_state.apic_taskpri); - apic_write(APIC_SPIV, apic_pm_state.apic_spiv); - apic_write(APIC_LVT0, apic_pm_state.apic_lvt0); - apic_write(APIC_LVT1, apic_pm_state.apic_lvt1); -#if defined(CONFIG_X86_MCE_P4THERMAL) || defined(CONFIG_X86_MCE_INTEL) - if (maxlvt >= 5) - apic_write(APIC_LVTTHMR, apic_pm_state.apic_thmr); -#endif - if (maxlvt >= 4) - apic_write(APIC_LVTPC, apic_pm_state.apic_lvtpc); - apic_write(APIC_LVTT, apic_pm_state.apic_lvtt); - apic_write(APIC_TDCR, apic_pm_state.apic_tdcr); - apic_write(APIC_TMICT, apic_pm_state.apic_tmict); - apic_write(APIC_ESR, 0); - apic_read(APIC_ESR); - apic_write(APIC_LVTERR, apic_pm_state.apic_lvterr); - apic_write(APIC_ESR, 0); - apic_read(APIC_ESR); - - local_irq_restore(flags); - - return 0; -} - -/* - * This device has no shutdown method - fully functioning local APICs - * are needed on every CPU up until machine_halt/restart/poweroff. - */ - -static struct sysdev_class lapic_sysclass = { - .name = "lapic", - .resume = lapic_resume, - .suspend = lapic_suspend, -}; - -static struct sys_device device_lapic = { - .id = 0, - .cls = &lapic_sysclass, -}; - -static void __cpuinit apic_pm_activate(void) -{ - apic_pm_state.active = 1; -} - -static int __init init_lapic_sysfs(void) -{ - int error; - - if (!cpu_has_apic) - return 0; - /* XXX: remove suspend/resume procs if !apic_pm_state.active? */ - - error = sysdev_class_register(&lapic_sysclass); - if (!error) - error = sysdev_register(&device_lapic); - return error; -} -device_initcall(init_lapic_sysfs); - -#else /* CONFIG_PM */ - -static void apic_pm_activate(void) { } - -#endif /* CONFIG_PM */ - -#ifdef CONFIG_X86_64 -/* - * apic_is_clustered_box() -- Check if we can expect good TSC - * - * Thus far, the major user of this is IBM's Summit2 series: - * - * Clustered boxes may have unsynced TSC problems if they are - * multi-chassis. Use available data to take a good guess. - * If in doubt, go HPET. - */ -__cpuinit int apic_is_clustered_box(void) -{ - int i, clusters, zeros; - unsigned id; - u16 *bios_cpu_apicid; - DECLARE_BITMAP(clustermap, NUM_APIC_CLUSTERS); - - /* - * there is not this kind of box with AMD CPU yet. - * Some AMD box with quadcore cpu and 8 sockets apicid - * will be [4, 0x23] or [8, 0x27] could be thought to - * vsmp box still need checking... - */ - if ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) && !is_vsmp_box()) - return 0; - - bios_cpu_apicid = early_per_cpu_ptr(x86_bios_cpu_apicid); - bitmap_zero(clustermap, NUM_APIC_CLUSTERS); - - for (i = 0; i < nr_cpu_ids; i++) { - /* are we being called early in kernel startup? */ - if (bios_cpu_apicid) { - id = bios_cpu_apicid[i]; - } else if (i < nr_cpu_ids) { - if (cpu_present(i)) - id = per_cpu(x86_bios_cpu_apicid, i); - else - continue; - } else - break; - - if (id != BAD_APICID) - __set_bit(APIC_CLUSTERID(id), clustermap); - } - - /* Problem: Partially populated chassis may not have CPUs in some of - * the APIC clusters they have been allocated. Only present CPUs have - * x86_bios_cpu_apicid entries, thus causing zeroes in the bitmap. - * Since clusters are allocated sequentially, count zeros only if - * they are bounded by ones. - */ - clusters = 0; - zeros = 0; - for (i = 0; i < NUM_APIC_CLUSTERS; i++) { - if (test_bit(i, clustermap)) { - clusters += 1 + zeros; - zeros = 0; - } else - ++zeros; - } - - /* ScaleMP vSMPowered boxes have one cluster per board and TSCs are - * not guaranteed to be synced between boards - */ - if (is_vsmp_box() && clusters > 1) - return 1; - - /* - * If clusters > 2, then should be multi-chassis. - * May have to revisit this when multi-core + hyperthreaded CPUs come - * out, but AFAIK this will work even for them. - */ - return (clusters > 2); -} -#endif - -/* - * APIC command line parameters - */ -static int __init setup_disableapic(char *arg) -{ - disable_apic = 1; - setup_clear_cpu_cap(X86_FEATURE_APIC); - return 0; -} -early_param("disableapic", setup_disableapic); - -/* same as disableapic, for compatibility */ -static int __init setup_nolapic(char *arg) -{ - return setup_disableapic(arg); -} -early_param("nolapic", setup_nolapic); - -static int __init parse_lapic_timer_c2_ok(char *arg) -{ - local_apic_timer_c2_ok = 1; - return 0; -} -early_param("lapic_timer_c2_ok", parse_lapic_timer_c2_ok); - -static int __init parse_disable_apic_timer(char *arg) -{ - disable_apic_timer = 1; - return 0; -} -early_param("noapictimer", parse_disable_apic_timer); - -static int __init parse_nolapic_timer(char *arg) -{ - disable_apic_timer = 1; - return 0; -} -early_param("nolapic_timer", parse_nolapic_timer); - -static int __init apic_set_verbosity(char *arg) -{ - if (!arg) { -#ifdef CONFIG_X86_64 - skip_ioapic_setup = 0; - return 0; -#endif - return -EINVAL; - } - - if (strcmp("debug", arg) == 0) - apic_verbosity = APIC_DEBUG; - else if (strcmp("verbose", arg) == 0) - apic_verbosity = APIC_VERBOSE; - else { - pr_warning("APIC Verbosity level %s not recognised" - " use apic=verbose or apic=debug\n", arg); - return -EINVAL; - } - - return 0; -} -early_param("apic", apic_set_verbosity); - -static int __init lapic_insert_resource(void) -{ - if (!apic_phys) - return -1; - - /* Put local APIC into the resource map. */ - lapic_resource.start = apic_phys; - lapic_resource.end = lapic_resource.start + PAGE_SIZE - 1; - insert_resource(&iomem_resource, &lapic_resource); - - return 0; -} - -/* - * need call insert after e820_reserve_resources() - * that is using request_resource - */ -late_initcall(lapic_insert_resource); Index: linux-2.6-tip/arch/x86/kernel/apic/Makefile =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/apic/Makefile @@ -0,0 +1,19 @@ +# +# Makefile for local APIC drivers and for the IO-APIC code +# + +obj-$(CONFIG_X86_LOCAL_APIC) += apic.o probe_$(BITS).o ipi.o nmi.o +obj-$(CONFIG_X86_IO_APIC) += io_apic.o +obj-$(CONFIG_SMP) += ipi.o + +ifeq ($(CONFIG_X86_64),y) +obj-y += apic_flat_64.o +obj-$(CONFIG_X86_X2APIC) += x2apic_cluster.o +obj-$(CONFIG_X86_X2APIC) += x2apic_phys.o +obj-$(CONFIG_X86_UV) += x2apic_uv_x.o +endif + +obj-$(CONFIG_X86_BIGSMP) += bigsmp_32.o +obj-$(CONFIG_X86_NUMAQ) += numaq_32.o +obj-$(CONFIG_X86_ES7000) += es7000_32.o +obj-$(CONFIG_X86_SUMMIT) += summit_32.o Index: linux-2.6-tip/arch/x86/kernel/apic/apic.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/apic/apic.c @@ -0,0 +1,2208 @@ +/* + * Local APIC handling, local APIC timers + * + * (c) 1999, 2000, 2009 Ingo Molnar + * + * Fixes + * Maciej W. Rozycki : Bits for genuine 82489DX APICs; + * thanks to Eric Gilmore + * and Rolf G. Tews + * for testing these extensively. + * Maciej W. Rozycki : Various updates and fixes. + * Mikael Pettersson : Power Management for UP-APIC. + * Pavel Machek and + * Mikael Pettersson : PM converted to driver model. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +unsigned int num_processors; + +unsigned disabled_cpus __cpuinitdata; + +/* Processor that is doing the boot up */ +unsigned int boot_cpu_physical_apicid = -1U; + +/* + * The highest APIC ID seen during enumeration. + * + * This determines the messaging protocol we can use: if all APIC IDs + * are in the 0 ... 7 range, then we can use logical addressing which + * has some performance advantages (better broadcasting). + * + * If there's an APIC ID above 8, we use physical addressing. + */ +unsigned int max_physical_apicid; + +/* + * Bitmask of physically existing CPUs: + */ +physid_mask_t phys_cpu_present_map; + +/* + * Map cpu index to physical APIC ID + */ +DEFINE_EARLY_PER_CPU(u16, x86_cpu_to_apicid, BAD_APICID); +DEFINE_EARLY_PER_CPU(u16, x86_bios_cpu_apicid, BAD_APICID); +EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_apicid); +EXPORT_EARLY_PER_CPU_SYMBOL(x86_bios_cpu_apicid); + +#ifdef CONFIG_X86_32 +/* + * Knob to control our willingness to enable the local APIC. + * + * +1=force-enable + */ +static int force_enable_local_apic; +/* + * APIC command line parameters + */ +static int __init parse_lapic(char *arg) +{ + force_enable_local_apic = 1; + return 0; +} +early_param("lapic", parse_lapic); +/* Local APIC was disabled by the BIOS and enabled by the kernel */ +static int enabled_via_apicbase; + +#endif + +#ifdef CONFIG_X86_64 +static int apic_calibrate_pmtmr __initdata; +static __init int setup_apicpmtimer(char *s) +{ + apic_calibrate_pmtmr = 1; + notsc_setup(NULL); + return 0; +} +__setup("apicpmtimer", setup_apicpmtimer); +#endif + +#ifdef CONFIG_X86_X2APIC +int x2apic; +/* x2apic enabled before OS handover */ +static int x2apic_preenabled; +static int disable_x2apic; +static __init int setup_nox2apic(char *str) +{ + disable_x2apic = 1; + setup_clear_cpu_cap(X86_FEATURE_X2APIC); + return 0; +} +early_param("nox2apic", setup_nox2apic); +#endif + +unsigned long mp_lapic_addr; +int disable_apic; +/* Disable local APIC timer from the kernel commandline or via dmi quirk */ +static int disable_apic_timer __cpuinitdata; +/* Local APIC timer works in C2 */ +int local_apic_timer_c2_ok; +EXPORT_SYMBOL_GPL(local_apic_timer_c2_ok); + +int first_system_vector = 0xfe; + +/* + * Debug level, exported for io_apic.c + */ +unsigned int apic_verbosity; + +int pic_mode; + +/* Have we found an MP table */ +int smp_found_config; + +static struct resource lapic_resource = { + .name = "Local APIC", + .flags = IORESOURCE_MEM | IORESOURCE_BUSY, +}; + +static unsigned int calibration_result; + +static int lapic_next_event(unsigned long delta, + struct clock_event_device *evt); +static void lapic_timer_setup(enum clock_event_mode mode, + struct clock_event_device *evt); +static void lapic_timer_broadcast(const struct cpumask *mask); +static void apic_pm_activate(void); + +/* + * The local apic timer can be used for any function which is CPU local. + */ +static struct clock_event_device lapic_clockevent = { + .name = "lapic", + .features = CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT + | CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_DUMMY, + .shift = 32, + .set_mode = lapic_timer_setup, + .set_next_event = lapic_next_event, + .broadcast = lapic_timer_broadcast, + .rating = 100, + .irq = -1, +}; +static DEFINE_PER_CPU(struct clock_event_device, lapic_events); + +static unsigned long apic_phys; + +/* + * Get the LAPIC version + */ +static inline int lapic_get_version(void) +{ + return GET_APIC_VERSION(apic_read(APIC_LVR)); +} + +/* + * Check, if the APIC is integrated or a separate chip + */ +static inline int lapic_is_integrated(void) +{ +#ifdef CONFIG_X86_64 + return 1; +#else + return APIC_INTEGRATED(lapic_get_version()); +#endif +} + +/* + * Check, whether this is a modern or a first generation APIC + */ +static int modern_apic(void) +{ + /* AMD systems use old APIC versions, so check the CPU */ + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD && + boot_cpu_data.x86 >= 0xf) + return 1; + return lapic_get_version() >= 0x14; +} + +void native_apic_wait_icr_idle(void) +{ + while (apic_read(APIC_ICR) & APIC_ICR_BUSY) + cpu_relax(); +} + +u32 native_safe_apic_wait_icr_idle(void) +{ + u32 send_status; + int timeout; + + timeout = 0; + do { + send_status = apic_read(APIC_ICR) & APIC_ICR_BUSY; + if (!send_status) + break; + udelay(100); + } while (timeout++ < 1000); + + return send_status; +} + +void native_apic_icr_write(u32 low, u32 id) +{ + apic_write(APIC_ICR2, SET_APIC_DEST_FIELD(id)); + apic_write(APIC_ICR, low); +} + +u64 native_apic_icr_read(void) +{ + u32 icr1, icr2; + + icr2 = apic_read(APIC_ICR2); + icr1 = apic_read(APIC_ICR); + + return icr1 | ((u64)icr2 << 32); +} + +/** + * enable_NMI_through_LVT0 - enable NMI through local vector table 0 + */ +void __cpuinit enable_NMI_through_LVT0(void) +{ + unsigned int v; + + /* unmask and set to NMI */ + v = APIC_DM_NMI; + + /* Level triggered for 82489DX (32bit mode) */ + if (!lapic_is_integrated()) + v |= APIC_LVT_LEVEL_TRIGGER; + + apic_write(APIC_LVT0, v); +} + +#ifdef CONFIG_X86_32 +/** + * get_physical_broadcast - Get number of physical broadcast IDs + */ +int get_physical_broadcast(void) +{ + return modern_apic() ? 0xff : 0xf; +} +#endif + +/** + * lapic_get_maxlvt - get the maximum number of local vector table entries + */ +int lapic_get_maxlvt(void) +{ + unsigned int v; + + v = apic_read(APIC_LVR); + /* + * - we always have APIC integrated on 64bit mode + * - 82489DXs do not report # of LVT entries + */ + return APIC_INTEGRATED(GET_APIC_VERSION(v)) ? GET_APIC_MAXLVT(v) : 2; +} + +/* + * Local APIC timer + */ + +/* Clock divisor */ +#define APIC_DIVISOR 16 + +/* + * This function sets up the local APIC timer, with a timeout of + * 'clocks' APIC bus clock. During calibration we actually call + * this function twice on the boot CPU, once with a bogus timeout + * value, second time for real. The other (noncalibrating) CPUs + * call this function only once, with the real, calibrated value. + * + * We do reads before writes even if unnecessary, to get around the + * P5 APIC double write bug. + */ +static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen) +{ + unsigned int lvtt_value, tmp_value; + + lvtt_value = LOCAL_TIMER_VECTOR; + if (!oneshot) + lvtt_value |= APIC_LVT_TIMER_PERIODIC; + if (!lapic_is_integrated()) + lvtt_value |= SET_APIC_TIMER_BASE(APIC_TIMER_BASE_DIV); + + if (!irqen) + lvtt_value |= APIC_LVT_MASKED; + + apic_write(APIC_LVTT, lvtt_value); + + /* + * Divide PICLK by 16 + */ + tmp_value = apic_read(APIC_TDCR); + apic_write(APIC_TDCR, + (tmp_value & ~(APIC_TDR_DIV_1 | APIC_TDR_DIV_TMBASE)) | + APIC_TDR_DIV_16); + + if (!oneshot) + apic_write(APIC_TMICT, clocks / APIC_DIVISOR); +} + +/* + * Setup extended LVT, AMD specific (K8, family 10h) + * + * Vector mappings are hard coded. On K8 only offset 0 (APIC500) and + * MCE interrupts are supported. Thus MCE offset must be set to 0. + * + * If mask=1, the LVT entry does not generate interrupts while mask=0 + * enables the vector. See also the BKDGs. + */ + +#define APIC_EILVT_LVTOFF_MCE 0 +#define APIC_EILVT_LVTOFF_IBS 1 + +static void setup_APIC_eilvt(u8 lvt_off, u8 vector, u8 msg_type, u8 mask) +{ + unsigned long reg = (lvt_off << 4) + APIC_EILVT0; + unsigned int v = (mask << 16) | (msg_type << 8) | vector; + + apic_write(reg, v); +} + +u8 setup_APIC_eilvt_mce(u8 vector, u8 msg_type, u8 mask) +{ + setup_APIC_eilvt(APIC_EILVT_LVTOFF_MCE, vector, msg_type, mask); + return APIC_EILVT_LVTOFF_MCE; +} + +u8 setup_APIC_eilvt_ibs(u8 vector, u8 msg_type, u8 mask) +{ + setup_APIC_eilvt(APIC_EILVT_LVTOFF_IBS, vector, msg_type, mask); + return APIC_EILVT_LVTOFF_IBS; +} +EXPORT_SYMBOL_GPL(setup_APIC_eilvt_ibs); + +/* + * Program the next event, relative to now + */ +static int lapic_next_event(unsigned long delta, + struct clock_event_device *evt) +{ + apic_write(APIC_TMICT, delta); + return 0; +} + +/* + * Setup the lapic timer in periodic or oneshot mode + */ +static void lapic_timer_setup(enum clock_event_mode mode, + struct clock_event_device *evt) +{ + unsigned long flags; + unsigned int v; + + /* Lapic used as dummy for broadcast ? */ + if (evt->features & CLOCK_EVT_FEAT_DUMMY) + return; + + local_irq_save(flags); + + switch (mode) { + case CLOCK_EVT_MODE_PERIODIC: + case CLOCK_EVT_MODE_ONESHOT: + __setup_APIC_LVTT(calibration_result, + mode != CLOCK_EVT_MODE_PERIODIC, 1); + break; + case CLOCK_EVT_MODE_UNUSED: + case CLOCK_EVT_MODE_SHUTDOWN: + v = apic_read(APIC_LVTT); + v |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR); + apic_write(APIC_LVTT, v); + apic_write(APIC_TMICT, 0xffffffff); + break; + case CLOCK_EVT_MODE_RESUME: + /* Nothing to do here */ + break; + } + + local_irq_restore(flags); +} + +/* + * Local APIC timer broadcast function + */ +static void lapic_timer_broadcast(const struct cpumask *mask) +{ +#ifdef CONFIG_SMP + apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR); +#endif +} + +/* + * Setup the local APIC timer for this CPU. Copy the initilized values + * of the boot CPU and register the clock event in the framework. + */ +static void __cpuinit setup_APIC_timer(void) +{ + struct clock_event_device *levt = &__get_cpu_var(lapic_events); + + memcpy(levt, &lapic_clockevent, sizeof(*levt)); + levt->cpumask = cpumask_of(smp_processor_id()); + + clockevents_register_device(levt); +} + +/* + * In this functions we calibrate APIC bus clocks to the external timer. + * + * We want to do the calibration only once since we want to have local timer + * irqs syncron. CPUs connected by the same APIC bus have the very same bus + * frequency. + * + * This was previously done by reading the PIT/HPET and waiting for a wrap + * around to find out, that a tick has elapsed. I have a box, where the PIT + * readout is broken, so it never gets out of the wait loop again. This was + * also reported by others. + * + * Monitoring the jiffies value is inaccurate and the clockevents + * infrastructure allows us to do a simple substitution of the interrupt + * handler. + * + * The calibration routine also uses the pm_timer when possible, as the PIT + * happens to run way too slow (factor 2.3 on my VAIO CoreDuo, which goes + * back to normal later in the boot process). + */ + +#define LAPIC_CAL_LOOPS (HZ/10) + +static __initdata int lapic_cal_loops = -1; +static __initdata long lapic_cal_t1, lapic_cal_t2; +static __initdata unsigned long long lapic_cal_tsc1, lapic_cal_tsc2; +static __initdata unsigned long lapic_cal_pm1, lapic_cal_pm2; +static __initdata unsigned long lapic_cal_j1, lapic_cal_j2; + +/* + * Temporary interrupt handler. + */ +static void __init lapic_cal_handler(struct clock_event_device *dev) +{ + unsigned long long tsc = 0; + long tapic = apic_read(APIC_TMCCT); + unsigned long pm = acpi_pm_read_early(); + + if (cpu_has_tsc) + rdtscll(tsc); + + switch (lapic_cal_loops++) { + case 0: + lapic_cal_t1 = tapic; + lapic_cal_tsc1 = tsc; + lapic_cal_pm1 = pm; + lapic_cal_j1 = jiffies; + break; + + case LAPIC_CAL_LOOPS: + lapic_cal_t2 = tapic; + lapic_cal_tsc2 = tsc; + if (pm < lapic_cal_pm1) + pm += ACPI_PM_OVRRUN; + lapic_cal_pm2 = pm; + lapic_cal_j2 = jiffies; + break; + } +} + +static int __init +calibrate_by_pmtimer(long deltapm, long *delta, long *deltatsc) +{ + const long pm_100ms = PMTMR_TICKS_PER_SEC / 10; + const long pm_thresh = pm_100ms / 100; + unsigned long mult; + u64 res; + +#ifndef CONFIG_X86_PM_TIMER + return -1; +#endif + + apic_printk(APIC_VERBOSE, "... PM-Timer delta = %ld\n", deltapm); + + /* Check, if the PM timer is available */ + if (!deltapm) + return -1; + + mult = clocksource_hz2mult(PMTMR_TICKS_PER_SEC, 22); + + if (deltapm > (pm_100ms - pm_thresh) && + deltapm < (pm_100ms + pm_thresh)) { + apic_printk(APIC_VERBOSE, "... PM-Timer result ok\n"); + return 0; + } + + res = (((u64)deltapm) * mult) >> 22; + do_div(res, 1000000); + pr_warning("APIC calibration not consistent " + "with PM-Timer: %ldms instead of 100ms\n",(long)res); + + /* Correct the lapic counter value */ + res = (((u64)(*delta)) * pm_100ms); + do_div(res, deltapm); + pr_info("APIC delta adjusted to PM-Timer: " + "%lu (%ld)\n", (unsigned long)res, *delta); + *delta = (long)res; + + /* Correct the tsc counter value */ + if (cpu_has_tsc) { + res = (((u64)(*deltatsc)) * pm_100ms); + do_div(res, deltapm); + apic_printk(APIC_VERBOSE, "TSC delta adjusted to " + "PM-Timer: %lu (%ld) \n", + (unsigned long)res, *deltatsc); + *deltatsc = (long)res; + } + + return 0; +} + +static int __init calibrate_APIC_clock(void) +{ + struct clock_event_device *levt = &__get_cpu_var(lapic_events); + void (*real_handler)(struct clock_event_device *dev); + unsigned long deltaj; + long delta, deltatsc; + int pm_referenced = 0; + + local_irq_disable(); + + /* Replace the global interrupt handler */ + real_handler = global_clock_event->event_handler; + global_clock_event->event_handler = lapic_cal_handler; + + /* + * Setup the APIC counter to maximum. There is no way the lapic + * can underflow in the 100ms detection time frame + */ + __setup_APIC_LVTT(0xffffffff, 0, 0); + + /* Let the interrupts run */ + local_irq_enable(); + + while (lapic_cal_loops <= LAPIC_CAL_LOOPS) + cpu_relax(); + + local_irq_disable(); + + /* Restore the real event handler */ + global_clock_event->event_handler = real_handler; + + /* Build delta t1-t2 as apic timer counts down */ + delta = lapic_cal_t1 - lapic_cal_t2; + apic_printk(APIC_VERBOSE, "... lapic delta = %ld\n", delta); + + deltatsc = (long)(lapic_cal_tsc2 - lapic_cal_tsc1); + + /* we trust the PM based calibration if possible */ + pm_referenced = !calibrate_by_pmtimer(lapic_cal_pm2 - lapic_cal_pm1, + &delta, &deltatsc); + + /* Calculate the scaled math multiplication factor */ + lapic_clockevent.mult = div_sc(delta, TICK_NSEC * LAPIC_CAL_LOOPS, + lapic_clockevent.shift); + lapic_clockevent.max_delta_ns = + clockevent_delta2ns(0x7FFFFF, &lapic_clockevent); + lapic_clockevent.min_delta_ns = + clockevent_delta2ns(0xF, &lapic_clockevent); + + calibration_result = (delta * APIC_DIVISOR) / LAPIC_CAL_LOOPS; + + apic_printk(APIC_VERBOSE, "..... delta %ld\n", delta); + apic_printk(APIC_VERBOSE, "..... mult: %ld\n", lapic_clockevent.mult); + apic_printk(APIC_VERBOSE, "..... calibration result: %u\n", + calibration_result); + + if (cpu_has_tsc) { + apic_printk(APIC_VERBOSE, "..... CPU clock speed is " + "%ld.%04ld MHz.\n", + (deltatsc / LAPIC_CAL_LOOPS) / (1000000 / HZ), + (deltatsc / LAPIC_CAL_LOOPS) % (1000000 / HZ)); + } + + apic_printk(APIC_VERBOSE, "..... host bus clock speed is " + "%u.%04u MHz.\n", + calibration_result / (1000000 / HZ), + calibration_result % (1000000 / HZ)); + + /* + * Do a sanity check on the APIC calibration result + */ + if (calibration_result < (1000000 / HZ)) { + local_irq_enable(); + pr_warning("APIC frequency too slow, disabling apic timer\n"); + return -1; + } + + levt->features &= ~CLOCK_EVT_FEAT_DUMMY; + + /* + * PM timer calibration failed or not turned on + * so lets try APIC timer based calibration + */ + if (!pm_referenced) { + apic_printk(APIC_VERBOSE, "... verify APIC timer\n"); + + /* + * Setup the apic timer manually + */ + levt->event_handler = lapic_cal_handler; + lapic_timer_setup(CLOCK_EVT_MODE_PERIODIC, levt); + lapic_cal_loops = -1; + + /* Let the interrupts run */ + local_irq_enable(); + + while (lapic_cal_loops <= LAPIC_CAL_LOOPS) + cpu_relax(); + + /* Stop the lapic timer */ + lapic_timer_setup(CLOCK_EVT_MODE_SHUTDOWN, levt); + + /* Jiffies delta */ + deltaj = lapic_cal_j2 - lapic_cal_j1; + apic_printk(APIC_VERBOSE, "... jiffies delta = %lu\n", deltaj); + + /* Check, if the jiffies result is consistent */ + if (deltaj >= LAPIC_CAL_LOOPS-2 && deltaj <= LAPIC_CAL_LOOPS+2) + apic_printk(APIC_VERBOSE, "... jiffies result ok\n"); + else + levt->features |= CLOCK_EVT_FEAT_DUMMY; + } else + local_irq_enable(); + + if (levt->features & CLOCK_EVT_FEAT_DUMMY) { + pr_warning("APIC timer disabled due to verification failure\n"); + return -1; + } + + return 0; +} + +/* + * Setup the boot APIC + * + * Calibrate and verify the result. + */ +void __init setup_boot_APIC_clock(void) +{ + /* + * The local apic timer can be disabled via the kernel + * commandline or from the CPU detection code. Register the lapic + * timer as a dummy clock event source on SMP systems, so the + * broadcast mechanism is used. On UP systems simply ignore it. + */ + if (disable_apic_timer) { + pr_info("Disabling APIC timer\n"); + /* No broadcast on UP ! */ + if (num_possible_cpus() > 1) { + lapic_clockevent.mult = 1; + setup_APIC_timer(); + } + return; + } + + apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n" + "calibrating APIC timer ...\n"); + + if (calibrate_APIC_clock()) { + /* No broadcast on UP ! */ + if (num_possible_cpus() > 1) + setup_APIC_timer(); + return; + } + + /* + * If nmi_watchdog is set to IO_APIC, we need the + * PIT/HPET going. Otherwise register lapic as a dummy + * device. + */ + if (nmi_watchdog != NMI_IO_APIC) + lapic_clockevent.features &= ~CLOCK_EVT_FEAT_DUMMY; + else + pr_warning("APIC timer registered as dummy," + " due to nmi_watchdog=%d!\n", nmi_watchdog); + + /* Setup the lapic or request the broadcast */ + setup_APIC_timer(); +} + +void __cpuinit setup_secondary_APIC_clock(void) +{ + setup_APIC_timer(); +} + +/* + * The guts of the apic timer interrupt + */ +static void local_apic_timer_interrupt(void) +{ + int cpu = smp_processor_id(); + struct clock_event_device *evt = &per_cpu(lapic_events, cpu); + + /* + * Normally we should not be here till LAPIC has been initialized but + * in some cases like kdump, its possible that there is a pending LAPIC + * timer interrupt from previous kernel's context and is delivered in + * new kernel the moment interrupts are enabled. + * + * Interrupts are enabled early and LAPIC is setup much later, hence + * its possible that when we get here evt->event_handler is NULL. + * Check for event_handler being NULL and discard the interrupt as + * spurious. + */ + if (!evt->event_handler) { + pr_warning("Spurious LAPIC timer interrupt on cpu %d\n", cpu); + /* Switch it off */ + lapic_timer_setup(CLOCK_EVT_MODE_SHUTDOWN, evt); + return; + } + + /* + * the NMI deadlock-detector uses this. + */ + inc_irq_stat(apic_timer_irqs); + + evt->event_handler(evt); + + perf_counter_unthrottle(); +} + +/* + * Local APIC timer interrupt. This is the most natural way for doing + * local interrupts, but local timer interrupts can be emulated by + * broadcast interrupts too. [in case the hw doesn't support APIC timers] + * + * [ if a single-CPU system runs an SMP kernel then we call the local + * interrupt as well. Thus we cannot inline the local irq ... ] + */ +void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs) +{ + struct pt_regs *old_regs = set_irq_regs(regs); + + /* + * NOTE! We'd better ACK the irq immediately, + * because timer handling can be slow. + */ + ack_APIC_irq(); + /* + * update_process_times() expects us to have done irq_enter(). + * Besides, if we don't timer interrupts ignore the global + * interrupt lock, which is the WrongThing (tm) to do. + */ + exit_idle(); + irq_enter(); + local_apic_timer_interrupt(); + irq_exit(); + + set_irq_regs(old_regs); +} + +int setup_profiling_timer(unsigned int multiplier) +{ + return -EINVAL; +} + +/* + * Local APIC start and shutdown + */ + +/** + * clear_local_APIC - shutdown the local APIC + * + * This is called, when a CPU is disabled and before rebooting, so the state of + * the local APIC has no dangling leftovers. Also used to cleanout any BIOS + * leftovers during boot. + */ +void clear_local_APIC(void) +{ + int maxlvt; + u32 v; + + /* APIC hasn't been mapped yet */ + if (!apic_phys) + return; + + maxlvt = lapic_get_maxlvt(); + /* + * Masking an LVT entry can trigger a local APIC error + * if the vector is zero. Mask LVTERR first to prevent this. + */ + if (maxlvt >= 3) { + v = ERROR_APIC_VECTOR; /* any non-zero vector will do */ + apic_write(APIC_LVTERR, v | APIC_LVT_MASKED); + } + /* + * Careful: we have to set masks only first to deassert + * any level-triggered sources. + */ + v = apic_read(APIC_LVTT); + apic_write(APIC_LVTT, v | APIC_LVT_MASKED); + v = apic_read(APIC_LVT0); + apic_write(APIC_LVT0, v | APIC_LVT_MASKED); + v = apic_read(APIC_LVT1); + apic_write(APIC_LVT1, v | APIC_LVT_MASKED); + if (maxlvt >= 4) { + v = apic_read(APIC_LVTPC); + apic_write(APIC_LVTPC, v | APIC_LVT_MASKED); + } + + /* lets not touch this if we didn't frob it */ +#if defined(CONFIG_X86_MCE_P4THERMAL) || defined(CONFIG_X86_MCE_INTEL) + if (maxlvt >= 5) { + v = apic_read(APIC_LVTTHMR); + apic_write(APIC_LVTTHMR, v | APIC_LVT_MASKED); + } +#endif + /* + * Clean APIC state for other OSs: + */ + apic_write(APIC_LVTT, APIC_LVT_MASKED); + apic_write(APIC_LVT0, APIC_LVT_MASKED); + apic_write(APIC_LVT1, APIC_LVT_MASKED); + if (maxlvt >= 3) + apic_write(APIC_LVTERR, APIC_LVT_MASKED); + if (maxlvt >= 4) + apic_write(APIC_LVTPC, APIC_LVT_MASKED); + + /* Integrated APIC (!82489DX) ? */ + if (lapic_is_integrated()) { + if (maxlvt > 3) + /* Clear ESR due to Pentium errata 3AP and 11AP */ + apic_write(APIC_ESR, 0); + apic_read(APIC_ESR); + } +} + +/** + * disable_local_APIC - clear and disable the local APIC + */ +void disable_local_APIC(void) +{ + unsigned int value; + + /* APIC hasn't been mapped yet */ + if (!apic_phys) + return; + + clear_local_APIC(); + + /* + * Disable APIC (implies clearing of registers + * for 82489DX!). + */ + value = apic_read(APIC_SPIV); + value &= ~APIC_SPIV_APIC_ENABLED; + apic_write(APIC_SPIV, value); + +#ifdef CONFIG_X86_32 + /* + * When LAPIC was disabled by the BIOS and enabled by the kernel, + * restore the disabled state. + */ + if (enabled_via_apicbase) { + unsigned int l, h; + + rdmsr(MSR_IA32_APICBASE, l, h); + l &= ~MSR_IA32_APICBASE_ENABLE; + wrmsr(MSR_IA32_APICBASE, l, h); + } +#endif +} + +/* + * If Linux enabled the LAPIC against the BIOS default disable it down before + * re-entering the BIOS on shutdown. Otherwise the BIOS may get confused and + * not power-off. Additionally clear all LVT entries before disable_local_APIC + * for the case where Linux didn't enable the LAPIC. + */ +void lapic_shutdown(void) +{ + unsigned long flags; + + if (!cpu_has_apic) + return; + + local_irq_save(flags); + +#ifdef CONFIG_X86_32 + if (!enabled_via_apicbase) + clear_local_APIC(); + else +#endif + disable_local_APIC(); + + + local_irq_restore(flags); +} + +/* + * This is to verify that we're looking at a real local APIC. + * Check these against your board if the CPUs aren't getting + * started for no apparent reason. + */ +int __init verify_local_APIC(void) +{ + unsigned int reg0, reg1; + + /* + * The version register is read-only in a real APIC. + */ + reg0 = apic_read(APIC_LVR); + apic_printk(APIC_DEBUG, "Getting VERSION: %x\n", reg0); + apic_write(APIC_LVR, reg0 ^ APIC_LVR_MASK); + reg1 = apic_read(APIC_LVR); + apic_printk(APIC_DEBUG, "Getting VERSION: %x\n", reg1); + + /* + * The two version reads above should print the same + * numbers. If the second one is different, then we + * poke at a non-APIC. + */ + if (reg1 != reg0) + return 0; + + /* + * Check if the version looks reasonably. + */ + reg1 = GET_APIC_VERSION(reg0); + if (reg1 == 0x00 || reg1 == 0xff) + return 0; + reg1 = lapic_get_maxlvt(); + if (reg1 < 0x02 || reg1 == 0xff) + return 0; + + /* + * The ID register is read/write in a real APIC. + */ + reg0 = apic_read(APIC_ID); + apic_printk(APIC_DEBUG, "Getting ID: %x\n", reg0); + apic_write(APIC_ID, reg0 ^ apic->apic_id_mask); + reg1 = apic_read(APIC_ID); + apic_printk(APIC_DEBUG, "Getting ID: %x\n", reg1); + apic_write(APIC_ID, reg0); + if (reg1 != (reg0 ^ apic->apic_id_mask)) + return 0; + + /* + * The next two are just to see if we have sane values. + * They're only really relevant if we're in Virtual Wire + * compatibility mode, but most boxes are anymore. + */ + reg0 = apic_read(APIC_LVT0); + apic_printk(APIC_DEBUG, "Getting LVT0: %x\n", reg0); + reg1 = apic_read(APIC_LVT1); + apic_printk(APIC_DEBUG, "Getting LVT1: %x\n", reg1); + + return 1; +} + +/** + * sync_Arb_IDs - synchronize APIC bus arbitration IDs + */ +void __init sync_Arb_IDs(void) +{ + /* + * Unsupported on P4 - see Intel Dev. Manual Vol. 3, Ch. 8.6.1 And not + * needed on AMD. + */ + if (modern_apic() || boot_cpu_data.x86_vendor == X86_VENDOR_AMD) + return; + + /* + * Wait for idle. + */ + apic_wait_icr_idle(); + + apic_printk(APIC_DEBUG, "Synchronizing Arb IDs.\n"); + apic_write(APIC_ICR, APIC_DEST_ALLINC | + APIC_INT_LEVELTRIG | APIC_DM_INIT); +} + +/* + * An initial setup of the virtual wire mode. + */ +void __init init_bsp_APIC(void) +{ + unsigned int value; + + /* + * Don't do the setup now if we have a SMP BIOS as the + * through-I/O-APIC virtual wire mode might be active. + */ + if (smp_found_config || !cpu_has_apic) + return; + + /* + * Do not trust the local APIC being empty at bootup. + */ + clear_local_APIC(); + + /* + * Enable APIC. + */ + value = apic_read(APIC_SPIV); + value &= ~APIC_VECTOR_MASK; + value |= APIC_SPIV_APIC_ENABLED; + +#ifdef CONFIG_X86_32 + /* This bit is reserved on P4/Xeon and should be cleared */ + if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && + (boot_cpu_data.x86 == 15)) + value &= ~APIC_SPIV_FOCUS_DISABLED; + else +#endif + value |= APIC_SPIV_FOCUS_DISABLED; + value |= SPURIOUS_APIC_VECTOR; + apic_write(APIC_SPIV, value); + + /* + * Set up the virtual wire mode. + */ + apic_write(APIC_LVT0, APIC_DM_EXTINT); + value = APIC_DM_NMI; + if (!lapic_is_integrated()) /* 82489DX */ + value |= APIC_LVT_LEVEL_TRIGGER; + apic_write(APIC_LVT1, value); +} + +static void __cpuinit lapic_setup_esr(void) +{ + unsigned int oldvalue, value, maxlvt; + + if (!lapic_is_integrated()) { + pr_info("No ESR for 82489DX.\n"); + return; + } + + if (apic->disable_esr) { + /* + * Something untraceable is creating bad interrupts on + * secondary quads ... for the moment, just leave the + * ESR disabled - we can't do anything useful with the + * errors anyway - mbligh + */ + pr_info("Leaving ESR disabled.\n"); + return; + } + + maxlvt = lapic_get_maxlvt(); + if (maxlvt > 3) /* Due to the Pentium erratum 3AP. */ + apic_write(APIC_ESR, 0); + oldvalue = apic_read(APIC_ESR); + + /* enables sending errors */ + value = ERROR_APIC_VECTOR; + apic_write(APIC_LVTERR, value); + + /* + * spec says clear errors after enabling vector. + */ + if (maxlvt > 3) + apic_write(APIC_ESR, 0); + value = apic_read(APIC_ESR); + if (value != oldvalue) + apic_printk(APIC_VERBOSE, "ESR value before enabling " + "vector: 0x%08x after: 0x%08x\n", + oldvalue, value); +} + + +/** + * setup_local_APIC - setup the local APIC + */ +void __cpuinit setup_local_APIC(void) +{ + unsigned int value; + int i, j; + + if (disable_apic) { + arch_disable_smp_support(); + return; + } + +#ifdef CONFIG_X86_32 + /* Pound the ESR really hard over the head with a big hammer - mbligh */ + if (lapic_is_integrated() && apic->disable_esr) { + apic_write(APIC_ESR, 0); + apic_write(APIC_ESR, 0); + apic_write(APIC_ESR, 0); + apic_write(APIC_ESR, 0); + } +#endif + perf_counters_lapic_init(0); + + preempt_disable(); + + /* + * Double-check whether this APIC is really registered. + * This is meaningless in clustered apic mode, so we skip it. + */ + if (!apic->apic_id_registered()) + BUG(); + + /* + * Intel recommends to set DFR, LDR and TPR before enabling + * an APIC. See e.g. "AP-388 82489DX User's Manual" (Intel + * document number 292116). So here it goes... + */ + apic->init_apic_ldr(); + + /* + * Set Task Priority to 'accept all'. We never change this + * later on. + */ + value = apic_read(APIC_TASKPRI); + value &= ~APIC_TPRI_MASK; + apic_write(APIC_TASKPRI, value); + + /* + * After a crash, we no longer service the interrupts and a pending + * interrupt from previous kernel might still have ISR bit set. + * + * Most probably by now CPU has serviced that pending interrupt and + * it might not have done the ack_APIC_irq() because it thought, + * interrupt came from i8259 as ExtInt. LAPIC did not get EOI so it + * does not clear the ISR bit and cpu thinks it has already serivced + * the interrupt. Hence a vector might get locked. It was noticed + * for timer irq (vector 0x31). Issue an extra EOI to clear ISR. + */ + for (i = APIC_ISR_NR - 1; i >= 0; i--) { + value = apic_read(APIC_ISR + i*0x10); + for (j = 31; j >= 0; j--) { + if (value & (1< 1) || + (boot_cpu_data.x86 >= 15)) + break; + goto no_apic; + case X86_VENDOR_INTEL: + if (boot_cpu_data.x86 == 6 || boot_cpu_data.x86 == 15 || + (boot_cpu_data.x86 == 5 && cpu_has_apic)) + break; + goto no_apic; + default: + goto no_apic; + } + + if (!cpu_has_apic) { + /* + * Over-ride BIOS and try to enable the local APIC only if + * "lapic" specified. + */ + if (!force_enable_local_apic) { + pr_info("Local APIC disabled by BIOS -- " + "you can enable it with \"lapic\"\n"); + return -1; + } + /* + * Some BIOSes disable the local APIC in the APIC_BASE + * MSR. This can only be done in software for Intel P6 or later + * and AMD K7 (Model > 1) or later. + */ + rdmsr(MSR_IA32_APICBASE, l, h); + if (!(l & MSR_IA32_APICBASE_ENABLE)) { + pr_info("Local APIC disabled by BIOS -- reenabling.\n"); + l &= ~MSR_IA32_APICBASE_BASE; + l |= MSR_IA32_APICBASE_ENABLE | APIC_DEFAULT_PHYS_BASE; + wrmsr(MSR_IA32_APICBASE, l, h); + enabled_via_apicbase = 1; + } + } + /* + * The APIC feature bit should now be enabled + * in `cpuid' + */ + features = cpuid_edx(1); + if (!(features & (1 << X86_FEATURE_APIC))) { + pr_warning("Could not enable APIC!\n"); + return -1; + } + set_cpu_cap(&boot_cpu_data, X86_FEATURE_APIC); + mp_lapic_addr = APIC_DEFAULT_PHYS_BASE; + + /* The BIOS may have set up the APIC at some other address */ + rdmsr(MSR_IA32_APICBASE, l, h); + if (l & MSR_IA32_APICBASE_ENABLE) + mp_lapic_addr = l & MSR_IA32_APICBASE_BASE; + + pr_info("Found and enabled local APIC!\n"); + + apic_pm_activate(); + + return 0; + +no_apic: + pr_info("No local APIC present or hardware disabled\n"); + return -1; +} +#endif + +#ifdef CONFIG_X86_64 +void __init early_init_lapic_mapping(void) +{ + unsigned long phys_addr; + + /* + * If no local APIC can be found then go out + * : it means there is no mpatable and MADT + */ + if (!smp_found_config) + return; + + phys_addr = mp_lapic_addr; + + set_fixmap_nocache(FIX_APIC_BASE, phys_addr); + apic_printk(APIC_VERBOSE, "mapped APIC to %16lx (%16lx)\n", + APIC_BASE, phys_addr); + + /* + * Fetch the APIC ID of the BSP in case we have a + * default configuration (or the MP table is broken). + */ + boot_cpu_physical_apicid = read_apic_id(); +} +#endif + +/** + * init_apic_mappings - initialize APIC mappings + */ +void __init init_apic_mappings(void) +{ +#ifdef CONFIG_X86_X2APIC + if (x2apic) { + boot_cpu_physical_apicid = read_apic_id(); + return; + } +#endif + + /* + * If no local APIC can be found then set up a fake all + * zeroes page to simulate the local APIC and another + * one for the IO-APIC. + */ + if (!smp_found_config && detect_init_APIC()) { + apic_phys = (unsigned long) alloc_bootmem_pages(PAGE_SIZE); + apic_phys = __pa(apic_phys); + } else + apic_phys = mp_lapic_addr; + + set_fixmap_nocache(FIX_APIC_BASE, apic_phys); + apic_printk(APIC_VERBOSE, "mapped APIC to %08lx (%08lx)\n", + APIC_BASE, apic_phys); + + /* + * Fetch the APIC ID of the BSP in case we have a + * default configuration (or the MP table is broken). + */ + if (boot_cpu_physical_apicid == -1U) + boot_cpu_physical_apicid = read_apic_id(); +} + +/* + * This initializes the IO-APIC and APIC hardware if this is + * a UP kernel. + */ +int apic_version[MAX_APICS]; + +int __init APIC_init_uniprocessor(void) +{ + if (disable_apic) { + pr_info("Apic disabled\n"); + return -1; + } +#ifdef CONFIG_X86_64 + if (!cpu_has_apic) { + disable_apic = 1; + pr_info("Apic disabled by BIOS\n"); + return -1; + } +#else + if (!smp_found_config && !cpu_has_apic) + return -1; + + /* + * Complain if the BIOS pretends there is one. + */ + if (!cpu_has_apic && + APIC_INTEGRATED(apic_version[boot_cpu_physical_apicid])) { + pr_err("BIOS bug, local APIC 0x%x not detected!...\n", + boot_cpu_physical_apicid); + clear_cpu_cap(&boot_cpu_data, X86_FEATURE_APIC); + return -1; + } +#endif + + enable_IR_x2apic(); +#ifdef CONFIG_X86_64 + default_setup_apic_routing(); +#endif + + verify_local_APIC(); + connect_bsp_APIC(); + +#ifdef CONFIG_X86_64 + apic_write(APIC_ID, SET_APIC_ID(boot_cpu_physical_apicid)); +#else + /* + * Hack: In case of kdump, after a crash, kernel might be booting + * on a cpu with non-zero lapic id. But boot_cpu_physical_apicid + * might be zero if read from MP tables. Get it from LAPIC. + */ +# ifdef CONFIG_CRASH_DUMP + boot_cpu_physical_apicid = read_apic_id(); +# endif +#endif + physid_set_mask_of_physid(boot_cpu_physical_apicid, &phys_cpu_present_map); + setup_local_APIC(); + +#ifdef CONFIG_X86_IO_APIC + /* + * Now enable IO-APICs, actually call clear_IO_APIC + * We need clear_IO_APIC before enabling error vector + */ + if (!skip_ioapic_setup && nr_ioapics) + enable_IO_APIC(); +#endif + + end_local_APIC_setup(); + +#ifdef CONFIG_X86_IO_APIC + if (smp_found_config && !skip_ioapic_setup && nr_ioapics) + setup_IO_APIC(); + else { + nr_ioapics = 0; + localise_nmi_watchdog(); + } +#else + localise_nmi_watchdog(); +#endif + + setup_boot_clock(); +#ifdef CONFIG_X86_64 + check_nmi_watchdog(); +#endif + + return 0; +} + +/* + * Local APIC interrupts + */ + +/* + * This interrupt should _never_ happen with our APIC/SMP architecture + */ +void smp_spurious_interrupt(struct pt_regs *regs) +{ + u32 v; + + exit_idle(); + irq_enter(); + /* + * Check if this really is a spurious interrupt and ACK it + * if it is a vectored one. Just in case... + * Spurious interrupts should not be ACKed. + */ + v = apic_read(APIC_ISR + ((SPURIOUS_APIC_VECTOR & ~0x1f) >> 1)); + if (v & (1 << (SPURIOUS_APIC_VECTOR & 0x1f))) + ack_APIC_irq(); + + inc_irq_stat(irq_spurious_count); + + /* see sw-dev-man vol 3, chapter 7.4.13.5 */ + pr_info("spurious APIC interrupt on CPU#%d, " + "should never happen.\n", smp_processor_id()); + irq_exit(); +} + +/* + * This interrupt should never happen with our APIC/SMP architecture + */ +void smp_error_interrupt(struct pt_regs *regs) +{ + u32 v, v1; + + exit_idle(); + irq_enter(); + /* First tickle the hardware, only then report what went on. -- REW */ + v = apic_read(APIC_ESR); + apic_write(APIC_ESR, 0); + v1 = apic_read(APIC_ESR); + ack_APIC_irq(); + atomic_inc(&irq_err_count); + + /* + * Here is what the APIC error bits mean: + * 0: Send CS error + * 1: Receive CS error + * 2: Send accept error + * 3: Receive accept error + * 4: Reserved + * 5: Send illegal vector + * 6: Received illegal vector + * 7: Illegal register address + */ + pr_debug("APIC error on CPU%d: %02x(%02x)\n", + smp_processor_id(), v , v1); + irq_exit(); +} + +/** + * connect_bsp_APIC - attach the APIC to the interrupt system + */ +void __init connect_bsp_APIC(void) +{ +#ifdef CONFIG_X86_32 + if (pic_mode) { + /* + * Do not trust the local APIC being empty at bootup. + */ + clear_local_APIC(); + /* + * PIC mode, enable APIC mode in the IMCR, i.e. connect BSP's + * local APIC to INT and NMI lines. + */ + apic_printk(APIC_VERBOSE, "leaving PIC mode, " + "enabling APIC mode.\n"); + outb(0x70, 0x22); + outb(0x01, 0x23); + } +#endif + if (apic->enable_apic_mode) + apic->enable_apic_mode(); +} + +/** + * disconnect_bsp_APIC - detach the APIC from the interrupt system + * @virt_wire_setup: indicates, whether virtual wire mode is selected + * + * Virtual wire mode is necessary to deliver legacy interrupts even when the + * APIC is disabled. + */ +void disconnect_bsp_APIC(int virt_wire_setup) +{ + unsigned int value; + +#ifdef CONFIG_X86_32 + if (pic_mode) { + /* + * Put the board back into PIC mode (has an effect only on + * certain older boards). Note that APIC interrupts, including + * IPIs, won't work beyond this point! The only exception are + * INIT IPIs. + */ + apic_printk(APIC_VERBOSE, "disabling APIC mode, " + "entering PIC mode.\n"); + outb(0x70, 0x22); + outb(0x00, 0x23); + return; + } +#endif + + /* Go back to Virtual Wire compatibility mode */ + + /* For the spurious interrupt use vector F, and enable it */ + value = apic_read(APIC_SPIV); + value &= ~APIC_VECTOR_MASK; + value |= APIC_SPIV_APIC_ENABLED; + value |= 0xf; + apic_write(APIC_SPIV, value); + + if (!virt_wire_setup) { + /* + * For LVT0 make it edge triggered, active high, + * external and enabled + */ + value = apic_read(APIC_LVT0); + value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING | + APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR | + APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED); + value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING; + value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_EXTINT); + apic_write(APIC_LVT0, value); + } else { + /* Disable LVT0 */ + apic_write(APIC_LVT0, APIC_LVT_MASKED); + } + + /* + * For LVT1 make it edge triggered, active high, + * nmi and enabled + */ + value = apic_read(APIC_LVT1); + value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING | + APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR | + APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED); + value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING; + value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_NMI); + apic_write(APIC_LVT1, value); +} + +void __cpuinit generic_processor_info(int apicid, int version) +{ + int cpu; + + /* + * Validate version + */ + if (version == 0x0) { + pr_warning("BIOS bug, APIC version is 0 for CPU#%d! " + "fixing up to 0x10. (tell your hw vendor)\n", + version); + version = 0x10; + } + apic_version[apicid] = version; + + if (num_processors >= nr_cpu_ids) { + int max = nr_cpu_ids; + int thiscpu = max + disabled_cpus; + + pr_warning( + "ACPI: NR_CPUS/possible_cpus limit of %i reached." + " Processor %d/0x%x ignored.\n", max, thiscpu, apicid); + + disabled_cpus++; + return; + } + + num_processors++; + cpu = cpumask_next_zero(-1, cpu_present_mask); + + if (version != apic_version[boot_cpu_physical_apicid]) + WARN_ONCE(1, + "ACPI: apic version mismatch, bootcpu: %x cpu %d: %x\n", + apic_version[boot_cpu_physical_apicid], cpu, version); + + physid_set(apicid, phys_cpu_present_map); + if (apicid == boot_cpu_physical_apicid) { + /* + * x86_bios_cpu_apicid is required to have processors listed + * in same order as logical cpu numbers. Hence the first + * entry is BSP, and so on. + */ + cpu = 0; + } + if (apicid > max_physical_apicid) + max_physical_apicid = apicid; + +#ifdef CONFIG_X86_32 + /* + * Would be preferable to switch to bigsmp when CONFIG_HOTPLUG_CPU=y + * but we need to work other dependencies like SMP_SUSPEND etc + * before this can be done without some confusion. + * if (CPU_HOTPLUG_ENABLED || num_processors > 8) + * - Ashok Raj + */ + if (max_physical_apicid >= 8) { + switch (boot_cpu_data.x86_vendor) { + case X86_VENDOR_INTEL: + if (!APIC_XAPIC(version)) { + def_to_bigsmp = 0; + break; + } + /* If P4 and above fall through */ + case X86_VENDOR_AMD: + def_to_bigsmp = 1; + } + } +#endif + +#if defined(CONFIG_SMP) || defined(CONFIG_X86_64) + early_per_cpu(x86_cpu_to_apicid, cpu) = apicid; + early_per_cpu(x86_bios_cpu_apicid, cpu) = apicid; +#endif + + set_cpu_possible(cpu, true); + set_cpu_present(cpu, true); +} + +int hard_smp_processor_id(void) +{ + return read_apic_id(); +} + +void default_init_apic_ldr(void) +{ + unsigned long val; + + apic_write(APIC_DFR, APIC_DFR_VALUE); + val = apic_read(APIC_LDR) & ~APIC_LDR_MASK; + val |= SET_APIC_LOGICAL_ID(1UL << smp_processor_id()); + apic_write(APIC_LDR, val); +} + +#ifdef CONFIG_X86_32 +int default_apicid_to_node(int logical_apicid) +{ +#ifdef CONFIG_SMP + return apicid_2_node[hard_smp_processor_id()]; +#else + return 0; +#endif +} +#endif + +/* + * Power management + */ +#ifdef CONFIG_PM + +static struct { + /* + * 'active' is true if the local APIC was enabled by us and + * not the BIOS; this signifies that we are also responsible + * for disabling it before entering apm/acpi suspend + */ + int active; + /* r/w apic fields */ + unsigned int apic_id; + unsigned int apic_taskpri; + unsigned int apic_ldr; + unsigned int apic_dfr; + unsigned int apic_spiv; + unsigned int apic_lvtt; + unsigned int apic_lvtpc; + unsigned int apic_lvt0; + unsigned int apic_lvt1; + unsigned int apic_lvterr; + unsigned int apic_tmict; + unsigned int apic_tdcr; + unsigned int apic_thmr; +} apic_pm_state; + +static int lapic_suspend(struct sys_device *dev, pm_message_t state) +{ + unsigned long flags; + int maxlvt; + + if (!apic_pm_state.active) + return 0; + + maxlvt = lapic_get_maxlvt(); + + apic_pm_state.apic_id = apic_read(APIC_ID); + apic_pm_state.apic_taskpri = apic_read(APIC_TASKPRI); + apic_pm_state.apic_ldr = apic_read(APIC_LDR); + apic_pm_state.apic_dfr = apic_read(APIC_DFR); + apic_pm_state.apic_spiv = apic_read(APIC_SPIV); + apic_pm_state.apic_lvtt = apic_read(APIC_LVTT); + if (maxlvt >= 4) + apic_pm_state.apic_lvtpc = apic_read(APIC_LVTPC); + apic_pm_state.apic_lvt0 = apic_read(APIC_LVT0); + apic_pm_state.apic_lvt1 = apic_read(APIC_LVT1); + apic_pm_state.apic_lvterr = apic_read(APIC_LVTERR); + apic_pm_state.apic_tmict = apic_read(APIC_TMICT); + apic_pm_state.apic_tdcr = apic_read(APIC_TDCR); +#if defined(CONFIG_X86_MCE_P4THERMAL) || defined(CONFIG_X86_MCE_INTEL) + if (maxlvt >= 5) + apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR); +#endif + + local_irq_save(flags); + disable_local_APIC(); + local_irq_restore(flags); + return 0; +} + +static int lapic_resume(struct sys_device *dev) +{ + unsigned int l, h; + unsigned long flags; + int maxlvt; + + if (!apic_pm_state.active) + return 0; + + maxlvt = lapic_get_maxlvt(); + + local_irq_save(flags); + +#ifdef CONFIG_X86_X2APIC + if (x2apic) + enable_x2apic(); + else +#endif + { + /* + * Make sure the APICBASE points to the right address + * + * FIXME! This will be wrong if we ever support suspend on + * SMP! We'll need to do this as part of the CPU restore! + */ + rdmsr(MSR_IA32_APICBASE, l, h); + l &= ~MSR_IA32_APICBASE_BASE; + l |= MSR_IA32_APICBASE_ENABLE | mp_lapic_addr; + wrmsr(MSR_IA32_APICBASE, l, h); + } + + apic_write(APIC_LVTERR, ERROR_APIC_VECTOR | APIC_LVT_MASKED); + apic_write(APIC_ID, apic_pm_state.apic_id); + apic_write(APIC_DFR, apic_pm_state.apic_dfr); + apic_write(APIC_LDR, apic_pm_state.apic_ldr); + apic_write(APIC_TASKPRI, apic_pm_state.apic_taskpri); + apic_write(APIC_SPIV, apic_pm_state.apic_spiv); + apic_write(APIC_LVT0, apic_pm_state.apic_lvt0); + apic_write(APIC_LVT1, apic_pm_state.apic_lvt1); +#if defined(CONFIG_X86_MCE_P4THERMAL) || defined(CONFIG_X86_MCE_INTEL) + if (maxlvt >= 5) + apic_write(APIC_LVTTHMR, apic_pm_state.apic_thmr); +#endif + if (maxlvt >= 4) + apic_write(APIC_LVTPC, apic_pm_state.apic_lvtpc); + apic_write(APIC_LVTT, apic_pm_state.apic_lvtt); + apic_write(APIC_TDCR, apic_pm_state.apic_tdcr); + apic_write(APIC_TMICT, apic_pm_state.apic_tmict); + apic_write(APIC_ESR, 0); + apic_read(APIC_ESR); + apic_write(APIC_LVTERR, apic_pm_state.apic_lvterr); + apic_write(APIC_ESR, 0); + apic_read(APIC_ESR); + + local_irq_restore(flags); + + return 0; +} + +/* + * This device has no shutdown method - fully functioning local APICs + * are needed on every CPU up until machine_halt/restart/poweroff. + */ + +static struct sysdev_class lapic_sysclass = { + .name = "lapic", + .resume = lapic_resume, + .suspend = lapic_suspend, +}; + +static struct sys_device device_lapic = { + .id = 0, + .cls = &lapic_sysclass, +}; + +static void __cpuinit apic_pm_activate(void) +{ + apic_pm_state.active = 1; +} + +static int __init init_lapic_sysfs(void) +{ + int error; + + if (!cpu_has_apic) + return 0; + /* XXX: remove suspend/resume procs if !apic_pm_state.active? */ + + error = sysdev_class_register(&lapic_sysclass); + if (!error) + error = sysdev_register(&device_lapic); + return error; +} +device_initcall(init_lapic_sysfs); + +#else /* CONFIG_PM */ + +static void apic_pm_activate(void) { } + +#endif /* CONFIG_PM */ + +#ifdef CONFIG_X86_64 +/* + * apic_is_clustered_box() -- Check if we can expect good TSC + * + * Thus far, the major user of this is IBM's Summit2 series: + * + * Clustered boxes may have unsynced TSC problems if they are + * multi-chassis. Use available data to take a good guess. + * If in doubt, go HPET. + */ +__cpuinit int apic_is_clustered_box(void) +{ + int i, clusters, zeros; + unsigned id; + u16 *bios_cpu_apicid; + DECLARE_BITMAP(clustermap, NUM_APIC_CLUSTERS); + + /* + * there is not this kind of box with AMD CPU yet. + * Some AMD box with quadcore cpu and 8 sockets apicid + * will be [4, 0x23] or [8, 0x27] could be thought to + * vsmp box still need checking... + */ + if ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) && !is_vsmp_box()) + return 0; + + bios_cpu_apicid = early_per_cpu_ptr(x86_bios_cpu_apicid); + bitmap_zero(clustermap, NUM_APIC_CLUSTERS); + + for (i = 0; i < nr_cpu_ids; i++) { + /* are we being called early in kernel startup? */ + if (bios_cpu_apicid) { + id = bios_cpu_apicid[i]; + } else if (i < nr_cpu_ids) { + if (cpu_present(i)) + id = per_cpu(x86_bios_cpu_apicid, i); + else + continue; + } else + break; + + if (id != BAD_APICID) + __set_bit(APIC_CLUSTERID(id), clustermap); + } + + /* Problem: Partially populated chassis may not have CPUs in some of + * the APIC clusters they have been allocated. Only present CPUs have + * x86_bios_cpu_apicid entries, thus causing zeroes in the bitmap. + * Since clusters are allocated sequentially, count zeros only if + * they are bounded by ones. + */ + clusters = 0; + zeros = 0; + for (i = 0; i < NUM_APIC_CLUSTERS; i++) { + if (test_bit(i, clustermap)) { + clusters += 1 + zeros; + zeros = 0; + } else + ++zeros; + } + + /* ScaleMP vSMPowered boxes have one cluster per board and TSCs are + * not guaranteed to be synced between boards + */ + if (is_vsmp_box() && clusters > 1) + return 1; + + /* + * If clusters > 2, then should be multi-chassis. + * May have to revisit this when multi-core + hyperthreaded CPUs come + * out, but AFAIK this will work even for them. + */ + return (clusters > 2); +} +#endif + +/* + * APIC command line parameters + */ +static int __init setup_disableapic(char *arg) +{ + disable_apic = 1; + setup_clear_cpu_cap(X86_FEATURE_APIC); + return 0; +} +early_param("disableapic", setup_disableapic); + +/* same as disableapic, for compatibility */ +static int __init setup_nolapic(char *arg) +{ + return setup_disableapic(arg); +} +early_param("nolapic", setup_nolapic); + +static int __init parse_lapic_timer_c2_ok(char *arg) +{ + local_apic_timer_c2_ok = 1; + return 0; +} +early_param("lapic_timer_c2_ok", parse_lapic_timer_c2_ok); + +static int __init parse_disable_apic_timer(char *arg) +{ + disable_apic_timer = 1; + return 0; +} +early_param("noapictimer", parse_disable_apic_timer); + +static int __init parse_nolapic_timer(char *arg) +{ + disable_apic_timer = 1; + return 0; +} +early_param("nolapic_timer", parse_nolapic_timer); + +static int __init apic_set_verbosity(char *arg) +{ + if (!arg) { +#ifdef CONFIG_X86_64 + skip_ioapic_setup = 0; + return 0; +#endif + return -EINVAL; + } + + if (strcmp("debug", arg) == 0) + apic_verbosity = APIC_DEBUG; + else if (strcmp("verbose", arg) == 0) + apic_verbosity = APIC_VERBOSE; + else { + pr_warning("APIC Verbosity level %s not recognised" + " use apic=verbose or apic=debug\n", arg); + return -EINVAL; + } + + return 0; +} +early_param("apic", apic_set_verbosity); + +static int __init lapic_insert_resource(void) +{ + if (!apic_phys) + return -1; + + /* Put local APIC into the resource map. */ + lapic_resource.start = apic_phys; + lapic_resource.end = lapic_resource.start + PAGE_SIZE - 1; + insert_resource(&iomem_resource, &lapic_resource); + + return 0; +} + +/* + * need call insert after e820_reserve_resources() + * that is using request_resource + */ +late_initcall(lapic_insert_resource); Index: linux-2.6-tip/arch/x86/kernel/apic/apic_flat_64.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/apic/apic_flat_64.c @@ -0,0 +1,389 @@ +/* + * Copyright 2004 James Cleverdon, IBM. + * Subject to the GNU Public License, v.2 + * + * Flat APIC subarch code. + * + * Hacked for x86-64 by James Cleverdon from i386 architecture code by + * Martin Bligh, Andi Kleen, James Bottomley, John Stultz, and + * James Cleverdon. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_ACPI +#include +#endif + +static int flat_acpi_madt_oem_check(char *oem_id, char *oem_table_id) +{ + return 1; +} + +static const struct cpumask *flat_target_cpus(void) +{ + return cpu_online_mask; +} + +static void flat_vector_allocation_domain(int cpu, struct cpumask *retmask) +{ + /* Careful. Some cpus do not strictly honor the set of cpus + * specified in the interrupt destination when using lowest + * priority interrupt delivery mode. + * + * In particular there was a hyperthreading cpu observed to + * deliver interrupts to the wrong hyperthread when only one + * hyperthread was specified in the interrupt desitination. + */ + cpumask_clear(retmask); + cpumask_bits(retmask)[0] = APIC_ALL_CPUS; +} + +/* + * Set up the logical destination ID. + * + * Intel recommends to set DFR, LDR and TPR before enabling + * an APIC. See e.g. "AP-388 82489DX User's Manual" (Intel + * document number 292116). So here it goes... + */ +static void flat_init_apic_ldr(void) +{ + unsigned long val; + unsigned long num, id; + + num = smp_processor_id(); + id = 1UL << num; + apic_write(APIC_DFR, APIC_DFR_FLAT); + val = apic_read(APIC_LDR) & ~APIC_LDR_MASK; + val |= SET_APIC_LOGICAL_ID(id); + apic_write(APIC_LDR, val); +} + +static inline void _flat_send_IPI_mask(unsigned long mask, int vector) +{ + unsigned long flags; + + local_irq_save(flags); + __default_send_IPI_dest_field(mask, vector, apic->dest_logical); + local_irq_restore(flags); +} + +static void flat_send_IPI_mask(const struct cpumask *cpumask, int vector) +{ + unsigned long mask = cpumask_bits(cpumask)[0]; + + _flat_send_IPI_mask(mask, vector); +} + +static void + flat_send_IPI_mask_allbutself(const struct cpumask *cpumask, int vector) +{ + unsigned long mask = cpumask_bits(cpumask)[0]; + int cpu = smp_processor_id(); + + if (cpu < BITS_PER_LONG) + clear_bit(cpu, &mask); + + _flat_send_IPI_mask(mask, vector); +} + +static void flat_send_IPI_allbutself(int vector) +{ + int cpu = smp_processor_id(); +#ifdef CONFIG_HOTPLUG_CPU + int hotplug = 1; +#else + int hotplug = 0; +#endif + if (hotplug || vector == NMI_VECTOR) { + if (!cpumask_equal(cpu_online_mask, cpumask_of(cpu))) { + unsigned long mask = cpumask_bits(cpu_online_mask)[0]; + + if (cpu < BITS_PER_LONG) + clear_bit(cpu, &mask); + + _flat_send_IPI_mask(mask, vector); + } + } else if (num_online_cpus() > 1) { + __default_send_IPI_shortcut(APIC_DEST_ALLBUT, + vector, apic->dest_logical); + } +} + +static void flat_send_IPI_all(int vector) +{ + if (vector == NMI_VECTOR) { + flat_send_IPI_mask(cpu_online_mask, vector); + } else { + __default_send_IPI_shortcut(APIC_DEST_ALLINC, + vector, apic->dest_logical); + } +} + +static unsigned int flat_get_apic_id(unsigned long x) +{ + unsigned int id; + + id = (((x)>>24) & 0xFFu); + + return id; +} + +static unsigned long set_apic_id(unsigned int id) +{ + unsigned long x; + + x = ((id & 0xFFu)<<24); + return x; +} + +static unsigned int read_xapic_id(void) +{ + unsigned int id; + + id = flat_get_apic_id(apic_read(APIC_ID)); + return id; +} + +static int flat_apic_id_registered(void) +{ + return physid_isset(read_xapic_id(), phys_cpu_present_map); +} + +static unsigned int flat_cpu_mask_to_apicid(const struct cpumask *cpumask) +{ + return cpumask_bits(cpumask)[0] & APIC_ALL_CPUS; +} + +static unsigned int flat_cpu_mask_to_apicid_and(const struct cpumask *cpumask, + const struct cpumask *andmask) +{ + unsigned long mask1 = cpumask_bits(cpumask)[0] & APIC_ALL_CPUS; + unsigned long mask2 = cpumask_bits(andmask)[0] & APIC_ALL_CPUS; + + return mask1 & mask2; +} + +static int flat_phys_pkg_id(int initial_apic_id, int index_msb) +{ + return hard_smp_processor_id() >> index_msb; +} + +struct apic apic_flat = { + .name = "flat", + .probe = NULL, + .acpi_madt_oem_check = flat_acpi_madt_oem_check, + .apic_id_registered = flat_apic_id_registered, + + .irq_delivery_mode = dest_LowestPrio, + .irq_dest_mode = 1, /* logical */ + + .target_cpus = flat_target_cpus, + .disable_esr = 0, + .dest_logical = APIC_DEST_LOGICAL, + .check_apicid_used = NULL, + .check_apicid_present = NULL, + + .vector_allocation_domain = flat_vector_allocation_domain, + .init_apic_ldr = flat_init_apic_ldr, + + .ioapic_phys_id_map = NULL, + .setup_apic_routing = NULL, + .multi_timer_check = NULL, + .apicid_to_node = NULL, + .cpu_to_logical_apicid = NULL, + .cpu_present_to_apicid = default_cpu_present_to_apicid, + .apicid_to_cpu_present = NULL, + .setup_portio_remap = NULL, + .check_phys_apicid_present = default_check_phys_apicid_present, + .enable_apic_mode = NULL, + .phys_pkg_id = flat_phys_pkg_id, + .mps_oem_check = NULL, + + .get_apic_id = flat_get_apic_id, + .set_apic_id = set_apic_id, + .apic_id_mask = 0xFFu << 24, + + .cpu_mask_to_apicid = flat_cpu_mask_to_apicid, + .cpu_mask_to_apicid_and = flat_cpu_mask_to_apicid_and, + + .send_IPI_mask = flat_send_IPI_mask, + .send_IPI_mask_allbutself = flat_send_IPI_mask_allbutself, + .send_IPI_allbutself = flat_send_IPI_allbutself, + .send_IPI_all = flat_send_IPI_all, + .send_IPI_self = apic_send_IPI_self, + + .wakeup_cpu = NULL, + .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW, + .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH, + .wait_for_init_deassert = NULL, + .smp_callin_clear_local_apic = NULL, + .inquire_remote_apic = NULL, + + .read = native_apic_mem_read, + .write = native_apic_mem_write, + .icr_read = native_apic_icr_read, + .icr_write = native_apic_icr_write, + .wait_icr_idle = native_apic_wait_icr_idle, + .safe_wait_icr_idle = native_safe_apic_wait_icr_idle, +}; + +/* + * Physflat mode is used when there are more than 8 CPUs on a AMD system. + * We cannot use logical delivery in this case because the mask + * overflows, so use physical mode. + */ +static int physflat_acpi_madt_oem_check(char *oem_id, char *oem_table_id) +{ +#ifdef CONFIG_ACPI + /* + * Quirk: some x86_64 machines can only use physical APIC mode + * regardless of how many processors are present (x86_64 ES7000 + * is an example). + */ + if (acpi_gbl_FADT.header.revision > FADT2_REVISION_ID && + (acpi_gbl_FADT.flags & ACPI_FADT_APIC_PHYSICAL)) { + printk(KERN_DEBUG "system APIC only can use physical flat"); + return 1; + } +#endif + + return 0; +} + +static const struct cpumask *physflat_target_cpus(void) +{ + return cpu_online_mask; +} + +static void physflat_vector_allocation_domain(int cpu, struct cpumask *retmask) +{ + cpumask_clear(retmask); + cpumask_set_cpu(cpu, retmask); +} + +static void physflat_send_IPI_mask(const struct cpumask *cpumask, int vector) +{ + default_send_IPI_mask_sequence_phys(cpumask, vector); +} + +static void physflat_send_IPI_mask_allbutself(const struct cpumask *cpumask, + int vector) +{ + default_send_IPI_mask_allbutself_phys(cpumask, vector); +} + +static void physflat_send_IPI_allbutself(int vector) +{ + default_send_IPI_mask_allbutself_phys(cpu_online_mask, vector); +} + +static void physflat_send_IPI_all(int vector) +{ + physflat_send_IPI_mask(cpu_online_mask, vector); +} + +static unsigned int physflat_cpu_mask_to_apicid(const struct cpumask *cpumask) +{ + int cpu; + + /* + * We're using fixed IRQ delivery, can only return one phys APIC ID. + * May as well be the first. + */ + cpu = cpumask_first(cpumask); + if ((unsigned)cpu < nr_cpu_ids) + return per_cpu(x86_cpu_to_apicid, cpu); + else + return BAD_APICID; +} + +static unsigned int +physflat_cpu_mask_to_apicid_and(const struct cpumask *cpumask, + const struct cpumask *andmask) +{ + int cpu; + + /* + * We're using fixed IRQ delivery, can only return one phys APIC ID. + * May as well be the first. + */ + for_each_cpu_and(cpu, cpumask, andmask) { + if (cpumask_test_cpu(cpu, cpu_online_mask)) + break; + } + if (cpu < nr_cpu_ids) + return per_cpu(x86_cpu_to_apicid, cpu); + + return BAD_APICID; +} + +struct apic apic_physflat = { + + .name = "physical flat", + .probe = NULL, + .acpi_madt_oem_check = physflat_acpi_madt_oem_check, + .apic_id_registered = flat_apic_id_registered, + + .irq_delivery_mode = dest_Fixed, + .irq_dest_mode = 0, /* physical */ + + .target_cpus = physflat_target_cpus, + .disable_esr = 0, + .dest_logical = 0, + .check_apicid_used = NULL, + .check_apicid_present = NULL, + + .vector_allocation_domain = physflat_vector_allocation_domain, + /* not needed, but shouldn't hurt: */ + .init_apic_ldr = flat_init_apic_ldr, + + .ioapic_phys_id_map = NULL, + .setup_apic_routing = NULL, + .multi_timer_check = NULL, + .apicid_to_node = NULL, + .cpu_to_logical_apicid = NULL, + .cpu_present_to_apicid = default_cpu_present_to_apicid, + .apicid_to_cpu_present = NULL, + .setup_portio_remap = NULL, + .check_phys_apicid_present = default_check_phys_apicid_present, + .enable_apic_mode = NULL, + .phys_pkg_id = flat_phys_pkg_id, + .mps_oem_check = NULL, + + .get_apic_id = flat_get_apic_id, + .set_apic_id = set_apic_id, + .apic_id_mask = 0xFFu << 24, + + .cpu_mask_to_apicid = physflat_cpu_mask_to_apicid, + .cpu_mask_to_apicid_and = physflat_cpu_mask_to_apicid_and, + + .send_IPI_mask = physflat_send_IPI_mask, + .send_IPI_mask_allbutself = physflat_send_IPI_mask_allbutself, + .send_IPI_allbutself = physflat_send_IPI_allbutself, + .send_IPI_all = physflat_send_IPI_all, + .send_IPI_self = apic_send_IPI_self, + + .wakeup_cpu = NULL, + .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW, + .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH, + .wait_for_init_deassert = NULL, + .smp_callin_clear_local_apic = NULL, + .inquire_remote_apic = NULL, + + .read = native_apic_mem_read, + .write = native_apic_mem_write, + .icr_read = native_apic_icr_read, + .icr_write = native_apic_icr_write, + .wait_icr_idle = native_apic_wait_icr_idle, + .safe_wait_icr_idle = native_safe_apic_wait_icr_idle, +}; Index: linux-2.6-tip/arch/x86/kernel/apic/bigsmp_32.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/apic/bigsmp_32.c @@ -0,0 +1,274 @@ +/* + * APIC driver for "bigsmp" xAPIC machines with more than 8 virtual CPUs. + * + * Drives the local APIC in "clustered mode". + */ +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include + +static inline unsigned bigsmp_get_apic_id(unsigned long x) +{ + return (x >> 24) & 0xFF; +} + +static inline int bigsmp_apic_id_registered(void) +{ + return 1; +} + +static inline const cpumask_t *bigsmp_target_cpus(void) +{ +#ifdef CONFIG_SMP + return &cpu_online_map; +#else + return &cpumask_of_cpu(0); +#endif +} + +static inline unsigned long +bigsmp_check_apicid_used(physid_mask_t bitmap, int apicid) +{ + return 0; +} + +static inline unsigned long bigsmp_check_apicid_present(int bit) +{ + return 1; +} + +static inline unsigned long calculate_ldr(int cpu) +{ + unsigned long val, id; + + val = apic_read(APIC_LDR) & ~APIC_LDR_MASK; + id = per_cpu(x86_bios_cpu_apicid, cpu); + val |= SET_APIC_LOGICAL_ID(id); + + return val; +} + +/* + * Set up the logical destination ID. + * + * Intel recommends to set DFR, LDR and TPR before enabling + * an APIC. See e.g. "AP-388 82489DX User's Manual" (Intel + * document number 292116). So here it goes... + */ +static inline void bigsmp_init_apic_ldr(void) +{ + unsigned long val; + int cpu = smp_processor_id(); + + apic_write(APIC_DFR, APIC_DFR_FLAT); + val = calculate_ldr(cpu); + apic_write(APIC_LDR, val); +} + +static inline void bigsmp_setup_apic_routing(void) +{ + printk(KERN_INFO + "Enabling APIC mode: Physflat. Using %d I/O APICs\n", + nr_ioapics); +} + +static inline int bigsmp_apicid_to_node(int logical_apicid) +{ + return apicid_2_node[hard_smp_processor_id()]; +} + +static inline int bigsmp_cpu_present_to_apicid(int mps_cpu) +{ + if (mps_cpu < nr_cpu_ids) + return (int) per_cpu(x86_bios_cpu_apicid, mps_cpu); + + return BAD_APICID; +} + +static inline physid_mask_t bigsmp_apicid_to_cpu_present(int phys_apicid) +{ + return physid_mask_of_physid(phys_apicid); +} + +/* Mapping from cpu number to logical apicid */ +static inline int bigsmp_cpu_to_logical_apicid(int cpu) +{ + if (cpu >= nr_cpu_ids) + return BAD_APICID; + return cpu_physical_id(cpu); +} + +static inline physid_mask_t bigsmp_ioapic_phys_id_map(physid_mask_t phys_map) +{ + /* For clustered we don't have a good way to do this yet - hack */ + return physids_promote(0xFFL); +} + +static inline void bigsmp_setup_portio_remap(void) +{ +} + +static inline int bigsmp_check_phys_apicid_present(int boot_cpu_physical_apicid) +{ + return 1; +} + +/* As we are using single CPU as destination, pick only one CPU here */ +static inline unsigned int bigsmp_cpu_mask_to_apicid(const cpumask_t *cpumask) +{ + return bigsmp_cpu_to_logical_apicid(first_cpu(*cpumask)); +} + +static inline unsigned int +bigsmp_cpu_mask_to_apicid_and(const struct cpumask *cpumask, + const struct cpumask *andmask) +{ + int cpu; + + /* + * We're using fixed IRQ delivery, can only return one phys APIC ID. + * May as well be the first. + */ + for_each_cpu_and(cpu, cpumask, andmask) { + if (cpumask_test_cpu(cpu, cpu_online_mask)) + break; + } + if (cpu < nr_cpu_ids) + return bigsmp_cpu_to_logical_apicid(cpu); + + return BAD_APICID; +} + +static inline int bigsmp_phys_pkg_id(int cpuid_apic, int index_msb) +{ + return cpuid_apic >> index_msb; +} + +static inline void bigsmp_send_IPI_mask(const struct cpumask *mask, int vector) +{ + default_send_IPI_mask_sequence_phys(mask, vector); +} + +static inline void bigsmp_send_IPI_allbutself(int vector) +{ + default_send_IPI_mask_allbutself_phys(cpu_online_mask, vector); +} + +static inline void bigsmp_send_IPI_all(int vector) +{ + bigsmp_send_IPI_mask(cpu_online_mask, vector); +} + +static int dmi_bigsmp; /* can be set by dmi scanners */ + +static int hp_ht_bigsmp(const struct dmi_system_id *d) +{ + printk(KERN_NOTICE "%s detected: force use of apic=bigsmp\n", d->ident); + dmi_bigsmp = 1; + + return 0; +} + + +static const struct dmi_system_id bigsmp_dmi_table[] = { + { hp_ht_bigsmp, "HP ProLiant DL760 G2", + { DMI_MATCH(DMI_BIOS_VENDOR, "HP"), + DMI_MATCH(DMI_BIOS_VERSION, "P44-"), + } + }, + + { hp_ht_bigsmp, "HP ProLiant DL740", + { DMI_MATCH(DMI_BIOS_VENDOR, "HP"), + DMI_MATCH(DMI_BIOS_VERSION, "P47-"), + } + }, + { } /* NULL entry stops DMI scanning */ +}; + +static void bigsmp_vector_allocation_domain(int cpu, cpumask_t *retmask) +{ + cpus_clear(*retmask); + cpu_set(cpu, *retmask); +} + +static int probe_bigsmp(void) +{ + if (def_to_bigsmp) + dmi_bigsmp = 1; + else + dmi_check_system(bigsmp_dmi_table); + + return dmi_bigsmp; +} + +struct apic apic_bigsmp = { + + .name = "bigsmp", + .probe = probe_bigsmp, + .acpi_madt_oem_check = NULL, + .apic_id_registered = bigsmp_apic_id_registered, + + .irq_delivery_mode = dest_Fixed, + /* phys delivery to target CPU: */ + .irq_dest_mode = 0, + + .target_cpus = bigsmp_target_cpus, + .disable_esr = 1, + .dest_logical = 0, + .check_apicid_used = bigsmp_check_apicid_used, + .check_apicid_present = bigsmp_check_apicid_present, + + .vector_allocation_domain = bigsmp_vector_allocation_domain, + .init_apic_ldr = bigsmp_init_apic_ldr, + + .ioapic_phys_id_map = bigsmp_ioapic_phys_id_map, + .setup_apic_routing = bigsmp_setup_apic_routing, + .multi_timer_check = NULL, + .apicid_to_node = bigsmp_apicid_to_node, + .cpu_to_logical_apicid = bigsmp_cpu_to_logical_apicid, + .cpu_present_to_apicid = bigsmp_cpu_present_to_apicid, + .apicid_to_cpu_present = bigsmp_apicid_to_cpu_present, + .setup_portio_remap = NULL, + .check_phys_apicid_present = bigsmp_check_phys_apicid_present, + .enable_apic_mode = NULL, + .phys_pkg_id = bigsmp_phys_pkg_id, + .mps_oem_check = NULL, + + .get_apic_id = bigsmp_get_apic_id, + .set_apic_id = NULL, + .apic_id_mask = 0xFF << 24, + + .cpu_mask_to_apicid = bigsmp_cpu_mask_to_apicid, + .cpu_mask_to_apicid_and = bigsmp_cpu_mask_to_apicid_and, + + .send_IPI_mask = bigsmp_send_IPI_mask, + .send_IPI_mask_allbutself = NULL, + .send_IPI_allbutself = bigsmp_send_IPI_allbutself, + .send_IPI_all = bigsmp_send_IPI_all, + .send_IPI_self = default_send_IPI_self, + + .wakeup_cpu = NULL, + .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW, + .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH, + + .wait_for_init_deassert = default_wait_for_init_deassert, + + .smp_callin_clear_local_apic = NULL, + .inquire_remote_apic = default_inquire_remote_apic, + + .read = native_apic_mem_read, + .write = native_apic_mem_write, + .icr_read = native_apic_icr_read, + .icr_write = native_apic_icr_write, + .wait_icr_idle = native_apic_wait_icr_idle, + .safe_wait_icr_idle = native_safe_apic_wait_icr_idle, +}; Index: linux-2.6-tip/arch/x86/kernel/apic/es7000_32.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/apic/es7000_32.c @@ -0,0 +1,757 @@ +/* + * Written by: Garry Forsgren, Unisys Corporation + * Natalie Protasevich, Unisys Corporation + * + * This file contains the code to configure and interface + * with Unisys ES7000 series hardware system manager. + * + * Copyright (c) 2003 Unisys Corporation. + * Copyright (C) 2009, Red Hat, Inc., Ingo Molnar + * + * All Rights Reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2 of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + * + * You should have received a copy of the GNU General Public License along + * with this program; if not, write the Free Software Foundation, Inc., 59 + * Temple Place - Suite 330, Boston MA 02111-1307, USA. + * + * Contact information: Unisys Corporation, Township Line & Union Meeting + * Roads-A, Unisys Way, Blue Bell, Pennsylvania, 19424, or: + * + * http://www.unisys.com + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +/* + * ES7000 chipsets + */ + +#define NON_UNISYS 0 +#define ES7000_CLASSIC 1 +#define ES7000_ZORRO 2 + +#define MIP_REG 1 +#define MIP_PSAI_REG 4 + +#define MIP_BUSY 1 +#define MIP_SPIN 0xf0000 +#define MIP_VALID 0x0100000000000000ULL +#define MIP_SW_APIC 0x1020b + +#define MIP_PORT(val) ((val >> 32) & 0xffff) + +#define MIP_RD_LO(val) (val & 0xffffffff) + +struct mip_reg { + unsigned long long off_0x00; + unsigned long long off_0x08; + unsigned long long off_0x10; + unsigned long long off_0x18; + unsigned long long off_0x20; + unsigned long long off_0x28; + unsigned long long off_0x30; + unsigned long long off_0x38; +}; + +struct mip_reg_info { + unsigned long long mip_info; + unsigned long long delivery_info; + unsigned long long host_reg; + unsigned long long mip_reg; +}; + +struct psai { + unsigned long long entry_type; + unsigned long long addr; + unsigned long long bep_addr; +}; + +#ifdef CONFIG_ACPI + +struct es7000_oem_table { + struct acpi_table_header Header; + u32 OEMTableAddr; + u32 OEMTableSize; +}; + +static unsigned long oem_addrX; +static unsigned long oem_size; + +#endif + +/* + * ES7000 Globals + */ + +static volatile unsigned long *psai; +static struct mip_reg *mip_reg; +static struct mip_reg *host_reg; +static int mip_port; +static unsigned long mip_addr; +static unsigned long host_addr; + +int es7000_plat; + +/* + * GSI override for ES7000 platforms. + */ + +static unsigned int base; + +static int +es7000_rename_gsi(int ioapic, int gsi) +{ + if (es7000_plat == ES7000_ZORRO) + return gsi; + + if (!base) { + int i; + for (i = 0; i < nr_ioapics; i++) + base += nr_ioapic_registers[i]; + } + + if (!ioapic && (gsi < 16)) + gsi += base; + + return gsi; +} + +static int wakeup_secondary_cpu_via_mip(int cpu, unsigned long eip) +{ + unsigned long vect = 0, psaival = 0; + + if (psai == NULL) + return -1; + + vect = ((unsigned long)__pa(eip)/0x1000) << 16; + psaival = (0x1000000 | vect | cpu); + + while (*psai & 0x1000000) + ; + + *psai = psaival; + + return 0; +} + +static int __init es7000_update_apic(void) +{ + apic->wakeup_cpu = wakeup_secondary_cpu_via_mip; + + /* MPENTIUMIII */ + if (boot_cpu_data.x86 == 6 && + (boot_cpu_data.x86_model >= 7 || boot_cpu_data.x86_model <= 11)) { + es7000_update_apic_to_cluster(); + apic->wait_for_init_deassert = NULL; + apic->wakeup_cpu = wakeup_secondary_cpu_via_mip; + } + + return 0; +} + +static void __init setup_unisys(void) +{ + /* + * Determine the generation of the ES7000 currently running. + * + * es7000_plat = 1 if the machine is a 5xx ES7000 box + * es7000_plat = 2 if the machine is a x86_64 ES7000 box + * + */ + if (!(boot_cpu_data.x86 <= 15 && boot_cpu_data.x86_model <= 2)) + es7000_plat = ES7000_ZORRO; + else + es7000_plat = ES7000_CLASSIC; + ioapic_renumber_irq = es7000_rename_gsi; + + x86_quirks->update_apic = es7000_update_apic; +} + +/* + * Parse the OEM Table: + */ +static int __init parse_unisys_oem(char *oemptr) +{ + int i; + int success = 0; + unsigned char type, size; + unsigned long val; + char *tp = NULL; + struct psai *psaip = NULL; + struct mip_reg_info *mi; + struct mip_reg *host, *mip; + + tp = oemptr; + + tp += 8; + + for (i = 0; i <= 6; i++) { + type = *tp++; + size = *tp++; + tp -= 2; + switch (type) { + case MIP_REG: + mi = (struct mip_reg_info *)tp; + val = MIP_RD_LO(mi->host_reg); + host_addr = val; + host = (struct mip_reg *)val; + host_reg = __va(host); + val = MIP_RD_LO(mi->mip_reg); + mip_port = MIP_PORT(mi->mip_info); + mip_addr = val; + mip = (struct mip_reg *)val; + mip_reg = __va(mip); + pr_debug("es7000_mipcfg: host_reg = 0x%lx \n", + (unsigned long)host_reg); + pr_debug("es7000_mipcfg: mip_reg = 0x%lx \n", + (unsigned long)mip_reg); + success++; + break; + case MIP_PSAI_REG: + psaip = (struct psai *)tp; + if (tp != NULL) { + if (psaip->addr) + psai = __va(psaip->addr); + else + psai = NULL; + success++; + } + break; + default: + break; + } + tp += size; + } + + if (success < 2) + es7000_plat = NON_UNISYS; + else + setup_unisys(); + + return es7000_plat; +} + +#ifdef CONFIG_ACPI +static int __init find_unisys_acpi_oem_table(unsigned long *oem_addr) +{ + struct acpi_table_header *header = NULL; + struct es7000_oem_table *table; + acpi_size tbl_size; + acpi_status ret; + int i = 0; + + for (;;) { + ret = acpi_get_table_with_size("OEM1", i++, &header, &tbl_size); + if (!ACPI_SUCCESS(ret)) + return -1; + + if (!memcmp((char *) &header->oem_id, "UNISYS", 6)) + break; + + early_acpi_os_unmap_memory(header, tbl_size); + } + + table = (void *)header; + + oem_addrX = table->OEMTableAddr; + oem_size = table->OEMTableSize; + + early_acpi_os_unmap_memory(header, tbl_size); + + *oem_addr = (unsigned long)__acpi_map_table(oem_addrX, oem_size); + + return 0; +} + +static void __init unmap_unisys_acpi_oem_table(unsigned long oem_addr) +{ + if (!oem_addr) + return; + + __acpi_unmap_table((char *)oem_addr, oem_size); +} + +static int es7000_check_dsdt(void) +{ + struct acpi_table_header header; + + if (ACPI_SUCCESS(acpi_get_table_header(ACPI_SIG_DSDT, 0, &header)) && + !strncmp(header.oem_id, "UNISYS", 6)) + return 1; + return 0; +} + +/* Hook from generic ACPI tables.c */ +static int __init es7000_acpi_madt_oem_check(char *oem_id, char *oem_table_id) +{ + unsigned long oem_addr = 0; + int check_dsdt; + int ret = 0; + + /* check dsdt at first to avoid clear fix_map for oem_addr */ + check_dsdt = es7000_check_dsdt(); + + if (!find_unisys_acpi_oem_table(&oem_addr)) { + if (check_dsdt) { + ret = parse_unisys_oem((char *)oem_addr); + } else { + setup_unisys(); + ret = 1; + } + /* + * we need to unmap it + */ + unmap_unisys_acpi_oem_table(oem_addr); + } + return ret; +} +#else /* !CONFIG_ACPI: */ +static int __init es7000_acpi_madt_oem_check(char *oem_id, char *oem_table_id) +{ + return 0; +} +#endif /* !CONFIG_ACPI */ + +static void es7000_spin(int n) +{ + int i = 0; + + while (i++ < n) + rep_nop(); +} + +static int __init +es7000_mip_write(struct mip_reg *mip_reg) +{ + int status = 0; + int spin; + + spin = MIP_SPIN; + while ((host_reg->off_0x38 & MIP_VALID) != 0) { + if (--spin <= 0) { + WARN(1, "Timeout waiting for Host Valid Flag\n"); + return -1; + } + es7000_spin(MIP_SPIN); + } + + memcpy(host_reg, mip_reg, sizeof(struct mip_reg)); + outb(1, mip_port); + + spin = MIP_SPIN; + + while ((mip_reg->off_0x38 & MIP_VALID) == 0) { + if (--spin <= 0) { + WARN(1, "Timeout waiting for MIP Valid Flag\n"); + return -1; + } + es7000_spin(MIP_SPIN); + } + + status = (mip_reg->off_0x00 & 0xffff0000000000ULL) >> 48; + mip_reg->off_0x38 &= ~MIP_VALID; + + return status; +} + +static void __init es7000_enable_apic_mode(void) +{ + struct mip_reg es7000_mip_reg; + int mip_status; + + if (!es7000_plat) + return; + + printk(KERN_INFO "ES7000: Enabling APIC mode.\n"); + memset(&es7000_mip_reg, 0, sizeof(struct mip_reg)); + es7000_mip_reg.off_0x00 = MIP_SW_APIC; + es7000_mip_reg.off_0x38 = MIP_VALID; + + while ((mip_status = es7000_mip_write(&es7000_mip_reg)) != 0) + WARN(1, "Command failed, status = %x\n", mip_status); +} + +static void es7000_vector_allocation_domain(int cpu, cpumask_t *retmask) +{ + /* Careful. Some cpus do not strictly honor the set of cpus + * specified in the interrupt destination when using lowest + * priority interrupt delivery mode. + * + * In particular there was a hyperthreading cpu observed to + * deliver interrupts to the wrong hyperthread when only one + * hyperthread was specified in the interrupt desitination. + */ + *retmask = (cpumask_t){ { [0] = APIC_ALL_CPUS, } }; +} + + +static void es7000_wait_for_init_deassert(atomic_t *deassert) +{ +#ifndef CONFIG_ES7000_CLUSTERED_APIC + while (!atomic_read(deassert)) + cpu_relax(); +#endif + return; +} + +static unsigned int es7000_get_apic_id(unsigned long x) +{ + return (x >> 24) & 0xFF; +} + +static void es7000_send_IPI_mask(const struct cpumask *mask, int vector) +{ + default_send_IPI_mask_sequence_phys(mask, vector); +} + +static void es7000_send_IPI_allbutself(int vector) +{ + default_send_IPI_mask_allbutself_phys(cpu_online_mask, vector); +} + +static void es7000_send_IPI_all(int vector) +{ + es7000_send_IPI_mask(cpu_online_mask, vector); +} + +static int es7000_apic_id_registered(void) +{ + return 1; +} + +static const cpumask_t *target_cpus_cluster(void) +{ + return &CPU_MASK_ALL; +} + +static const cpumask_t *es7000_target_cpus(void) +{ + return &cpumask_of_cpu(smp_processor_id()); +} + +static unsigned long +es7000_check_apicid_used(physid_mask_t bitmap, int apicid) +{ + return 0; +} +static unsigned long es7000_check_apicid_present(int bit) +{ + return physid_isset(bit, phys_cpu_present_map); +} + +static unsigned long calculate_ldr(int cpu) +{ + unsigned long id = per_cpu(x86_bios_cpu_apicid, cpu); + + return SET_APIC_LOGICAL_ID(id); +} + +/* + * Set up the logical destination ID. + * + * Intel recommends to set DFR, LdR and TPR before enabling + * an APIC. See e.g. "AP-388 82489DX User's Manual" (Intel + * document number 292116). So here it goes... + */ +static void es7000_init_apic_ldr_cluster(void) +{ + unsigned long val; + int cpu = smp_processor_id(); + + apic_write(APIC_DFR, APIC_DFR_CLUSTER); + val = calculate_ldr(cpu); + apic_write(APIC_LDR, val); +} + +static void es7000_init_apic_ldr(void) +{ + unsigned long val; + int cpu = smp_processor_id(); + + apic_write(APIC_DFR, APIC_DFR_FLAT); + val = calculate_ldr(cpu); + apic_write(APIC_LDR, val); +} + +static void es7000_setup_apic_routing(void) +{ + int apic = per_cpu(x86_bios_cpu_apicid, smp_processor_id()); + + printk(KERN_INFO + "Enabling APIC mode: %s. Using %d I/O APICs, target cpus %lx\n", + (apic_version[apic] == 0x14) ? + "Physical Cluster" : "Logical Cluster", + nr_ioapics, cpus_addr(*es7000_target_cpus())[0]); +} + +static int es7000_apicid_to_node(int logical_apicid) +{ + return 0; +} + + +static int es7000_cpu_present_to_apicid(int mps_cpu) +{ + if (!mps_cpu) + return boot_cpu_physical_apicid; + else if (mps_cpu < nr_cpu_ids) + return per_cpu(x86_bios_cpu_apicid, mps_cpu); + else + return BAD_APICID; +} + +static int cpu_id; + +static physid_mask_t es7000_apicid_to_cpu_present(int phys_apicid) +{ + physid_mask_t mask; + + mask = physid_mask_of_physid(cpu_id); + ++cpu_id; + + return mask; +} + +/* Mapping from cpu number to logical apicid */ +static int es7000_cpu_to_logical_apicid(int cpu) +{ +#ifdef CONFIG_SMP + if (cpu >= nr_cpu_ids) + return BAD_APICID; + return cpu_2_logical_apicid[cpu]; +#else + return logical_smp_processor_id(); +#endif +} + +static physid_mask_t es7000_ioapic_phys_id_map(physid_mask_t phys_map) +{ + /* For clustered we don't have a good way to do this yet - hack */ + return physids_promote(0xff); +} + +static int es7000_check_phys_apicid_present(int cpu_physical_apicid) +{ + boot_cpu_physical_apicid = read_apic_id(); + return 1; +} + +static unsigned int +es7000_cpu_mask_to_apicid_cluster(const struct cpumask *cpumask) +{ + int cpus_found = 0; + int num_bits_set; + int apicid; + int cpu; + + num_bits_set = cpumask_weight(cpumask); + /* Return id to all */ + if (num_bits_set == nr_cpu_ids) + return 0xFF; + /* + * The cpus in the mask must all be on the apic cluster. If are not + * on the same apicid cluster return default value of target_cpus(): + */ + cpu = cpumask_first(cpumask); + apicid = es7000_cpu_to_logical_apicid(cpu); + + while (cpus_found < num_bits_set) { + if (cpumask_test_cpu(cpu, cpumask)) { + int new_apicid = es7000_cpu_to_logical_apicid(cpu); + + if (APIC_CLUSTER(apicid) != APIC_CLUSTER(new_apicid)) { + WARN(1, "Not a valid mask!"); + + return 0xFF; + } + apicid = new_apicid; + cpus_found++; + } + cpu++; + } + return apicid; +} + +static unsigned int es7000_cpu_mask_to_apicid(const cpumask_t *cpumask) +{ + int cpus_found = 0; + int num_bits_set; + int apicid; + int cpu; + + num_bits_set = cpus_weight(*cpumask); + /* Return id to all */ + if (num_bits_set == nr_cpu_ids) + return es7000_cpu_to_logical_apicid(0); + /* + * The cpus in the mask must all be on the apic cluster. If are not + * on the same apicid cluster return default value of target_cpus(): + */ + cpu = first_cpu(*cpumask); + apicid = es7000_cpu_to_logical_apicid(cpu); + while (cpus_found < num_bits_set) { + if (cpu_isset(cpu, *cpumask)) { + int new_apicid = es7000_cpu_to_logical_apicid(cpu); + + if (APIC_CLUSTER(apicid) != APIC_CLUSTER(new_apicid)) { + printk("%s: Not a valid mask!\n", __func__); + + return es7000_cpu_to_logical_apicid(0); + } + apicid = new_apicid; + cpus_found++; + } + cpu++; + } + return apicid; +} + +static unsigned int +es7000_cpu_mask_to_apicid_and(const struct cpumask *inmask, + const struct cpumask *andmask) +{ + int apicid = es7000_cpu_to_logical_apicid(0); + cpumask_var_t cpumask; + + if (!alloc_cpumask_var(&cpumask, GFP_ATOMIC)) + return apicid; + + cpumask_and(cpumask, inmask, andmask); + cpumask_and(cpumask, cpumask, cpu_online_mask); + apicid = es7000_cpu_mask_to_apicid(cpumask); + + free_cpumask_var(cpumask); + + return apicid; +} + +static int es7000_phys_pkg_id(int cpuid_apic, int index_msb) +{ + return cpuid_apic >> index_msb; +} + +void __init es7000_update_apic_to_cluster(void) +{ + apic->target_cpus = target_cpus_cluster; + apic->irq_delivery_mode = dest_LowestPrio; + /* logical delivery broadcast to all procs: */ + apic->irq_dest_mode = 1; + + apic->init_apic_ldr = es7000_init_apic_ldr_cluster; + + apic->cpu_mask_to_apicid = es7000_cpu_mask_to_apicid_cluster; +} + +static int probe_es7000(void) +{ + /* probed later in mptable/ACPI hooks */ + return 0; +} + +static __init int +es7000_mps_oem_check(struct mpc_table *mpc, char *oem, char *productid) +{ + if (mpc->oemptr) { + struct mpc_oemtable *oem_table = + (struct mpc_oemtable *)mpc->oemptr; + + if (!strncmp(oem, "UNISYS", 6)) + return parse_unisys_oem((char *)oem_table); + } + return 0; +} + + +struct apic apic_es7000 = { + + .name = "es7000", + .probe = probe_es7000, + .acpi_madt_oem_check = es7000_acpi_madt_oem_check, + .apic_id_registered = es7000_apic_id_registered, + + .irq_delivery_mode = dest_Fixed, + /* phys delivery to target CPUs: */ + .irq_dest_mode = 0, + + .target_cpus = es7000_target_cpus, + .disable_esr = 1, + .dest_logical = 0, + .check_apicid_used = es7000_check_apicid_used, + .check_apicid_present = es7000_check_apicid_present, + + .vector_allocation_domain = es7000_vector_allocation_domain, + .init_apic_ldr = es7000_init_apic_ldr, + + .ioapic_phys_id_map = es7000_ioapic_phys_id_map, + .setup_apic_routing = es7000_setup_apic_routing, + .multi_timer_check = NULL, + .apicid_to_node = es7000_apicid_to_node, + .cpu_to_logical_apicid = es7000_cpu_to_logical_apicid, + .cpu_present_to_apicid = es7000_cpu_present_to_apicid, + .apicid_to_cpu_present = es7000_apicid_to_cpu_present, + .setup_portio_remap = NULL, + .check_phys_apicid_present = es7000_check_phys_apicid_present, + .enable_apic_mode = es7000_enable_apic_mode, + .phys_pkg_id = es7000_phys_pkg_id, + .mps_oem_check = es7000_mps_oem_check, + + .get_apic_id = es7000_get_apic_id, + .set_apic_id = NULL, + .apic_id_mask = 0xFF << 24, + + .cpu_mask_to_apicid = es7000_cpu_mask_to_apicid, + .cpu_mask_to_apicid_and = es7000_cpu_mask_to_apicid_and, + + .send_IPI_mask = es7000_send_IPI_mask, + .send_IPI_mask_allbutself = NULL, + .send_IPI_allbutself = es7000_send_IPI_allbutself, + .send_IPI_all = es7000_send_IPI_all, + .send_IPI_self = default_send_IPI_self, + + .wakeup_cpu = NULL, + + .trampoline_phys_low = 0x467, + .trampoline_phys_high = 0x469, + + .wait_for_init_deassert = es7000_wait_for_init_deassert, + + /* Nothing to do for most platforms, since cleared by the INIT cycle: */ + .smp_callin_clear_local_apic = NULL, + .inquire_remote_apic = default_inquire_remote_apic, + + .read = native_apic_mem_read, + .write = native_apic_mem_write, + .icr_read = native_apic_icr_read, + .icr_write = native_apic_icr_write, + .wait_icr_idle = native_apic_wait_icr_idle, + .safe_wait_icr_idle = native_safe_apic_wait_icr_idle, +}; Index: linux-2.6-tip/arch/x86/kernel/apic/io_apic.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/apic/io_apic.c @@ -0,0 +1,4172 @@ +/* + * Intel IO-APIC support for multi-Pentium hosts. + * + * Copyright (C) 1997, 1998, 1999, 2000, 2009 Ingo Molnar, Hajnalka Szabo + * + * Many thanks to Stig Venaas for trying out countless experimental + * patches and reporting/debugging problems patiently! + * + * (c) 1999, Multiple IO-APIC support, developed by + * Ken-ichi Yaku and + * Hidemi Kishimoto , + * further tested and cleaned up by Zach Brown + * and Ingo Molnar + * + * Fixes + * Maciej W. Rozycki : Bits for genuine 82489DX APICs; + * thanks to Eric Gilmore + * and Rolf G. Tews + * for testing these extensively + * Paul Diefenbaugh : Added full ACPI support + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include /* time_after() */ +#ifdef CONFIG_ACPI +#include +#endif +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#define __apicdebuginit(type) static type __init + +/* + * Is the SiS APIC rmw bug present ? + * -1 = don't know, 0 = no, 1 = yes + */ +int sis_apic_bug = -1; + +static DEFINE_RAW_SPINLOCK(ioapic_lock); +static DEFINE_RAW_SPINLOCK(vector_lock); + +/* + * # of IRQ routing registers + */ +int nr_ioapic_registers[MAX_IO_APICS]; + +/* I/O APIC entries */ +struct mpc_ioapic mp_ioapics[MAX_IO_APICS]; +int nr_ioapics; + +/* MP IRQ source entries */ +struct mpc_intsrc mp_irqs[MAX_IRQ_SOURCES]; + +/* # of MP IRQ source entries */ +int mp_irq_entries; + +#if defined (CONFIG_MCA) || defined (CONFIG_EISA) +int mp_bus_id_to_type[MAX_MP_BUSSES]; +#endif + +DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES); + +int skip_ioapic_setup; + +void arch_disable_smp_support(void) +{ +#ifdef CONFIG_PCI + noioapicquirk = 1; + noioapicreroute = -1; +#endif + skip_ioapic_setup = 1; +} + +static int __init parse_noapic(char *str) +{ + /* disable IO-APIC */ + arch_disable_smp_support(); + return 0; +} +early_param("noapic", parse_noapic); + +struct irq_pin_list; + +/* + * This is performance-critical, we want to do it O(1) + * + * the indexing order of this array favors 1:1 mappings + * between pins and IRQs. + */ + +struct irq_pin_list { + int apic, pin; + struct irq_pin_list *next; +}; + +static struct irq_pin_list *get_one_free_irq_2_pin(int cpu) +{ + struct irq_pin_list *pin; + int node; + + node = cpu_to_node(cpu); + + pin = kzalloc_node(sizeof(*pin), GFP_ATOMIC, node); + + return pin; +} + +struct irq_cfg { + struct irq_pin_list *irq_2_pin; + cpumask_var_t domain; + cpumask_var_t old_domain; + unsigned move_cleanup_count; + u8 vector; + u8 move_in_progress : 1; +#ifdef CONFIG_NUMA_MIGRATE_IRQ_DESC + u8 move_desc_pending : 1; +#endif +}; + +/* irq_cfg is indexed by the sum of all RTEs in all I/O APICs. */ +#ifdef CONFIG_SPARSE_IRQ +static struct irq_cfg irq_cfgx[] = { +#else +static struct irq_cfg irq_cfgx[NR_IRQS] = { +#endif + [0] = { .vector = IRQ0_VECTOR, }, + [1] = { .vector = IRQ1_VECTOR, }, + [2] = { .vector = IRQ2_VECTOR, }, + [3] = { .vector = IRQ3_VECTOR, }, + [4] = { .vector = IRQ4_VECTOR, }, + [5] = { .vector = IRQ5_VECTOR, }, + [6] = { .vector = IRQ6_VECTOR, }, + [7] = { .vector = IRQ7_VECTOR, }, + [8] = { .vector = IRQ8_VECTOR, }, + [9] = { .vector = IRQ9_VECTOR, }, + [10] = { .vector = IRQ10_VECTOR, }, + [11] = { .vector = IRQ11_VECTOR, }, + [12] = { .vector = IRQ12_VECTOR, }, + [13] = { .vector = IRQ13_VECTOR, }, + [14] = { .vector = IRQ14_VECTOR, }, + [15] = { .vector = IRQ15_VECTOR, }, +}; + +int __init arch_early_irq_init(void) +{ + struct irq_cfg *cfg; + struct irq_desc *desc; + int count; + int i; + + cfg = irq_cfgx; + count = ARRAY_SIZE(irq_cfgx); + + for (i = 0; i < count; i++) { + desc = irq_to_desc(i); + desc->chip_data = &cfg[i]; + alloc_bootmem_cpumask_var(&cfg[i].domain); + alloc_bootmem_cpumask_var(&cfg[i].old_domain); + if (i < NR_IRQS_LEGACY) + cpumask_setall(cfg[i].domain); + } + + return 0; +} + +#ifdef CONFIG_SPARSE_IRQ +static struct irq_cfg *irq_cfg(unsigned int irq) +{ + struct irq_cfg *cfg = NULL; + struct irq_desc *desc; + + desc = irq_to_desc(irq); + if (desc) + cfg = desc->chip_data; + + return cfg; +} + +static struct irq_cfg *get_one_free_irq_cfg(int cpu) +{ + struct irq_cfg *cfg; + int node; + + node = cpu_to_node(cpu); + + cfg = kzalloc_node(sizeof(*cfg), GFP_ATOMIC, node); + if (cfg) { + if (!alloc_cpumask_var_node(&cfg->domain, GFP_ATOMIC, node)) { + kfree(cfg); + cfg = NULL; + } else if (!alloc_cpumask_var_node(&cfg->old_domain, + GFP_ATOMIC, node)) { + free_cpumask_var(cfg->domain); + kfree(cfg); + cfg = NULL; + } else { + cpumask_clear(cfg->domain); + cpumask_clear(cfg->old_domain); + } + } + + return cfg; +} + +int arch_init_chip_data(struct irq_desc *desc, int cpu) +{ + struct irq_cfg *cfg; + + cfg = desc->chip_data; + if (!cfg) { + desc->chip_data = get_one_free_irq_cfg(cpu); + if (!desc->chip_data) { + printk(KERN_ERR "can not alloc irq_cfg\n"); + BUG_ON(1); + } + } + + return 0; +} + +#ifdef CONFIG_NUMA_MIGRATE_IRQ_DESC + +static void +init_copy_irq_2_pin(struct irq_cfg *old_cfg, struct irq_cfg *cfg, int cpu) +{ + struct irq_pin_list *old_entry, *head, *tail, *entry; + + cfg->irq_2_pin = NULL; + old_entry = old_cfg->irq_2_pin; + if (!old_entry) + return; + + entry = get_one_free_irq_2_pin(cpu); + if (!entry) + return; + + entry->apic = old_entry->apic; + entry->pin = old_entry->pin; + head = entry; + tail = entry; + old_entry = old_entry->next; + while (old_entry) { + entry = get_one_free_irq_2_pin(cpu); + if (!entry) { + entry = head; + while (entry) { + head = entry->next; + kfree(entry); + entry = head; + } + /* still use the old one */ + return; + } + entry->apic = old_entry->apic; + entry->pin = old_entry->pin; + tail->next = entry; + tail = entry; + old_entry = old_entry->next; + } + + tail->next = NULL; + cfg->irq_2_pin = head; +} + +static void free_irq_2_pin(struct irq_cfg *old_cfg, struct irq_cfg *cfg) +{ + struct irq_pin_list *entry, *next; + + if (old_cfg->irq_2_pin == cfg->irq_2_pin) + return; + + entry = old_cfg->irq_2_pin; + + while (entry) { + next = entry->next; + kfree(entry); + entry = next; + } + old_cfg->irq_2_pin = NULL; +} + +void arch_init_copy_chip_data(struct irq_desc *old_desc, + struct irq_desc *desc, int cpu) +{ + struct irq_cfg *cfg; + struct irq_cfg *old_cfg; + + cfg = get_one_free_irq_cfg(cpu); + + if (!cfg) + return; + + desc->chip_data = cfg; + + old_cfg = old_desc->chip_data; + + memcpy(cfg, old_cfg, sizeof(struct irq_cfg)); + + init_copy_irq_2_pin(old_cfg, cfg, cpu); +} + +static void free_irq_cfg(struct irq_cfg *old_cfg) +{ + kfree(old_cfg); +} + +void arch_free_chip_data(struct irq_desc *old_desc, struct irq_desc *desc) +{ + struct irq_cfg *old_cfg, *cfg; + + old_cfg = old_desc->chip_data; + cfg = desc->chip_data; + + if (old_cfg == cfg) + return; + + if (old_cfg) { + free_irq_2_pin(old_cfg, cfg); + free_irq_cfg(old_cfg); + old_desc->chip_data = NULL; + } +} + +static void +set_extra_move_desc(struct irq_desc *desc, const struct cpumask *mask) +{ + struct irq_cfg *cfg = desc->chip_data; + + if (!cfg->move_in_progress) { + /* it means that domain is not changed */ + if (!cpumask_intersects(desc->affinity, mask)) + cfg->move_desc_pending = 1; + } +} +#endif + +#else +static struct irq_cfg *irq_cfg(unsigned int irq) +{ + return irq < nr_irqs ? irq_cfgx + irq : NULL; +} + +#endif + +#ifndef CONFIG_NUMA_MIGRATE_IRQ_DESC +static inline void +set_extra_move_desc(struct irq_desc *desc, const struct cpumask *mask) +{ +} +#endif + +struct io_apic { + unsigned int index; + unsigned int unused[3]; + unsigned int data; +}; + +static __attribute_const__ struct io_apic __iomem *io_apic_base(int idx) +{ + return (void __iomem *) __fix_to_virt(FIX_IO_APIC_BASE_0 + idx) + + (mp_ioapics[idx].apicaddr & ~PAGE_MASK); +} + +static inline unsigned int io_apic_read(unsigned int apic, unsigned int reg) +{ + struct io_apic __iomem *io_apic = io_apic_base(apic); + writel(reg, &io_apic->index); + return readl(&io_apic->data); +} + +static inline void io_apic_write(unsigned int apic, unsigned int reg, unsigned int value) +{ + struct io_apic __iomem *io_apic = io_apic_base(apic); + writel(reg, &io_apic->index); + writel(value, &io_apic->data); +} + +/* + * Re-write a value: to be used for read-modify-write + * cycles where the read already set up the index register. + * + * Older SiS APIC requires we rewrite the index register + */ +static inline void io_apic_modify(unsigned int apic, unsigned int reg, unsigned int value) +{ + struct io_apic __iomem *io_apic = io_apic_base(apic); + + if (sis_apic_bug) + writel(reg, &io_apic->index); + writel(value, &io_apic->data); +} + +static bool io_apic_level_ack_pending(struct irq_cfg *cfg) +{ + struct irq_pin_list *entry; + unsigned long flags; + + spin_lock_irqsave(&ioapic_lock, flags); + entry = cfg->irq_2_pin; + for (;;) { + unsigned int reg; + int pin; + + if (!entry) + break; + pin = entry->pin; + reg = io_apic_read(entry->apic, 0x10 + pin*2); + /* Is the remote IRR bit set? */ + if (reg & IO_APIC_REDIR_REMOTE_IRR) { + spin_unlock_irqrestore(&ioapic_lock, flags); + return true; + } + if (!entry->next) + break; + entry = entry->next; + } + spin_unlock_irqrestore(&ioapic_lock, flags); + + return false; +} + +union entry_union { + struct { u32 w1, w2; }; + struct IO_APIC_route_entry entry; +}; + +static struct IO_APIC_route_entry ioapic_read_entry(int apic, int pin) +{ + union entry_union eu; + unsigned long flags; + spin_lock_irqsave(&ioapic_lock, flags); + eu.w1 = io_apic_read(apic, 0x10 + 2 * pin); + eu.w2 = io_apic_read(apic, 0x11 + 2 * pin); + spin_unlock_irqrestore(&ioapic_lock, flags); + return eu.entry; +} + +/* + * When we write a new IO APIC routing entry, we need to write the high + * word first! If the mask bit in the low word is clear, we will enable + * the interrupt, and we need to make sure the entry is fully populated + * before that happens. + */ +static void +__ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e) +{ + union entry_union eu; + eu.entry = e; + io_apic_write(apic, 0x11 + 2*pin, eu.w2); + io_apic_write(apic, 0x10 + 2*pin, eu.w1); +} + +void ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e) +{ + unsigned long flags; + spin_lock_irqsave(&ioapic_lock, flags); + __ioapic_write_entry(apic, pin, e); + spin_unlock_irqrestore(&ioapic_lock, flags); +} + +/* + * When we mask an IO APIC routing entry, we need to write the low + * word first, in order to set the mask bit before we change the + * high bits! + */ +static void ioapic_mask_entry(int apic, int pin) +{ + unsigned long flags; + union entry_union eu = { .entry.mask = 1 }; + + spin_lock_irqsave(&ioapic_lock, flags); + io_apic_write(apic, 0x10 + 2*pin, eu.w1); + io_apic_write(apic, 0x11 + 2*pin, eu.w2); + spin_unlock_irqrestore(&ioapic_lock, flags); +} + +#ifdef CONFIG_SMP +static void send_cleanup_vector(struct irq_cfg *cfg) +{ + cpumask_var_t cleanup_mask; + + if (unlikely(!alloc_cpumask_var(&cleanup_mask, GFP_ATOMIC))) { + unsigned int i; + cfg->move_cleanup_count = 0; + for_each_cpu_and(i, cfg->old_domain, cpu_online_mask) + cfg->move_cleanup_count++; + for_each_cpu_and(i, cfg->old_domain, cpu_online_mask) + apic->send_IPI_mask(cpumask_of(i), IRQ_MOVE_CLEANUP_VECTOR); + } else { + cpumask_and(cleanup_mask, cfg->old_domain, cpu_online_mask); + cfg->move_cleanup_count = cpumask_weight(cleanup_mask); + apic->send_IPI_mask(cleanup_mask, IRQ_MOVE_CLEANUP_VECTOR); + free_cpumask_var(cleanup_mask); + } + cfg->move_in_progress = 0; +} + +static void __target_IO_APIC_irq(unsigned int irq, unsigned int dest, struct irq_cfg *cfg) +{ + int apic, pin; + struct irq_pin_list *entry; + u8 vector = cfg->vector; + + entry = cfg->irq_2_pin; + for (;;) { + unsigned int reg; + + if (!entry) + break; + + apic = entry->apic; + pin = entry->pin; +#ifdef CONFIG_INTR_REMAP + /* + * With interrupt-remapping, destination information comes + * from interrupt-remapping table entry. + */ + if (!irq_remapped(irq)) + io_apic_write(apic, 0x11 + pin*2, dest); +#else + io_apic_write(apic, 0x11 + pin*2, dest); +#endif + reg = io_apic_read(apic, 0x10 + pin*2); + reg &= ~IO_APIC_REDIR_VECTOR_MASK; + reg |= vector; + io_apic_modify(apic, 0x10 + pin*2, reg); + if (!entry->next) + break; + entry = entry->next; + } +} + +static int +assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask); + +/* + * Either sets desc->affinity to a valid value, and returns + * ->cpu_mask_to_apicid of that, or returns BAD_APICID and + * leaves desc->affinity untouched. + */ +static unsigned int +set_desc_affinity(struct irq_desc *desc, const struct cpumask *mask) +{ + struct irq_cfg *cfg; + unsigned int irq; + + if (!cpumask_intersects(mask, cpu_online_mask)) + return BAD_APICID; + + irq = desc->irq; + cfg = desc->chip_data; + if (assign_irq_vector(irq, cfg, mask)) + return BAD_APICID; + + cpumask_and(desc->affinity, cfg->domain, mask); + set_extra_move_desc(desc, mask); + + return apic->cpu_mask_to_apicid_and(desc->affinity, cpu_online_mask); +} + +static void +set_ioapic_affinity_irq_desc(struct irq_desc *desc, const struct cpumask *mask) +{ + struct irq_cfg *cfg; + unsigned long flags; + unsigned int dest; + unsigned int irq; + + irq = desc->irq; + cfg = desc->chip_data; + + spin_lock_irqsave(&ioapic_lock, flags); + dest = set_desc_affinity(desc, mask); + if (dest != BAD_APICID) { + /* Only the high 8 bits are valid. */ + dest = SET_APIC_LOGICAL_ID(dest); + __target_IO_APIC_irq(irq, dest, cfg); + } + spin_unlock_irqrestore(&ioapic_lock, flags); +} + +static void +set_ioapic_affinity_irq(unsigned int irq, const struct cpumask *mask) +{ + struct irq_desc *desc; + + desc = irq_to_desc(irq); + + set_ioapic_affinity_irq_desc(desc, mask); +} +#endif /* CONFIG_SMP */ + +/* + * The common case is 1:1 IRQ<->pin mappings. Sometimes there are + * shared ISA-space IRQs, so we have to support them. We are super + * fast in the common case, and fast for shared ISA-space IRQs. + */ +static void add_pin_to_irq_cpu(struct irq_cfg *cfg, int cpu, int apic, int pin) +{ + struct irq_pin_list *entry; + + entry = cfg->irq_2_pin; + if (!entry) { + entry = get_one_free_irq_2_pin(cpu); + if (!entry) { + printk(KERN_ERR "can not alloc irq_2_pin to add %d - %d\n", + apic, pin); + return; + } + cfg->irq_2_pin = entry; + entry->apic = apic; + entry->pin = pin; + return; + } + + while (entry->next) { + /* not again, please */ + if (entry->apic == apic && entry->pin == pin) + return; + + entry = entry->next; + } + + entry->next = get_one_free_irq_2_pin(cpu); + entry = entry->next; + entry->apic = apic; + entry->pin = pin; +} + +/* + * Reroute an IRQ to a different pin. + */ +static void __init replace_pin_at_irq_cpu(struct irq_cfg *cfg, int cpu, + int oldapic, int oldpin, + int newapic, int newpin) +{ + struct irq_pin_list *entry = cfg->irq_2_pin; + int replaced = 0; + + while (entry) { + if (entry->apic == oldapic && entry->pin == oldpin) { + entry->apic = newapic; + entry->pin = newpin; + replaced = 1; + /* every one is different, right? */ + break; + } + entry = entry->next; + } + + /* why? call replace before add? */ + if (!replaced) + add_pin_to_irq_cpu(cfg, cpu, newapic, newpin); +} + +static inline void io_apic_modify_irq(struct irq_cfg *cfg, + int mask_and, int mask_or, + void (*final)(struct irq_pin_list *entry)) +{ + int pin; + struct irq_pin_list *entry; + + for (entry = cfg->irq_2_pin; entry != NULL; entry = entry->next) { + unsigned int reg; + pin = entry->pin; + reg = io_apic_read(entry->apic, 0x10 + pin * 2); + reg &= mask_and; + reg |= mask_or; + io_apic_modify(entry->apic, 0x10 + pin * 2, reg); + if (final) + final(entry); + } +} + +static void __unmask_IO_APIC_irq(struct irq_cfg *cfg) +{ + io_apic_modify_irq(cfg, ~IO_APIC_REDIR_MASKED, 0, NULL); +} + +#ifdef CONFIG_X86_64 +static void io_apic_sync(struct irq_pin_list *entry) +{ + /* + * Synchronize the IO-APIC and the CPU by doing + * a dummy read from the IO-APIC + */ + struct io_apic __iomem *io_apic; + io_apic = io_apic_base(entry->apic); + readl(&io_apic->data); +} + +static void __mask_IO_APIC_irq(struct irq_cfg *cfg) +{ + io_apic_modify_irq(cfg, ~0, IO_APIC_REDIR_MASKED, &io_apic_sync); +} +#else /* CONFIG_X86_32 */ +static void __mask_IO_APIC_irq(struct irq_cfg *cfg) +{ + io_apic_modify_irq(cfg, ~0, IO_APIC_REDIR_MASKED, NULL); +} + +static void __mask_and_edge_IO_APIC_irq(struct irq_cfg *cfg) +{ + io_apic_modify_irq(cfg, ~IO_APIC_REDIR_LEVEL_TRIGGER, + IO_APIC_REDIR_MASKED, NULL); +} + +static void __unmask_and_level_IO_APIC_irq(struct irq_cfg *cfg) +{ + io_apic_modify_irq(cfg, ~IO_APIC_REDIR_MASKED, + IO_APIC_REDIR_LEVEL_TRIGGER, NULL); +} +#endif /* CONFIG_X86_32 */ + +static void mask_IO_APIC_irq_desc(struct irq_desc *desc) +{ + struct irq_cfg *cfg = desc->chip_data; + unsigned long flags; + + BUG_ON(!cfg); + + spin_lock_irqsave(&ioapic_lock, flags); + __mask_IO_APIC_irq(cfg); + spin_unlock_irqrestore(&ioapic_lock, flags); +} + +static void unmask_IO_APIC_irq_desc(struct irq_desc *desc) +{ + struct irq_cfg *cfg = desc->chip_data; + unsigned long flags; + + spin_lock_irqsave(&ioapic_lock, flags); + __unmask_IO_APIC_irq(cfg); + spin_unlock_irqrestore(&ioapic_lock, flags); +} + +static void mask_IO_APIC_irq(unsigned int irq) +{ + struct irq_desc *desc = irq_to_desc(irq); + + mask_IO_APIC_irq_desc(desc); +} +static void unmask_IO_APIC_irq(unsigned int irq) +{ + struct irq_desc *desc = irq_to_desc(irq); + + unmask_IO_APIC_irq_desc(desc); +} + +static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin) +{ + struct IO_APIC_route_entry entry; + + /* Check delivery_mode to be sure we're not clearing an SMI pin */ + entry = ioapic_read_entry(apic, pin); + if (entry.delivery_mode == dest_SMI) + return; + /* + * Disable it in the IO-APIC irq-routing table: + */ + ioapic_mask_entry(apic, pin); +} + +static void clear_IO_APIC (void) +{ + int apic, pin; + + for (apic = 0; apic < nr_ioapics; apic++) + for (pin = 0; pin < nr_ioapic_registers[apic]; pin++) + clear_IO_APIC_pin(apic, pin); +} + +#ifdef CONFIG_X86_32 +/* + * support for broken MP BIOSs, enables hand-redirection of PIRQ0-7 to + * specific CPU-side IRQs. + */ + +#define MAX_PIRQS 8 +static int pirq_entries[MAX_PIRQS] = { + [0 ... MAX_PIRQS - 1] = -1 +}; + +static int __init ioapic_pirq_setup(char *str) +{ + int i, max; + int ints[MAX_PIRQS+1]; + + get_options(str, ARRAY_SIZE(ints), ints); + + apic_printk(APIC_VERBOSE, KERN_INFO + "PIRQ redirection, working around broken MP-BIOS.\n"); + max = MAX_PIRQS; + if (ints[0] < MAX_PIRQS) + max = ints[0]; + + for (i = 0; i < max; i++) { + apic_printk(APIC_VERBOSE, KERN_DEBUG + "... PIRQ%d -> IRQ %d\n", i, ints[i+1]); + /* + * PIRQs are mapped upside down, usually. + */ + pirq_entries[MAX_PIRQS-i-1] = ints[i+1]; + } + return 1; +} + +__setup("pirq=", ioapic_pirq_setup); +#endif /* CONFIG_X86_32 */ + +#ifdef CONFIG_INTR_REMAP +/* I/O APIC RTE contents at the OS boot up */ +static struct IO_APIC_route_entry *early_ioapic_entries[MAX_IO_APICS]; + +/* + * Saves and masks all the unmasked IO-APIC RTE's + */ +int save_mask_IO_APIC_setup(void) +{ + union IO_APIC_reg_01 reg_01; + unsigned long flags; + int apic, pin; + + /* + * The number of IO-APIC IRQ registers (== #pins): + */ + for (apic = 0; apic < nr_ioapics; apic++) { + spin_lock_irqsave(&ioapic_lock, flags); + reg_01.raw = io_apic_read(apic, 1); + spin_unlock_irqrestore(&ioapic_lock, flags); + nr_ioapic_registers[apic] = reg_01.bits.entries+1; + } + + for (apic = 0; apic < nr_ioapics; apic++) { + early_ioapic_entries[apic] = + kzalloc(sizeof(struct IO_APIC_route_entry) * + nr_ioapic_registers[apic], GFP_KERNEL); + if (!early_ioapic_entries[apic]) + goto nomem; + } + + for (apic = 0; apic < nr_ioapics; apic++) + for (pin = 0; pin < nr_ioapic_registers[apic]; pin++) { + struct IO_APIC_route_entry entry; + + entry = early_ioapic_entries[apic][pin] = + ioapic_read_entry(apic, pin); + if (!entry.mask) { + entry.mask = 1; + ioapic_write_entry(apic, pin, entry); + } + } + + return 0; + +nomem: + while (apic >= 0) + kfree(early_ioapic_entries[apic--]); + memset(early_ioapic_entries, 0, + ARRAY_SIZE(early_ioapic_entries)); + + return -ENOMEM; +} + +void restore_IO_APIC_setup(void) +{ + int apic, pin; + + for (apic = 0; apic < nr_ioapics; apic++) { + if (!early_ioapic_entries[apic]) + break; + for (pin = 0; pin < nr_ioapic_registers[apic]; pin++) + ioapic_write_entry(apic, pin, + early_ioapic_entries[apic][pin]); + kfree(early_ioapic_entries[apic]); + early_ioapic_entries[apic] = NULL; + } +} + +void reinit_intr_remapped_IO_APIC(int intr_remapping) +{ + /* + * for now plain restore of previous settings. + * TBD: In the case of OS enabling interrupt-remapping, + * IO-APIC RTE's need to be setup to point to interrupt-remapping + * table entries. for now, do a plain restore, and wait for + * the setup_IO_APIC_irqs() to do proper initialization. + */ + restore_IO_APIC_setup(); +} +#endif + +/* + * Find the IRQ entry number of a certain pin. + */ +static int find_irq_entry(int apic, int pin, int type) +{ + int i; + + for (i = 0; i < mp_irq_entries; i++) + if (mp_irqs[i].irqtype == type && + (mp_irqs[i].dstapic == mp_ioapics[apic].apicid || + mp_irqs[i].dstapic == MP_APIC_ALL) && + mp_irqs[i].dstirq == pin) + return i; + + return -1; +} + +/* + * Find the pin to which IRQ[irq] (ISA) is connected + */ +static int __init find_isa_irq_pin(int irq, int type) +{ + int i; + + for (i = 0; i < mp_irq_entries; i++) { + int lbus = mp_irqs[i].srcbus; + + if (test_bit(lbus, mp_bus_not_pci) && + (mp_irqs[i].irqtype == type) && + (mp_irqs[i].srcbusirq == irq)) + + return mp_irqs[i].dstirq; + } + return -1; +} + +static int __init find_isa_irq_apic(int irq, int type) +{ + int i; + + for (i = 0; i < mp_irq_entries; i++) { + int lbus = mp_irqs[i].srcbus; + + if (test_bit(lbus, mp_bus_not_pci) && + (mp_irqs[i].irqtype == type) && + (mp_irqs[i].srcbusirq == irq)) + break; + } + if (i < mp_irq_entries) { + int apic; + for(apic = 0; apic < nr_ioapics; apic++) { + if (mp_ioapics[apic].apicid == mp_irqs[i].dstapic) + return apic; + } + } + + return -1; +} + +/* + * Find a specific PCI IRQ entry. + * Not an __init, possibly needed by modules + */ +static int pin_2_irq(int idx, int apic, int pin); + +int IO_APIC_get_PCI_irq_vector(int bus, int slot, int pin) +{ + int apic, i, best_guess = -1; + + apic_printk(APIC_DEBUG, "querying PCI -> IRQ mapping bus:%d, slot:%d, pin:%d.\n", + bus, slot, pin); + if (test_bit(bus, mp_bus_not_pci)) { + apic_printk(APIC_VERBOSE, "PCI BIOS passed nonexistent PCI bus %d!\n", bus); + return -1; + } + for (i = 0; i < mp_irq_entries; i++) { + int lbus = mp_irqs[i].srcbus; + + for (apic = 0; apic < nr_ioapics; apic++) + if (mp_ioapics[apic].apicid == mp_irqs[i].dstapic || + mp_irqs[i].dstapic == MP_APIC_ALL) + break; + + if (!test_bit(lbus, mp_bus_not_pci) && + !mp_irqs[i].irqtype && + (bus == lbus) && + (slot == ((mp_irqs[i].srcbusirq >> 2) & 0x1f))) { + int irq = pin_2_irq(i, apic, mp_irqs[i].dstirq); + + if (!(apic || IO_APIC_IRQ(irq))) + continue; + + if (pin == (mp_irqs[i].srcbusirq & 3)) + return irq; + /* + * Use the first all-but-pin matching entry as a + * best-guess fuzzy result for broken mptables. + */ + if (best_guess < 0) + best_guess = irq; + } + } + return best_guess; +} + +EXPORT_SYMBOL(IO_APIC_get_PCI_irq_vector); + +#if defined(CONFIG_EISA) || defined(CONFIG_MCA) +/* + * EISA Edge/Level control register, ELCR + */ +static int EISA_ELCR(unsigned int irq) +{ + if (irq < NR_IRQS_LEGACY) { + unsigned int port = 0x4d0 + (irq >> 3); + return (inb(port) >> (irq & 7)) & 1; + } + apic_printk(APIC_VERBOSE, KERN_INFO + "Broken MPtable reports ISA irq %d\n", irq); + return 0; +} + +#endif + +/* ISA interrupts are always polarity zero edge triggered, + * when listed as conforming in the MP table. */ + +#define default_ISA_trigger(idx) (0) +#define default_ISA_polarity(idx) (0) + +/* EISA interrupts are always polarity zero and can be edge or level + * trigger depending on the ELCR value. If an interrupt is listed as + * EISA conforming in the MP table, that means its trigger type must + * be read in from the ELCR */ + +#define default_EISA_trigger(idx) (EISA_ELCR(mp_irqs[idx].srcbusirq)) +#define default_EISA_polarity(idx) default_ISA_polarity(idx) + +/* PCI interrupts are always polarity one level triggered, + * when listed as conforming in the MP table. */ + +#define default_PCI_trigger(idx) (1) +#define default_PCI_polarity(idx) (1) + +/* MCA interrupts are always polarity zero level triggered, + * when listed as conforming in the MP table. */ + +#define default_MCA_trigger(idx) (1) +#define default_MCA_polarity(idx) default_ISA_polarity(idx) + +static int MPBIOS_polarity(int idx) +{ + int bus = mp_irqs[idx].srcbus; + int polarity; + + /* + * Determine IRQ line polarity (high active or low active): + */ + switch (mp_irqs[idx].irqflag & 3) + { + case 0: /* conforms, ie. bus-type dependent polarity */ + if (test_bit(bus, mp_bus_not_pci)) + polarity = default_ISA_polarity(idx); + else + polarity = default_PCI_polarity(idx); + break; + case 1: /* high active */ + { + polarity = 0; + break; + } + case 2: /* reserved */ + { + printk(KERN_WARNING "broken BIOS!!\n"); + polarity = 1; + break; + } + case 3: /* low active */ + { + polarity = 1; + break; + } + default: /* invalid */ + { + printk(KERN_WARNING "broken BIOS!!\n"); + polarity = 1; + break; + } + } + return polarity; +} + +static int MPBIOS_trigger(int idx) +{ + int bus = mp_irqs[idx].srcbus; + int trigger; + + /* + * Determine IRQ trigger mode (edge or level sensitive): + */ + switch ((mp_irqs[idx].irqflag>>2) & 3) + { + case 0: /* conforms, ie. bus-type dependent */ + if (test_bit(bus, mp_bus_not_pci)) + trigger = default_ISA_trigger(idx); + else + trigger = default_PCI_trigger(idx); +#if defined(CONFIG_EISA) || defined(CONFIG_MCA) + switch (mp_bus_id_to_type[bus]) { + case MP_BUS_ISA: /* ISA pin */ + { + /* set before the switch */ + break; + } + case MP_BUS_EISA: /* EISA pin */ + { + trigger = default_EISA_trigger(idx); + break; + } + case MP_BUS_PCI: /* PCI pin */ + { + /* set before the switch */ + break; + } + case MP_BUS_MCA: /* MCA pin */ + { + trigger = default_MCA_trigger(idx); + break; + } + default: + { + printk(KERN_WARNING "broken BIOS!!\n"); + trigger = 1; + break; + } + } +#endif + break; + case 1: /* edge */ + { + trigger = 0; + break; + } + case 2: /* reserved */ + { + printk(KERN_WARNING "broken BIOS!!\n"); + trigger = 1; + break; + } + case 3: /* level */ + { + trigger = 1; + break; + } + default: /* invalid */ + { + printk(KERN_WARNING "broken BIOS!!\n"); + trigger = 0; + break; + } + } + return trigger; +} + +static inline int irq_polarity(int idx) +{ + return MPBIOS_polarity(idx); +} + +static inline int irq_trigger(int idx) +{ + return MPBIOS_trigger(idx); +} + +int (*ioapic_renumber_irq)(int ioapic, int irq); +static int pin_2_irq(int idx, int apic, int pin) +{ + int irq, i; + int bus = mp_irqs[idx].srcbus; + + /* + * Debugging check, we are in big trouble if this message pops up! + */ + if (mp_irqs[idx].dstirq != pin) + printk(KERN_ERR "broken BIOS or MPTABLE parser, ayiee!!\n"); + + if (test_bit(bus, mp_bus_not_pci)) { + irq = mp_irqs[idx].srcbusirq; + } else { + /* + * PCI IRQs are mapped in order + */ + i = irq = 0; + while (i < apic) + irq += nr_ioapic_registers[i++]; + irq += pin; + /* + * For MPS mode, so far only needed by ES7000 platform + */ + if (ioapic_renumber_irq) + irq = ioapic_renumber_irq(apic, irq); + } + +#ifdef CONFIG_X86_32 + /* + * PCI IRQ command line redirection. Yes, limits are hardcoded. + */ + if ((pin >= 16) && (pin <= 23)) { + if (pirq_entries[pin-16] != -1) { + if (!pirq_entries[pin-16]) { + apic_printk(APIC_VERBOSE, KERN_DEBUG + "disabling PIRQ%d\n", pin-16); + } else { + irq = pirq_entries[pin-16]; + apic_printk(APIC_VERBOSE, KERN_DEBUG + "using PIRQ%d -> IRQ %d\n", + pin-16, irq); + } + } + } +#endif + + return irq; +} + +void lock_vector_lock(void) +{ + /* Used to the online set of cpus does not change + * during assign_irq_vector. + */ + spin_lock(&vector_lock); +} + +void unlock_vector_lock(void) +{ + spin_unlock(&vector_lock); +} + +static int +__assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask) +{ + /* + * NOTE! The local APIC isn't very good at handling + * multiple interrupts at the same interrupt level. + * As the interrupt level is determined by taking the + * vector number and shifting that right by 4, we + * want to spread these out a bit so that they don't + * all fall in the same interrupt level. + * + * Also, we've got to be careful not to trash gate + * 0x80, because int 0x80 is hm, kind of importantish. ;) + */ + static int current_vector = FIRST_DEVICE_VECTOR, current_offset = 0; + unsigned int old_vector; + int cpu, err; + cpumask_var_t tmp_mask; + + if ((cfg->move_in_progress) || cfg->move_cleanup_count) + return -EBUSY; + + if (!alloc_cpumask_var(&tmp_mask, GFP_ATOMIC)) + return -ENOMEM; + + old_vector = cfg->vector; + if (old_vector) { + cpumask_and(tmp_mask, mask, cpu_online_mask); + cpumask_and(tmp_mask, cfg->domain, tmp_mask); + if (!cpumask_empty(tmp_mask)) { + free_cpumask_var(tmp_mask); + return 0; + } + } + + /* Only try and allocate irqs on cpus that are present */ + err = -ENOSPC; + for_each_cpu_and(cpu, mask, cpu_online_mask) { + int new_cpu; + int vector, offset; + + apic->vector_allocation_domain(cpu, tmp_mask); + + vector = current_vector; + offset = current_offset; +next: + vector += 8; + if (vector >= first_system_vector) { + /* If out of vectors on large boxen, must share them. */ + offset = (offset + 1) % 8; + vector = FIRST_DEVICE_VECTOR + offset; + } + if (unlikely(current_vector == vector)) + continue; + + if (test_bit(vector, used_vectors)) + goto next; + + for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask) + if (per_cpu(vector_irq, new_cpu)[vector] != -1) + goto next; + /* Found one! */ + current_vector = vector; + current_offset = offset; + if (old_vector) { + cfg->move_in_progress = 1; + cpumask_copy(cfg->old_domain, cfg->domain); + } + for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask) + per_cpu(vector_irq, new_cpu)[vector] = irq; + cfg->vector = vector; + cpumask_copy(cfg->domain, tmp_mask); + err = 0; + break; + } + free_cpumask_var(tmp_mask); + return err; +} + +static int +assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask) +{ + int err; + unsigned long flags; + + spin_lock_irqsave(&vector_lock, flags); + err = __assign_irq_vector(irq, cfg, mask); + spin_unlock_irqrestore(&vector_lock, flags); + return err; +} + +static void __clear_irq_vector(int irq, struct irq_cfg *cfg) +{ + int cpu, vector; + + BUG_ON(!cfg->vector); + + vector = cfg->vector; + for_each_cpu_and(cpu, cfg->domain, cpu_online_mask) + per_cpu(vector_irq, cpu)[vector] = -1; + + cfg->vector = 0; + cpumask_clear(cfg->domain); + + if (likely(!cfg->move_in_progress)) + return; + for_each_cpu_and(cpu, cfg->old_domain, cpu_online_mask) { + for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; + vector++) { + if (per_cpu(vector_irq, cpu)[vector] != irq) + continue; + per_cpu(vector_irq, cpu)[vector] = -1; + break; + } + } + cfg->move_in_progress = 0; +} + +void __setup_vector_irq(int cpu) +{ + /* Initialize vector_irq on a new cpu */ + /* This function must be called with vector_lock held */ + int irq, vector; + struct irq_cfg *cfg; + struct irq_desc *desc; + + /* Mark the inuse vectors */ + for_each_irq_desc(irq, desc) { + cfg = desc->chip_data; + if (!cpumask_test_cpu(cpu, cfg->domain)) + continue; + vector = cfg->vector; + per_cpu(vector_irq, cpu)[vector] = irq; + } + /* Mark the free vectors */ + for (vector = 0; vector < NR_VECTORS; ++vector) { + irq = per_cpu(vector_irq, cpu)[vector]; + if (irq < 0) + continue; + + cfg = irq_cfg(irq); + if (!cpumask_test_cpu(cpu, cfg->domain)) + per_cpu(vector_irq, cpu)[vector] = -1; + } +} + +static struct irq_chip ioapic_chip; +#ifdef CONFIG_INTR_REMAP +static struct irq_chip ir_ioapic_chip; +#endif + +#define IOAPIC_AUTO -1 +#define IOAPIC_EDGE 0 +#define IOAPIC_LEVEL 1 + +#ifdef CONFIG_X86_32 +static inline int IO_APIC_irq_trigger(int irq) +{ + int apic, idx, pin; + + for (apic = 0; apic < nr_ioapics; apic++) { + for (pin = 0; pin < nr_ioapic_registers[apic]; pin++) { + idx = find_irq_entry(apic, pin, mp_INT); + if ((idx != -1) && (irq == pin_2_irq(idx, apic, pin))) + return irq_trigger(idx); + } + } + /* + * nonexistent IRQs are edge default + */ + return 0; +} +#else +static inline int IO_APIC_irq_trigger(int irq) +{ + return 1; +} +#endif + +static void ioapic_register_intr(int irq, struct irq_desc *desc, unsigned long trigger) +{ + + if ((trigger == IOAPIC_AUTO && IO_APIC_irq_trigger(irq)) || + trigger == IOAPIC_LEVEL) + desc->status |= IRQ_LEVEL; + else + desc->status &= ~IRQ_LEVEL; + +#ifdef CONFIG_INTR_REMAP + if (irq_remapped(irq)) { + desc->status |= IRQ_MOVE_PCNTXT; + if (trigger) + set_irq_chip_and_handler_name(irq, &ir_ioapic_chip, + handle_fasteoi_irq, + "fasteoi"); + else + set_irq_chip_and_handler_name(irq, &ir_ioapic_chip, + handle_edge_irq, "edge"); + return; + } +#endif + if ((trigger == IOAPIC_AUTO && IO_APIC_irq_trigger(irq)) || + trigger == IOAPIC_LEVEL) + set_irq_chip_and_handler_name(irq, &ioapic_chip, + handle_fasteoi_irq, + "fasteoi"); + else + set_irq_chip_and_handler_name(irq, &ioapic_chip, + handle_edge_irq, "edge"); +} + +int setup_ioapic_entry(int apic_id, int irq, + struct IO_APIC_route_entry *entry, + unsigned int destination, int trigger, + int polarity, int vector) +{ + /* + * add it to the IO-APIC irq-routing table: + */ + memset(entry,0,sizeof(*entry)); + +#ifdef CONFIG_INTR_REMAP + if (intr_remapping_enabled) { + struct intel_iommu *iommu = map_ioapic_to_ir(apic_id); + struct irte irte; + struct IR_IO_APIC_route_entry *ir_entry = + (struct IR_IO_APIC_route_entry *) entry; + int index; + + if (!iommu) + panic("No mapping iommu for ioapic %d\n", apic_id); + + index = alloc_irte(iommu, irq, 1); + if (index < 0) + panic("Failed to allocate IRTE for ioapic %d\n", apic_id); + + memset(&irte, 0, sizeof(irte)); + + irte.present = 1; + irte.dst_mode = apic->irq_dest_mode; + irte.trigger_mode = trigger; + irte.dlvry_mode = apic->irq_delivery_mode; + irte.vector = vector; + irte.dest_id = IRTE_DEST(destination); + + modify_irte(irq, &irte); + + ir_entry->index2 = (index >> 15) & 0x1; + ir_entry->zero = 0; + ir_entry->format = 1; + ir_entry->index = (index & 0x7fff); + } else +#endif + { + entry->delivery_mode = apic->irq_delivery_mode; + entry->dest_mode = apic->irq_dest_mode; + entry->dest = destination; + } + + entry->mask = 0; /* enable IRQ */ + entry->trigger = trigger; + entry->polarity = polarity; + entry->vector = vector; + + /* Mask level triggered irqs. + * Use IRQ_DELAYED_DISABLE for edge triggered irqs. + */ + if (trigger) + entry->mask = 1; + return 0; +} + +static void setup_IO_APIC_irq(int apic_id, int pin, unsigned int irq, struct irq_desc *desc, + int trigger, int polarity) +{ + struct irq_cfg *cfg; + struct IO_APIC_route_entry entry; + unsigned int dest; + + if (!IO_APIC_IRQ(irq)) + return; + + cfg = desc->chip_data; + + if (assign_irq_vector(irq, cfg, apic->target_cpus())) + return; + + dest = apic->cpu_mask_to_apicid_and(cfg->domain, apic->target_cpus()); + + apic_printk(APIC_VERBOSE,KERN_DEBUG + "IOAPIC[%d]: Set routing entry (%d-%d -> 0x%x -> " + "IRQ %d Mode:%i Active:%i)\n", + apic_id, mp_ioapics[apic_id].apicid, pin, cfg->vector, + irq, trigger, polarity); + + + if (setup_ioapic_entry(mp_ioapics[apic_id].apicid, irq, &entry, + dest, trigger, polarity, cfg->vector)) { + printk("Failed to setup ioapic entry for ioapic %d, pin %d\n", + mp_ioapics[apic_id].apicid, pin); + __clear_irq_vector(irq, cfg); + return; + } + + ioapic_register_intr(irq, desc, trigger); + if (irq < NR_IRQS_LEGACY) + disable_8259A_irq(irq); + + ioapic_write_entry(apic_id, pin, entry); +} + +static void __init setup_IO_APIC_irqs(void) +{ + int apic_id, pin, idx, irq; + int notcon = 0; + struct irq_desc *desc; + struct irq_cfg *cfg; + int cpu = boot_cpu_id; + + apic_printk(APIC_VERBOSE, KERN_DEBUG "init IO_APIC IRQs\n"); + + for (apic_id = 0; apic_id < nr_ioapics; apic_id++) { + for (pin = 0; pin < nr_ioapic_registers[apic_id]; pin++) { + + idx = find_irq_entry(apic_id, pin, mp_INT); + if (idx == -1) { + if (!notcon) { + notcon = 1; + apic_printk(APIC_VERBOSE, + KERN_DEBUG " %d-%d", + mp_ioapics[apic_id].apicid, pin); + } else + apic_printk(APIC_VERBOSE, " %d-%d", + mp_ioapics[apic_id].apicid, pin); + continue; + } + if (notcon) { + apic_printk(APIC_VERBOSE, + " (apicid-pin) not connected\n"); + notcon = 0; + } + + irq = pin_2_irq(idx, apic_id, pin); + + /* + * Skip the timer IRQ if there's a quirk handler + * installed and if it returns 1: + */ + if (apic->multi_timer_check && + apic->multi_timer_check(apic_id, irq)) + continue; + + desc = irq_to_desc_alloc_cpu(irq, cpu); + if (!desc) { + printk(KERN_INFO "can not get irq_desc for %d\n", irq); + continue; + } + cfg = desc->chip_data; + add_pin_to_irq_cpu(cfg, cpu, apic_id, pin); + + setup_IO_APIC_irq(apic_id, pin, irq, desc, + irq_trigger(idx), irq_polarity(idx)); + } + } + + if (notcon) + apic_printk(APIC_VERBOSE, + " (apicid-pin) not connected\n"); +} + +/* + * Set up the timer pin, possibly with the 8259A-master behind. + */ +static void __init setup_timer_IRQ0_pin(unsigned int apic_id, unsigned int pin, + int vector) +{ + struct IO_APIC_route_entry entry; + +#ifdef CONFIG_INTR_REMAP + if (intr_remapping_enabled) + return; +#endif + + memset(&entry, 0, sizeof(entry)); + + /* + * We use logical delivery to get the timer IRQ + * to the first CPU. + */ + entry.dest_mode = apic->irq_dest_mode; + entry.mask = 0; /* don't mask IRQ for edge */ + entry.dest = apic->cpu_mask_to_apicid(apic->target_cpus()); + entry.delivery_mode = apic->irq_delivery_mode; + entry.polarity = 0; + entry.trigger = 0; + entry.vector = vector; + + /* + * The timer IRQ doesn't have to know that behind the + * scene we may have a 8259A-master in AEOI mode ... + */ + set_irq_chip_and_handler_name(0, &ioapic_chip, handle_edge_irq, "edge"); + + /* + * Add it to the IO-APIC irq-routing table: + */ + ioapic_write_entry(apic_id, pin, entry); +} + + +__apicdebuginit(void) print_IO_APIC(void) +{ + int apic, i; + union IO_APIC_reg_00 reg_00; + union IO_APIC_reg_01 reg_01; + union IO_APIC_reg_02 reg_02; + union IO_APIC_reg_03 reg_03; + unsigned long flags; + struct irq_cfg *cfg; + struct irq_desc *desc; + unsigned int irq; + + if (apic_verbosity == APIC_QUIET) + return; + + printk(KERN_DEBUG "number of MP IRQ sources: %d.\n", mp_irq_entries); + for (i = 0; i < nr_ioapics; i++) + printk(KERN_DEBUG "number of IO-APIC #%d registers: %d.\n", + mp_ioapics[i].apicid, nr_ioapic_registers[i]); + + /* + * We are a bit conservative about what we expect. We have to + * know about every hardware change ASAP. + */ + printk(KERN_INFO "testing the IO APIC.......................\n"); + + for (apic = 0; apic < nr_ioapics; apic++) { + + spin_lock_irqsave(&ioapic_lock, flags); + reg_00.raw = io_apic_read(apic, 0); + reg_01.raw = io_apic_read(apic, 1); + if (reg_01.bits.version >= 0x10) + reg_02.raw = io_apic_read(apic, 2); + if (reg_01.bits.version >= 0x20) + reg_03.raw = io_apic_read(apic, 3); + spin_unlock_irqrestore(&ioapic_lock, flags); + + printk("\n"); + printk(KERN_DEBUG "IO APIC #%d......\n", mp_ioapics[apic].apicid); + printk(KERN_DEBUG ".... register #00: %08X\n", reg_00.raw); + printk(KERN_DEBUG "....... : physical APIC id: %02X\n", reg_00.bits.ID); + printk(KERN_DEBUG "....... : Delivery Type: %X\n", reg_00.bits.delivery_type); + printk(KERN_DEBUG "....... : LTS : %X\n", reg_00.bits.LTS); + + printk(KERN_DEBUG ".... register #01: %08X\n", *(int *)®_01); + printk(KERN_DEBUG "....... : max redirection entries: %04X\n", reg_01.bits.entries); + + printk(KERN_DEBUG "....... : PRQ implemented: %X\n", reg_01.bits.PRQ); + printk(KERN_DEBUG "....... : IO APIC version: %04X\n", reg_01.bits.version); + + /* + * Some Intel chipsets with IO APIC VERSION of 0x1? don't have reg_02, + * but the value of reg_02 is read as the previous read register + * value, so ignore it if reg_02 == reg_01. + */ + if (reg_01.bits.version >= 0x10 && reg_02.raw != reg_01.raw) { + printk(KERN_DEBUG ".... register #02: %08X\n", reg_02.raw); + printk(KERN_DEBUG "....... : arbitration: %02X\n", reg_02.bits.arbitration); + } + + /* + * Some Intel chipsets with IO APIC VERSION of 0x2? don't have reg_02 + * or reg_03, but the value of reg_0[23] is read as the previous read + * register value, so ignore it if reg_03 == reg_0[12]. + */ + if (reg_01.bits.version >= 0x20 && reg_03.raw != reg_02.raw && + reg_03.raw != reg_01.raw) { + printk(KERN_DEBUG ".... register #03: %08X\n", reg_03.raw); + printk(KERN_DEBUG "....... : Boot DT : %X\n", reg_03.bits.boot_DT); + } + + printk(KERN_DEBUG ".... IRQ redirection table:\n"); + + printk(KERN_DEBUG " NR Dst Mask Trig IRR Pol" + " Stat Dmod Deli Vect: \n"); + + for (i = 0; i <= reg_01.bits.entries; i++) { + struct IO_APIC_route_entry entry; + + entry = ioapic_read_entry(apic, i); + + printk(KERN_DEBUG " %02x %03X ", + i, + entry.dest + ); + + printk("%1d %1d %1d %1d %1d %1d %1d %02X\n", + entry.mask, + entry.trigger, + entry.irr, + entry.polarity, + entry.delivery_status, + entry.dest_mode, + entry.delivery_mode, + entry.vector + ); + } + } + printk(KERN_DEBUG "IRQ to pin mappings:\n"); + for_each_irq_desc(irq, desc) { + struct irq_pin_list *entry; + + cfg = desc->chip_data; + entry = cfg->irq_2_pin; + if (!entry) + continue; + printk(KERN_DEBUG "IRQ%d ", irq); + for (;;) { + printk("-> %d:%d", entry->apic, entry->pin); + if (!entry->next) + break; + entry = entry->next; + } + printk("\n"); + } + + printk(KERN_INFO ".................................... done.\n"); + + return; +} + +__apicdebuginit(void) print_APIC_bitfield(int base) +{ + unsigned int v; + int i, j; + + if (apic_verbosity == APIC_QUIET) + return; + + printk(KERN_DEBUG "0123456789abcdef0123456789abcdef\n" KERN_DEBUG); + for (i = 0; i < 8; i++) { + v = apic_read(base + i*0x10); + for (j = 0; j < 32; j++) { + if (v & (1< 3) /* Due to the Pentium erratum 3AP. */ + apic_write(APIC_ESR, 0); + + v = apic_read(APIC_ESR); + printk(KERN_DEBUG "... APIC ESR: %08x\n", v); + } + + icr = apic_icr_read(); + printk(KERN_DEBUG "... APIC ICR: %08x\n", (u32)icr); + printk(KERN_DEBUG "... APIC ICR2: %08x\n", (u32)(icr >> 32)); + + v = apic_read(APIC_LVTT); + printk(KERN_DEBUG "... APIC LVTT: %08x\n", v); + + if (maxlvt > 3) { /* PC is LVT#4. */ + v = apic_read(APIC_LVTPC); + printk(KERN_DEBUG "... APIC LVTPC: %08x\n", v); + } + v = apic_read(APIC_LVT0); + printk(KERN_DEBUG "... APIC LVT0: %08x\n", v); + v = apic_read(APIC_LVT1); + printk(KERN_DEBUG "... APIC LVT1: %08x\n", v); + + if (maxlvt > 2) { /* ERR is LVT#3. */ + v = apic_read(APIC_LVTERR); + printk(KERN_DEBUG "... APIC LVTERR: %08x\n", v); + } + + v = apic_read(APIC_TMICT); + printk(KERN_DEBUG "... APIC TMICT: %08x\n", v); + v = apic_read(APIC_TMCCT); + printk(KERN_DEBUG "... APIC TMCCT: %08x\n", v); + v = apic_read(APIC_TDCR); + printk(KERN_DEBUG "... APIC TDCR: %08x\n", v); + printk("\n"); +} + +__apicdebuginit(void) print_all_local_APICs(void) +{ + int cpu; + + preempt_disable(); + for_each_online_cpu(cpu) + smp_call_function_single(cpu, print_local_APIC, NULL, 1); + preempt_enable(); +} + +__apicdebuginit(void) print_PIC(void) +{ + unsigned int v; + unsigned long flags; + + if (apic_verbosity == APIC_QUIET) + return; + + printk(KERN_DEBUG "\nprinting PIC contents\n"); + + spin_lock_irqsave(&i8259A_lock, flags); + + v = inb(0xa1) << 8 | inb(0x21); + printk(KERN_DEBUG "... PIC IMR: %04x\n", v); + + v = inb(0xa0) << 8 | inb(0x20); + printk(KERN_DEBUG "... PIC IRR: %04x\n", v); + + outb(0x0b,0xa0); + outb(0x0b,0x20); + v = inb(0xa0) << 8 | inb(0x20); + outb(0x0a,0xa0); + outb(0x0a,0x20); + + spin_unlock_irqrestore(&i8259A_lock, flags); + + printk(KERN_DEBUG "... PIC ISR: %04x\n", v); + + v = inb(0x4d1) << 8 | inb(0x4d0); + printk(KERN_DEBUG "... PIC ELCR: %04x\n", v); +} + +__apicdebuginit(int) print_all_ICs(void) +{ + print_PIC(); + print_all_local_APICs(); + print_IO_APIC(); + + return 0; +} + +fs_initcall(print_all_ICs); + + +/* Where if anywhere is the i8259 connect in external int mode */ +static struct { int pin, apic; } ioapic_i8259 = { -1, -1 }; + +void __init enable_IO_APIC(void) +{ + union IO_APIC_reg_01 reg_01; + int i8259_apic, i8259_pin; + int apic; + unsigned long flags; + + /* + * The number of IO-APIC IRQ registers (== #pins): + */ + for (apic = 0; apic < nr_ioapics; apic++) { + spin_lock_irqsave(&ioapic_lock, flags); + reg_01.raw = io_apic_read(apic, 1); + spin_unlock_irqrestore(&ioapic_lock, flags); + nr_ioapic_registers[apic] = reg_01.bits.entries+1; + } + for(apic = 0; apic < nr_ioapics; apic++) { + int pin; + /* See if any of the pins is in ExtINT mode */ + for (pin = 0; pin < nr_ioapic_registers[apic]; pin++) { + struct IO_APIC_route_entry entry; + entry = ioapic_read_entry(apic, pin); + + /* If the interrupt line is enabled and in ExtInt mode + * I have found the pin where the i8259 is connected. + */ + if ((entry.mask == 0) && (entry.delivery_mode == dest_ExtINT)) { + ioapic_i8259.apic = apic; + ioapic_i8259.pin = pin; + goto found_i8259; + } + } + } + found_i8259: + /* Look to see what if the MP table has reported the ExtINT */ + /* If we could not find the appropriate pin by looking at the ioapic + * the i8259 probably is not connected the ioapic but give the + * mptable a chance anyway. + */ + i8259_pin = find_isa_irq_pin(0, mp_ExtINT); + i8259_apic = find_isa_irq_apic(0, mp_ExtINT); + /* Trust the MP table if nothing is setup in the hardware */ + if ((ioapic_i8259.pin == -1) && (i8259_pin >= 0)) { + printk(KERN_WARNING "ExtINT not setup in hardware but reported by MP table\n"); + ioapic_i8259.pin = i8259_pin; + ioapic_i8259.apic = i8259_apic; + } + /* Complain if the MP table and the hardware disagree */ + if (((ioapic_i8259.apic != i8259_apic) || (ioapic_i8259.pin != i8259_pin)) && + (i8259_pin >= 0) && (ioapic_i8259.pin >= 0)) + { + printk(KERN_WARNING "ExtINT in hardware and MP table differ\n"); + } + + /* + * Do not trust the IO-APIC being empty at bootup + */ + clear_IO_APIC(); +} + +/* + * Not an __init, needed by the reboot code + */ +void disable_IO_APIC(void) +{ + /* + * Clear the IO-APIC before rebooting: + */ + clear_IO_APIC(); + + /* + * If the i8259 is routed through an IOAPIC + * Put that IOAPIC in virtual wire mode + * so legacy interrupts can be delivered. + */ + if (ioapic_i8259.pin != -1) { + struct IO_APIC_route_entry entry; + + memset(&entry, 0, sizeof(entry)); + entry.mask = 0; /* Enabled */ + entry.trigger = 0; /* Edge */ + entry.irr = 0; + entry.polarity = 0; /* High */ + entry.delivery_status = 0; + entry.dest_mode = 0; /* Physical */ + entry.delivery_mode = dest_ExtINT; /* ExtInt */ + entry.vector = 0; + entry.dest = read_apic_id(); + + /* + * Add it to the IO-APIC irq-routing table: + */ + ioapic_write_entry(ioapic_i8259.apic, ioapic_i8259.pin, entry); + } + + disconnect_bsp_APIC(ioapic_i8259.pin != -1); +} + +#ifdef CONFIG_X86_32 +/* + * function to set the IO-APIC physical IDs based on the + * values stored in the MPC table. + * + * by Matt Domsch Tue Dec 21 12:25:05 CST 1999 + */ + +static void __init setup_ioapic_ids_from_mpc(void) +{ + union IO_APIC_reg_00 reg_00; + physid_mask_t phys_id_present_map; + int apic_id; + int i; + unsigned char old_id; + unsigned long flags; + + if (x86_quirks->setup_ioapic_ids && x86_quirks->setup_ioapic_ids()) + return; + + /* + * Don't check I/O APIC IDs for xAPIC systems. They have + * no meaning without the serial APIC bus. + */ + if (!(boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) + || APIC_XAPIC(apic_version[boot_cpu_physical_apicid])) + return; + /* + * This is broken; anything with a real cpu count has to + * circumvent this idiocy regardless. + */ + phys_id_present_map = apic->ioapic_phys_id_map(phys_cpu_present_map); + + /* + * Set the IOAPIC ID to the value stored in the MPC table. + */ + for (apic_id = 0; apic_id < nr_ioapics; apic_id++) { + + /* Read the register 0 value */ + spin_lock_irqsave(&ioapic_lock, flags); + reg_00.raw = io_apic_read(apic_id, 0); + spin_unlock_irqrestore(&ioapic_lock, flags); + + old_id = mp_ioapics[apic_id].apicid; + + if (mp_ioapics[apic_id].apicid >= get_physical_broadcast()) { + printk(KERN_ERR "BIOS bug, IO-APIC#%d ID is %d in the MPC table!...\n", + apic_id, mp_ioapics[apic_id].apicid); + printk(KERN_ERR "... fixing up to %d. (tell your hw vendor)\n", + reg_00.bits.ID); + mp_ioapics[apic_id].apicid = reg_00.bits.ID; + } + + /* + * Sanity check, is the ID really free? Every APIC in a + * system must have a unique ID or we get lots of nice + * 'stuck on smp_invalidate_needed IPI wait' messages. + */ + if (apic->check_apicid_used(phys_id_present_map, + mp_ioapics[apic_id].apicid)) { + printk(KERN_ERR "BIOS bug, IO-APIC#%d ID %d is already used!...\n", + apic_id, mp_ioapics[apic_id].apicid); + for (i = 0; i < get_physical_broadcast(); i++) + if (!physid_isset(i, phys_id_present_map)) + break; + if (i >= get_physical_broadcast()) + panic("Max APIC ID exceeded!\n"); + printk(KERN_ERR "... fixing up to %d. (tell your hw vendor)\n", + i); + physid_set(i, phys_id_present_map); + mp_ioapics[apic_id].apicid = i; + } else { + physid_mask_t tmp; + tmp = apic->apicid_to_cpu_present(mp_ioapics[apic_id].apicid); + apic_printk(APIC_VERBOSE, "Setting %d in the " + "phys_id_present_map\n", + mp_ioapics[apic_id].apicid); + physids_or(phys_id_present_map, phys_id_present_map, tmp); + } + + + /* + * We need to adjust the IRQ routing table + * if the ID changed. + */ + if (old_id != mp_ioapics[apic_id].apicid) + for (i = 0; i < mp_irq_entries; i++) + if (mp_irqs[i].dstapic == old_id) + mp_irqs[i].dstapic + = mp_ioapics[apic_id].apicid; + + /* + * Read the right value from the MPC table and + * write it into the ID register. + */ + apic_printk(APIC_VERBOSE, KERN_INFO + "...changing IO-APIC physical APIC ID to %d ...", + mp_ioapics[apic_id].apicid); + + reg_00.bits.ID = mp_ioapics[apic_id].apicid; + spin_lock_irqsave(&ioapic_lock, flags); + io_apic_write(apic_id, 0, reg_00.raw); + spin_unlock_irqrestore(&ioapic_lock, flags); + + /* + * Sanity check + */ + spin_lock_irqsave(&ioapic_lock, flags); + reg_00.raw = io_apic_read(apic_id, 0); + spin_unlock_irqrestore(&ioapic_lock, flags); + if (reg_00.bits.ID != mp_ioapics[apic_id].apicid) + printk("could not set ID!\n"); + else + apic_printk(APIC_VERBOSE, " ok.\n"); + } +} +#endif + +int no_timer_check __initdata; + +static int __init notimercheck(char *s) +{ + no_timer_check = 1; + return 1; +} +__setup("no_timer_check", notimercheck); + +/* + * There is a nasty bug in some older SMP boards, their mptable lies + * about the timer IRQ. We do the following to work around the situation: + * + * - timer IRQ defaults to IO-APIC IRQ + * - if this function detects that timer IRQs are defunct, then we fall + * back to ISA timer IRQs + */ +static int __init timer_irq_works(void) +{ + unsigned long t1 = jiffies; + unsigned long flags; + + if (no_timer_check) + return 1; + + local_save_flags(flags); + local_irq_enable(); + /* Let ten ticks pass... */ + mdelay((10 * 1000) / HZ); + local_irq_restore(flags); + + /* + * Expect a few ticks at least, to be sure some possible + * glue logic does not lock up after one or two first + * ticks in a non-ExtINT mode. Also the local APIC + * might have cached one ExtINT interrupt. Finally, at + * least one tick may be lost due to delays. + */ + + /* jiffies wrap? */ + if (time_after(jiffies, t1 + 4) && + time_before(jiffies, t1 + 16)) + return 1; + + return 0; +} + +/* + * In the SMP+IOAPIC case it might happen that there are an unspecified + * number of pending IRQ events unhandled. These cases are very rare, + * so we 'resend' these IRQs via IPIs, to the same CPU. It's much + * better to do it this way as thus we do not have to be aware of + * 'pending' interrupts in the IRQ path, except at this point. + */ +/* + * Edge triggered needs to resend any interrupt + * that was delayed but this is now handled in the device + * independent code. + */ + +/* + * Starting up a edge-triggered IO-APIC interrupt is + * nasty - we need to make sure that we get the edge. + * If it is already asserted for some reason, we need + * return 1 to indicate that is was pending. + * + * This is not complete - we should be able to fake + * an edge even if it isn't on the 8259A... + */ + +static unsigned int startup_ioapic_irq(unsigned int irq) +{ + int was_pending = 0; + unsigned long flags; + struct irq_cfg *cfg; + + spin_lock_irqsave(&ioapic_lock, flags); + if (irq < NR_IRQS_LEGACY) { + disable_8259A_irq(irq); + if (i8259A_irq_pending(irq)) + was_pending = 1; + } + cfg = irq_cfg(irq); + __unmask_IO_APIC_irq(cfg); + spin_unlock_irqrestore(&ioapic_lock, flags); + + return was_pending; +} + +#ifdef CONFIG_X86_64 +static int ioapic_retrigger_irq(unsigned int irq) +{ + + struct irq_cfg *cfg = irq_cfg(irq); + unsigned long flags; + + spin_lock_irqsave(&vector_lock, flags); + apic->send_IPI_mask(cpumask_of(cpumask_first(cfg->domain)), cfg->vector); + spin_unlock_irqrestore(&vector_lock, flags); + + return 1; +} +#else +static int ioapic_retrigger_irq(unsigned int irq) +{ + apic->send_IPI_self(irq_cfg(irq)->vector); + + return 1; +} +#endif + +/* + * Level and edge triggered IO-APIC interrupts need different handling, + * so we use two separate IRQ descriptors. Edge triggered IRQs can be + * handled with the level-triggered descriptor, but that one has slightly + * more overhead. Level-triggered interrupts cannot be handled with the + * edge-triggered handler, without risking IRQ storms and other ugly + * races. + */ + +#ifdef CONFIG_SMP + +#ifdef CONFIG_INTR_REMAP +static void ir_irq_migration(struct work_struct *work); + +static DECLARE_DELAYED_WORK(ir_migration_work, ir_irq_migration); + +/* + * Migrate the IO-APIC irq in the presence of intr-remapping. + * + * For edge triggered, irq migration is a simple atomic update(of vector + * and cpu destination) of IRTE and flush the hardware cache. + * + * For level triggered, we need to modify the io-apic RTE aswell with the update + * vector information, along with modifying IRTE with vector and destination. + * So irq migration for level triggered is little bit more complex compared to + * edge triggered migration. But the good news is, we use the same algorithm + * for level triggered migration as we have today, only difference being, + * we now initiate the irq migration from process context instead of the + * interrupt context. + * + * In future, when we do a directed EOI (combined with cpu EOI broadcast + * suppression) to the IO-APIC, level triggered irq migration will also be + * as simple as edge triggered migration and we can do the irq migration + * with a simple atomic update to IO-APIC RTE. + */ +static void +migrate_ioapic_irq_desc(struct irq_desc *desc, const struct cpumask *mask) +{ + struct irq_cfg *cfg; + struct irte irte; + int modify_ioapic_rte; + unsigned int dest; + unsigned long flags; + unsigned int irq; + + if (!cpumask_intersects(mask, cpu_online_mask)) + return; + + irq = desc->irq; + if (get_irte(irq, &irte)) + return; + + cfg = desc->chip_data; + if (assign_irq_vector(irq, cfg, mask)) + return; + + set_extra_move_desc(desc, mask); + + dest = apic->cpu_mask_to_apicid_and(cfg->domain, mask); + + modify_ioapic_rte = desc->status & IRQ_LEVEL; + if (modify_ioapic_rte) { + spin_lock_irqsave(&ioapic_lock, flags); + __target_IO_APIC_irq(irq, dest, cfg); + spin_unlock_irqrestore(&ioapic_lock, flags); + } + + irte.vector = cfg->vector; + irte.dest_id = IRTE_DEST(dest); + + /* + * Modified the IRTE and flushes the Interrupt entry cache. + */ + modify_irte(irq, &irte); + + if (cfg->move_in_progress) + send_cleanup_vector(cfg); + + cpumask_copy(desc->affinity, mask); +} + +static int migrate_irq_remapped_level_desc(struct irq_desc *desc) +{ + int ret = -1; + struct irq_cfg *cfg = desc->chip_data; + + mask_IO_APIC_irq_desc(desc); + + if (io_apic_level_ack_pending(cfg)) { + /* + * Interrupt in progress. Migrating irq now will change the + * vector information in the IO-APIC RTE and that will confuse + * the EOI broadcast performed by cpu. + * So, delay the irq migration to the next instance. + */ + schedule_delayed_work(&ir_migration_work, 1); + goto unmask; + } + + /* everthing is clear. we have right of way */ + migrate_ioapic_irq_desc(desc, desc->pending_mask); + + ret = 0; + desc->status &= ~IRQ_MOVE_PENDING; + cpumask_clear(desc->pending_mask); + +unmask: + unmask_IO_APIC_irq_desc(desc); + + return ret; +} + +static void ir_irq_migration(struct work_struct *work) +{ + unsigned int irq; + struct irq_desc *desc; + + for_each_irq_desc(irq, desc) { + if (desc->status & IRQ_MOVE_PENDING) { + unsigned long flags; + + spin_lock_irqsave(&desc->lock, flags); + if (!desc->chip->set_affinity || + !(desc->status & IRQ_MOVE_PENDING)) { + desc->status &= ~IRQ_MOVE_PENDING; + spin_unlock_irqrestore(&desc->lock, flags); + continue; + } + + desc->chip->set_affinity(irq, desc->pending_mask); + spin_unlock_irqrestore(&desc->lock, flags); + } + } +} + +/* + * Migrates the IRQ destination in the process context. + */ +static void set_ir_ioapic_affinity_irq_desc(struct irq_desc *desc, + const struct cpumask *mask) +{ + if (desc->status & IRQ_LEVEL) { + desc->status |= IRQ_MOVE_PENDING; + cpumask_copy(desc->pending_mask, mask); + migrate_irq_remapped_level_desc(desc); + return; + } + + migrate_ioapic_irq_desc(desc, mask); +} +static void set_ir_ioapic_affinity_irq(unsigned int irq, + const struct cpumask *mask) +{ + struct irq_desc *desc = irq_to_desc(irq); + + set_ir_ioapic_affinity_irq_desc(desc, mask); +} +#endif + +asmlinkage void smp_irq_move_cleanup_interrupt(void) +{ + unsigned vector, me; + + ack_APIC_irq(); + exit_idle(); + irq_enter(); + + me = smp_processor_id(); + for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) { + unsigned int irq; + struct irq_desc *desc; + struct irq_cfg *cfg; + irq = __get_cpu_var(vector_irq)[vector]; + + if (irq == -1) + continue; + + desc = irq_to_desc(irq); + if (!desc) + continue; + + cfg = irq_cfg(irq); + spin_lock(&desc->lock); + if (!cfg->move_cleanup_count) + goto unlock; + + if (vector == cfg->vector && cpumask_test_cpu(me, cfg->domain)) + goto unlock; + + __get_cpu_var(vector_irq)[vector] = -1; + cfg->move_cleanup_count--; +unlock: + spin_unlock(&desc->lock); + } + + irq_exit(); +} + +static void irq_complete_move(struct irq_desc **descp) +{ + struct irq_desc *desc = *descp; + struct irq_cfg *cfg = desc->chip_data; + unsigned vector, me; + + if (likely(!cfg->move_in_progress)) { +#ifdef CONFIG_NUMA_MIGRATE_IRQ_DESC + if (likely(!cfg->move_desc_pending)) + return; + + /* domain has not changed, but affinity did */ + me = smp_processor_id(); + if (cpumask_test_cpu(me, desc->affinity)) { + *descp = desc = move_irq_desc(desc, me); + /* get the new one */ + cfg = desc->chip_data; + cfg->move_desc_pending = 0; + } +#endif + return; + } + + vector = ~get_irq_regs()->orig_ax; + me = smp_processor_id(); + + if (vector == cfg->vector && cpumask_test_cpu(me, cfg->domain)) { +#ifdef CONFIG_NUMA_MIGRATE_IRQ_DESC + *descp = desc = move_irq_desc(desc, me); + /* get the new one */ + cfg = desc->chip_data; +#endif + send_cleanup_vector(cfg); + } +} +#else +static inline void irq_complete_move(struct irq_desc **descp) {} +#endif + +#ifdef CONFIG_INTR_REMAP +static void ack_x2apic_level(unsigned int irq) +{ + ack_x2APIC_irq(); +} + +static void ack_x2apic_edge(unsigned int irq) +{ + ack_x2APIC_irq(); +} + +#endif + +static void ack_apic_edge(unsigned int irq) +{ + struct irq_desc *desc = irq_to_desc(irq); + + irq_complete_move(&desc); + move_native_irq(irq); + ack_APIC_irq(); +} + +atomic_t irq_mis_count; + +static void ack_apic_level(unsigned int irq) +{ + struct irq_desc *desc = irq_to_desc(irq); + +#ifdef CONFIG_X86_32 + unsigned long v; + int i; +#endif + struct irq_cfg *cfg; + int do_unmask_irq = 0; + + irq_complete_move(&desc); +#ifdef CONFIG_GENERIC_PENDING_IRQ + /* If we are moving the irq we need to mask it */ + if (unlikely(desc->status & IRQ_MOVE_PENDING) && + !(desc->status & IRQ_INPROGRESS)) { + do_unmask_irq = 1; + mask_IO_APIC_irq_desc(desc); + } +#endif + +#ifdef CONFIG_X86_32 + /* + * It appears there is an erratum which affects at least version 0x11 + * of I/O APIC (that's the 82093AA and cores integrated into various + * chipsets). Under certain conditions a level-triggered interrupt is + * erroneously delivered as edge-triggered one but the respective IRR + * bit gets set nevertheless. As a result the I/O unit expects an EOI + * message but it will never arrive and further interrupts are blocked + * from the source. The exact reason is so far unknown, but the + * phenomenon was observed when two consecutive interrupt requests + * from a given source get delivered to the same CPU and the source is + * temporarily disabled in between. + * + * A workaround is to simulate an EOI message manually. We achieve it + * by setting the trigger mode to edge and then to level when the edge + * trigger mode gets detected in the TMR of a local APIC for a + * level-triggered interrupt. We mask the source for the time of the + * operation to prevent an edge-triggered interrupt escaping meanwhile. + * The idea is from Manfred Spraul. --macro + */ + cfg = desc->chip_data; + i = cfg->vector; + + v = apic_read(APIC_TMR + ((i & ~0x1f) >> 1)); +#endif + + /* + * We must acknowledge the irq before we move it or the acknowledge will + * not propagate properly. + */ + ack_APIC_irq(); + + /* Now we can move and renable the irq */ + if (unlikely(do_unmask_irq)) { + /* Only migrate the irq if the ack has been received. + * + * On rare occasions the broadcast level triggered ack gets + * delayed going to ioapics, and if we reprogram the + * vector while Remote IRR is still set the irq will never + * fire again. + * + * To prevent this scenario we read the Remote IRR bit + * of the ioapic. This has two effects. + * - On any sane system the read of the ioapic will + * flush writes (and acks) going to the ioapic from + * this cpu. + * - We get to see if the ACK has actually been delivered. + * + * Based on failed experiments of reprogramming the + * ioapic entry from outside of irq context starting + * with masking the ioapic entry and then polling until + * Remote IRR was clear before reprogramming the + * ioapic I don't trust the Remote IRR bit to be + * completey accurate. + * + * However there appears to be no other way to plug + * this race, so if the Remote IRR bit is not + * accurate and is causing problems then it is a hardware bug + * and you can go talk to the chipset vendor about it. + */ + cfg = desc->chip_data; + if (!io_apic_level_ack_pending(cfg)) + move_masked_irq(irq); + unmask_IO_APIC_irq_desc(desc); + } +#if (defined(CONFIG_GENERIC_PENDING_IRQ) || defined(CONFIG_IRQBALANCE)) && \ + defined(CONFIG_PREEMPT_HARDIRQS) + /* + * With threaded interrupts, we always have IRQ_INPROGRESS + * when acking. + */ + else if (unlikely(desc->status & IRQ_MOVE_PENDING)) + move_masked_irq(irq); +#endif + +#ifdef CONFIG_X86_32 + if (!(v & (1 << (i & 0x1f)))) { + atomic_inc(&irq_mis_count); + spin_lock(&ioapic_lock); + __mask_and_edge_IO_APIC_irq(cfg); + __unmask_and_level_IO_APIC_irq(cfg); + spin_unlock(&ioapic_lock); + } +#endif +} + +static struct irq_chip ioapic_chip __read_mostly = { + .name = "IO-APIC", + .startup = startup_ioapic_irq, + .mask = mask_IO_APIC_irq, + .unmask = unmask_IO_APIC_irq, + .ack = ack_apic_edge, + .eoi = ack_apic_level, +#ifdef CONFIG_SMP + .set_affinity = set_ioapic_affinity_irq, +#endif + .retrigger = ioapic_retrigger_irq, +}; + +#ifdef CONFIG_INTR_REMAP +static struct irq_chip ir_ioapic_chip __read_mostly = { + .name = "IR-IO-APIC", + .startup = startup_ioapic_irq, + .mask = mask_IO_APIC_irq, + .unmask = unmask_IO_APIC_irq, + .ack = ack_x2apic_edge, + .eoi = ack_x2apic_level, +#ifdef CONFIG_SMP + .set_affinity = set_ir_ioapic_affinity_irq, +#endif + .retrigger = ioapic_retrigger_irq, +}; +#endif + +static inline void init_IO_APIC_traps(void) +{ + int irq; + struct irq_desc *desc; + struct irq_cfg *cfg; + + /* + * NOTE! The local APIC isn't very good at handling + * multiple interrupts at the same interrupt level. + * As the interrupt level is determined by taking the + * vector number and shifting that right by 4, we + * want to spread these out a bit so that they don't + * all fall in the same interrupt level. + * + * Also, we've got to be careful not to trash gate + * 0x80, because int 0x80 is hm, kind of importantish. ;) + */ + for_each_irq_desc(irq, desc) { + cfg = desc->chip_data; + if (IO_APIC_IRQ(irq) && cfg && !cfg->vector) { + /* + * Hmm.. We don't have an entry for this, + * so default to an old-fashioned 8259 + * interrupt if we can.. + */ + if (irq < NR_IRQS_LEGACY) + make_8259A_irq(irq); + else + /* Strange. Oh, well.. */ + desc->chip = &no_irq_chip; + } + } +} + +/* + * The local APIC irq-chip implementation: + */ + +static void mask_lapic_irq(unsigned int irq) +{ + unsigned long v; + + v = apic_read(APIC_LVT0); + apic_write(APIC_LVT0, v | APIC_LVT_MASKED); +} + +static void unmask_lapic_irq(unsigned int irq) +{ + unsigned long v; + + v = apic_read(APIC_LVT0); + apic_write(APIC_LVT0, v & ~APIC_LVT_MASKED); +} + +static void ack_lapic_irq(unsigned int irq) +{ + ack_APIC_irq(); +} + +static struct irq_chip lapic_chip __read_mostly = { + .name = "local-APIC", + .mask = mask_lapic_irq, + .unmask = unmask_lapic_irq, + .ack = ack_lapic_irq, +}; + +static void lapic_register_intr(int irq, struct irq_desc *desc) +{ + desc->status &= ~IRQ_LEVEL; + set_irq_chip_and_handler_name(irq, &lapic_chip, handle_edge_irq, + "edge"); +} + +static void __init setup_nmi(void) +{ + /* + * Dirty trick to enable the NMI watchdog ... + * We put the 8259A master into AEOI mode and + * unmask on all local APICs LVT0 as NMI. + * + * The idea to use the 8259A in AEOI mode ('8259A Virtual Wire') + * is from Maciej W. Rozycki - so we do not have to EOI from + * the NMI handler or the timer interrupt. + */ + apic_printk(APIC_VERBOSE, KERN_INFO "activating NMI Watchdog ..."); + + enable_NMI_through_LVT0(); + + apic_printk(APIC_VERBOSE, " done.\n"); +} + +/* + * This looks a bit hackish but it's about the only one way of sending + * a few INTA cycles to 8259As and any associated glue logic. ICR does + * not support the ExtINT mode, unfortunately. We need to send these + * cycles as some i82489DX-based boards have glue logic that keeps the + * 8259A interrupt line asserted until INTA. --macro + */ +static inline void __init unlock_ExtINT_logic(void) +{ + int apic, pin, i; + struct IO_APIC_route_entry entry0, entry1; + unsigned char save_control, save_freq_select; + + pin = find_isa_irq_pin(8, mp_INT); + if (pin == -1) { + WARN_ON_ONCE(1); + return; + } + apic = find_isa_irq_apic(8, mp_INT); + if (apic == -1) { + WARN_ON_ONCE(1); + return; + } + + entry0 = ioapic_read_entry(apic, pin); + clear_IO_APIC_pin(apic, pin); + + memset(&entry1, 0, sizeof(entry1)); + + entry1.dest_mode = 0; /* physical delivery */ + entry1.mask = 0; /* unmask IRQ now */ + entry1.dest = hard_smp_processor_id(); + entry1.delivery_mode = dest_ExtINT; + entry1.polarity = entry0.polarity; + entry1.trigger = 0; + entry1.vector = 0; + + ioapic_write_entry(apic, pin, entry1); + + save_control = CMOS_READ(RTC_CONTROL); + save_freq_select = CMOS_READ(RTC_FREQ_SELECT); + CMOS_WRITE((save_freq_select & ~RTC_RATE_SELECT) | 0x6, + RTC_FREQ_SELECT); + CMOS_WRITE(save_control | RTC_PIE, RTC_CONTROL); + + i = 100; + while (i-- > 0) { + mdelay(10); + if ((CMOS_READ(RTC_INTR_FLAGS) & RTC_PF) == RTC_PF) + i -= 10; + } + + CMOS_WRITE(save_control, RTC_CONTROL); + CMOS_WRITE(save_freq_select, RTC_FREQ_SELECT); + clear_IO_APIC_pin(apic, pin); + + ioapic_write_entry(apic, pin, entry0); +} + +static int disable_timer_pin_1 __initdata; +/* Actually the next is obsolete, but keep it for paranoid reasons -AK */ +static int __init disable_timer_pin_setup(char *arg) +{ + disable_timer_pin_1 = 1; + return 0; +} +early_param("disable_timer_pin_1", disable_timer_pin_setup); + +int timer_through_8259 __initdata; + +/* + * This code may look a bit paranoid, but it's supposed to cooperate with + * a wide range of boards and BIOS bugs. Fortunately only the timer IRQ + * is so screwy. Thanks to Brian Perkins for testing/hacking this beast + * fanatically on his truly buggy board. + * + * FIXME: really need to revamp this for all platforms. + */ +static inline void __init check_timer(void) +{ + struct irq_desc *desc = irq_to_desc(0); + struct irq_cfg *cfg = desc->chip_data; + int cpu = boot_cpu_id; + int apic1, pin1, apic2, pin2; + unsigned long flags; + int no_pin1 = 0; + + local_irq_save(flags); + + /* + * get/set the timer IRQ vector: + */ + disable_8259A_irq(0); + assign_irq_vector(0, cfg, apic->target_cpus()); + + /* + * As IRQ0 is to be enabled in the 8259A, the virtual + * wire has to be disabled in the local APIC. Also + * timer interrupts need to be acknowledged manually in + * the 8259A for the i82489DX when using the NMI + * watchdog as that APIC treats NMIs as level-triggered. + * The AEOI mode will finish them in the 8259A + * automatically. + */ + apic_write(APIC_LVT0, APIC_LVT_MASKED | APIC_DM_EXTINT); + init_8259A(1); +#ifdef CONFIG_X86_32 + { + unsigned int ver; + + ver = apic_read(APIC_LVR); + ver = GET_APIC_VERSION(ver); + timer_ack = (nmi_watchdog == NMI_IO_APIC && !APIC_INTEGRATED(ver)); + } +#endif + + pin1 = find_isa_irq_pin(0, mp_INT); + apic1 = find_isa_irq_apic(0, mp_INT); + pin2 = ioapic_i8259.pin; + apic2 = ioapic_i8259.apic; + + apic_printk(APIC_QUIET, KERN_INFO "..TIMER: vector=0x%02X " + "apic1=%d pin1=%d apic2=%d pin2=%d\n", + cfg->vector, apic1, pin1, apic2, pin2); + + /* + * Some BIOS writers are clueless and report the ExtINTA + * I/O APIC input from the cascaded 8259A as the timer + * interrupt input. So just in case, if only one pin + * was found above, try it both directly and through the + * 8259A. + */ + if (pin1 == -1) { +#ifdef CONFIG_INTR_REMAP + if (intr_remapping_enabled) + panic("BIOS bug: timer not connected to IO-APIC"); +#endif + pin1 = pin2; + apic1 = apic2; + no_pin1 = 1; + } else if (pin2 == -1) { + pin2 = pin1; + apic2 = apic1; + } + + if (pin1 != -1) { + /* + * Ok, does IRQ0 through the IOAPIC work? + */ + if (no_pin1) { + add_pin_to_irq_cpu(cfg, cpu, apic1, pin1); + setup_timer_IRQ0_pin(apic1, pin1, cfg->vector); + } else { + /* for edge trigger, setup_IO_APIC_irq already + * leave it unmasked. + * so only need to unmask if it is level-trigger + * do we really have level trigger timer? + */ + int idx; + idx = find_irq_entry(apic1, pin1, mp_INT); + if (idx != -1 && irq_trigger(idx)) + unmask_IO_APIC_irq_desc(desc); + } + if (timer_irq_works()) { + if (nmi_watchdog == NMI_IO_APIC) { + setup_nmi(); + enable_8259A_irq(0); + } + if (disable_timer_pin_1 > 0) + clear_IO_APIC_pin(0, pin1); + goto out; + } +#ifdef CONFIG_INTR_REMAP + if (intr_remapping_enabled) + panic("timer doesn't work through Interrupt-remapped IO-APIC"); +#endif + local_irq_disable(); + clear_IO_APIC_pin(apic1, pin1); + if (!no_pin1) + apic_printk(APIC_QUIET, KERN_ERR "..MP-BIOS bug: " + "8254 timer not connected to IO-APIC\n"); + + apic_printk(APIC_QUIET, KERN_INFO "...trying to set up timer " + "(IRQ0) through the 8259A ...\n"); + apic_printk(APIC_QUIET, KERN_INFO + "..... (found apic %d pin %d) ...\n", apic2, pin2); + /* + * legacy devices should be connected to IO APIC #0 + */ + replace_pin_at_irq_cpu(cfg, cpu, apic1, pin1, apic2, pin2); + setup_timer_IRQ0_pin(apic2, pin2, cfg->vector); + enable_8259A_irq(0); + if (timer_irq_works()) { + apic_printk(APIC_QUIET, KERN_INFO "....... works.\n"); + timer_through_8259 = 1; + if (nmi_watchdog == NMI_IO_APIC) { + disable_8259A_irq(0); + setup_nmi(); + enable_8259A_irq(0); + } + goto out; + } + /* + * Cleanup, just in case ... + */ + local_irq_disable(); + disable_8259A_irq(0); + clear_IO_APIC_pin(apic2, pin2); + apic_printk(APIC_QUIET, KERN_INFO "....... failed.\n"); + } + + if (nmi_watchdog == NMI_IO_APIC) { + apic_printk(APIC_QUIET, KERN_WARNING "timer doesn't work " + "through the IO-APIC - disabling NMI Watchdog!\n"); + nmi_watchdog = NMI_NONE; + } +#ifdef CONFIG_X86_32 + timer_ack = 0; +#endif + + apic_printk(APIC_QUIET, KERN_INFO + "...trying to set up timer as Virtual Wire IRQ...\n"); + + lapic_register_intr(0, desc); + apic_write(APIC_LVT0, APIC_DM_FIXED | cfg->vector); /* Fixed mode */ + enable_8259A_irq(0); + + if (timer_irq_works()) { + apic_printk(APIC_QUIET, KERN_INFO "..... works.\n"); + goto out; + } + local_irq_disable(); + disable_8259A_irq(0); + apic_write(APIC_LVT0, APIC_LVT_MASKED | APIC_DM_FIXED | cfg->vector); + apic_printk(APIC_QUIET, KERN_INFO "..... failed.\n"); + + apic_printk(APIC_QUIET, KERN_INFO + "...trying to set up timer as ExtINT IRQ...\n"); + + init_8259A(0); + make_8259A_irq(0); + apic_write(APIC_LVT0, APIC_DM_EXTINT); + + unlock_ExtINT_logic(); + + if (timer_irq_works()) { + apic_printk(APIC_QUIET, KERN_INFO "..... works.\n"); + goto out; + } + local_irq_disable(); + apic_printk(APIC_QUIET, KERN_INFO "..... failed :(.\n"); + panic("IO-APIC + timer doesn't work! Boot with apic=debug and send a " + "report. Then try booting with the 'noapic' option.\n"); +out: + local_irq_restore(flags); +} + +/* + * Traditionally ISA IRQ2 is the cascade IRQ, and is not available + * to devices. However there may be an I/O APIC pin available for + * this interrupt regardless. The pin may be left unconnected, but + * typically it will be reused as an ExtINT cascade interrupt for + * the master 8259A. In the MPS case such a pin will normally be + * reported as an ExtINT interrupt in the MP table. With ACPI + * there is no provision for ExtINT interrupts, and in the absence + * of an override it would be treated as an ordinary ISA I/O APIC + * interrupt, that is edge-triggered and unmasked by default. We + * used to do this, but it caused problems on some systems because + * of the NMI watchdog and sometimes IRQ0 of the 8254 timer using + * the same ExtINT cascade interrupt to drive the local APIC of the + * bootstrap processor. Therefore we refrain from routing IRQ2 to + * the I/O APIC in all cases now. No actual device should request + * it anyway. --macro + */ +#define PIC_IRQS (1 << PIC_CASCADE_IR) + +void __init setup_IO_APIC(void) +{ + + /* + * calling enable_IO_APIC() is moved to setup_local_APIC for BP + */ + + io_apic_irqs = ~PIC_IRQS; + + apic_printk(APIC_VERBOSE, "ENABLING IO-APIC IRQs\n"); + /* + * Set up IO-APIC IRQ routing. + */ +#ifdef CONFIG_X86_32 + if (!acpi_ioapic) + setup_ioapic_ids_from_mpc(); +#endif + sync_Arb_IDs(); + setup_IO_APIC_irqs(); + init_IO_APIC_traps(); + check_timer(); +} + +/* + * Called after all the initialization is done. If we didnt find any + * APIC bugs then we can allow the modify fast path + */ + +static int __init io_apic_bug_finalize(void) +{ + if (sis_apic_bug == -1) + sis_apic_bug = 0; + return 0; +} + +late_initcall(io_apic_bug_finalize); + +struct sysfs_ioapic_data { + struct sys_device dev; + struct IO_APIC_route_entry entry[0]; +}; +static struct sysfs_ioapic_data * mp_ioapic_data[MAX_IO_APICS]; + +static int ioapic_suspend(struct sys_device *dev, pm_message_t state) +{ + struct IO_APIC_route_entry *entry; + struct sysfs_ioapic_data *data; + int i; + + data = container_of(dev, struct sysfs_ioapic_data, dev); + entry = data->entry; + for (i = 0; i < nr_ioapic_registers[dev->id]; i ++, entry ++ ) + *entry = ioapic_read_entry(dev->id, i); + + return 0; +} + +static int ioapic_resume(struct sys_device *dev) +{ + struct IO_APIC_route_entry *entry; + struct sysfs_ioapic_data *data; + unsigned long flags; + union IO_APIC_reg_00 reg_00; + int i; + + data = container_of(dev, struct sysfs_ioapic_data, dev); + entry = data->entry; + + spin_lock_irqsave(&ioapic_lock, flags); + reg_00.raw = io_apic_read(dev->id, 0); + if (reg_00.bits.ID != mp_ioapics[dev->id].apicid) { + reg_00.bits.ID = mp_ioapics[dev->id].apicid; + io_apic_write(dev->id, 0, reg_00.raw); + } + spin_unlock_irqrestore(&ioapic_lock, flags); + for (i = 0; i < nr_ioapic_registers[dev->id]; i++) + ioapic_write_entry(dev->id, i, entry[i]); + + return 0; +} + +static struct sysdev_class ioapic_sysdev_class = { + .name = "ioapic", + .suspend = ioapic_suspend, + .resume = ioapic_resume, +}; + +static int __init ioapic_init_sysfs(void) +{ + struct sys_device * dev; + int i, size, error; + + error = sysdev_class_register(&ioapic_sysdev_class); + if (error) + return error; + + for (i = 0; i < nr_ioapics; i++ ) { + size = sizeof(struct sys_device) + nr_ioapic_registers[i] + * sizeof(struct IO_APIC_route_entry); + mp_ioapic_data[i] = kzalloc(size, GFP_KERNEL); + if (!mp_ioapic_data[i]) { + printk(KERN_ERR "Can't suspend/resume IOAPIC %d\n", i); + continue; + } + dev = &mp_ioapic_data[i]->dev; + dev->id = i; + dev->cls = &ioapic_sysdev_class; + error = sysdev_register(dev); + if (error) { + kfree(mp_ioapic_data[i]); + mp_ioapic_data[i] = NULL; + printk(KERN_ERR "Can't suspend/resume IOAPIC %d\n", i); + continue; + } + } + + return 0; +} + +device_initcall(ioapic_init_sysfs); + +static int nr_irqs_gsi = NR_IRQS_LEGACY; +/* + * Dynamic irq allocate and deallocation + */ +unsigned int create_irq_nr(unsigned int irq_want) +{ + /* Allocate an unused irq */ + unsigned int irq; + unsigned int new; + unsigned long flags; + struct irq_cfg *cfg_new = NULL; + int cpu = boot_cpu_id; + struct irq_desc *desc_new = NULL; + + irq = 0; + if (irq_want < nr_irqs_gsi) + irq_want = nr_irqs_gsi; + + spin_lock_irqsave(&vector_lock, flags); + for (new = irq_want; new < nr_irqs; new++) { + desc_new = irq_to_desc_alloc_cpu(new, cpu); + if (!desc_new) { + printk(KERN_INFO "can not get irq_desc for %d\n", new); + continue; + } + cfg_new = desc_new->chip_data; + + if (cfg_new->vector != 0) + continue; + if (__assign_irq_vector(new, cfg_new, apic->target_cpus()) == 0) + irq = new; + break; + } + spin_unlock_irqrestore(&vector_lock, flags); + + if (irq > 0) { + dynamic_irq_init(irq); + /* restore it, in case dynamic_irq_init clear it */ + if (desc_new) + desc_new->chip_data = cfg_new; + } + return irq; +} + +int create_irq(void) +{ + unsigned int irq_want; + int irq; + + irq_want = nr_irqs_gsi; + irq = create_irq_nr(irq_want); + + if (irq == 0) + irq = -1; + + return irq; +} + +void destroy_irq(unsigned int irq) +{ + unsigned long flags; + struct irq_cfg *cfg; + struct irq_desc *desc; + + /* store it, in case dynamic_irq_cleanup clear it */ + desc = irq_to_desc(irq); + cfg = desc->chip_data; + dynamic_irq_cleanup(irq); + /* connect back irq_cfg */ + if (desc) + desc->chip_data = cfg; + +#ifdef CONFIG_INTR_REMAP + free_irte(irq); +#endif + spin_lock_irqsave(&vector_lock, flags); + __clear_irq_vector(irq, cfg); + spin_unlock_irqrestore(&vector_lock, flags); +} + +/* + * MSI message composition + */ +#ifdef CONFIG_PCI_MSI +static int msi_compose_msg(struct pci_dev *pdev, unsigned int irq, struct msi_msg *msg) +{ + struct irq_cfg *cfg; + int err; + unsigned dest; + + if (disable_apic) + return -ENXIO; + + cfg = irq_cfg(irq); + err = assign_irq_vector(irq, cfg, apic->target_cpus()); + if (err) + return err; + + dest = apic->cpu_mask_to_apicid_and(cfg->domain, apic->target_cpus()); + +#ifdef CONFIG_INTR_REMAP + if (irq_remapped(irq)) { + struct irte irte; + int ir_index; + u16 sub_handle; + + ir_index = map_irq_to_irte_handle(irq, &sub_handle); + BUG_ON(ir_index == -1); + + memset (&irte, 0, sizeof(irte)); + + irte.present = 1; + irte.dst_mode = apic->irq_dest_mode; + irte.trigger_mode = 0; /* edge */ + irte.dlvry_mode = apic->irq_delivery_mode; + irte.vector = cfg->vector; + irte.dest_id = IRTE_DEST(dest); + + modify_irte(irq, &irte); + + msg->address_hi = MSI_ADDR_BASE_HI; + msg->data = sub_handle; + msg->address_lo = MSI_ADDR_BASE_LO | MSI_ADDR_IR_EXT_INT | + MSI_ADDR_IR_SHV | + MSI_ADDR_IR_INDEX1(ir_index) | + MSI_ADDR_IR_INDEX2(ir_index); + } else +#endif + { + msg->address_hi = MSI_ADDR_BASE_HI; + msg->address_lo = + MSI_ADDR_BASE_LO | + ((apic->irq_dest_mode == 0) ? + MSI_ADDR_DEST_MODE_PHYSICAL: + MSI_ADDR_DEST_MODE_LOGICAL) | + ((apic->irq_delivery_mode != dest_LowestPrio) ? + MSI_ADDR_REDIRECTION_CPU: + MSI_ADDR_REDIRECTION_LOWPRI) | + MSI_ADDR_DEST_ID(dest); + + msg->data = + MSI_DATA_TRIGGER_EDGE | + MSI_DATA_LEVEL_ASSERT | + ((apic->irq_delivery_mode != dest_LowestPrio) ? + MSI_DATA_DELIVERY_FIXED: + MSI_DATA_DELIVERY_LOWPRI) | + MSI_DATA_VECTOR(cfg->vector); + } + return err; +} + +#ifdef CONFIG_SMP +static void set_msi_irq_affinity(unsigned int irq, const struct cpumask *mask) +{ + struct irq_desc *desc = irq_to_desc(irq); + struct irq_cfg *cfg; + struct msi_msg msg; + unsigned int dest; + + dest = set_desc_affinity(desc, mask); + if (dest == BAD_APICID) + return; + + cfg = desc->chip_data; + + read_msi_msg_desc(desc, &msg); + + msg.data &= ~MSI_DATA_VECTOR_MASK; + msg.data |= MSI_DATA_VECTOR(cfg->vector); + msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK; + msg.address_lo |= MSI_ADDR_DEST_ID(dest); + + write_msi_msg_desc(desc, &msg); +} +#ifdef CONFIG_INTR_REMAP +/* + * Migrate the MSI irq to another cpumask. This migration is + * done in the process context using interrupt-remapping hardware. + */ +static void +ir_set_msi_irq_affinity(unsigned int irq, const struct cpumask *mask) +{ + struct irq_desc *desc = irq_to_desc(irq); + struct irq_cfg *cfg = desc->chip_data; + unsigned int dest; + struct irte irte; + + if (get_irte(irq, &irte)) + return; + + dest = set_desc_affinity(desc, mask); + if (dest == BAD_APICID) + return; + + irte.vector = cfg->vector; + irte.dest_id = IRTE_DEST(dest); + + /* + * atomically update the IRTE with the new destination and vector. + */ + modify_irte(irq, &irte); + + /* + * After this point, all the interrupts will start arriving + * at the new destination. So, time to cleanup the previous + * vector allocation. + */ + if (cfg->move_in_progress) + send_cleanup_vector(cfg); +} + +#endif +#endif /* CONFIG_SMP */ + +/* + * IRQ Chip for MSI PCI/PCI-X/PCI-Express Devices, + * which implement the MSI or MSI-X Capability Structure. + */ +static struct irq_chip msi_chip = { + .name = "PCI-MSI", + .unmask = unmask_msi_irq, + .mask = mask_msi_irq, + .ack = ack_apic_edge, +#ifdef CONFIG_SMP + .set_affinity = set_msi_irq_affinity, +#endif + .retrigger = ioapic_retrigger_irq, +}; + +#ifdef CONFIG_INTR_REMAP +static struct irq_chip msi_ir_chip = { + .name = "IR-PCI-MSI", + .unmask = unmask_msi_irq, + .mask = mask_msi_irq, + .ack = ack_x2apic_edge, +#ifdef CONFIG_SMP + .set_affinity = ir_set_msi_irq_affinity, +#endif + .retrigger = ioapic_retrigger_irq, +}; + +/* + * Map the PCI dev to the corresponding remapping hardware unit + * and allocate 'nvec' consecutive interrupt-remapping table entries + * in it. + */ +static int msi_alloc_irte(struct pci_dev *dev, int irq, int nvec) +{ + struct intel_iommu *iommu; + int index; + + iommu = map_dev_to_ir(dev); + if (!iommu) { + printk(KERN_ERR + "Unable to map PCI %s to iommu\n", pci_name(dev)); + return -ENOENT; + } + + index = alloc_irte(iommu, irq, nvec); + if (index < 0) { + printk(KERN_ERR + "Unable to allocate %d IRTE for PCI %s\n", nvec, + pci_name(dev)); + return -ENOSPC; + } + return index; +} +#endif + +static int setup_msi_irq(struct pci_dev *dev, struct msi_desc *msidesc, int irq) +{ + int ret; + struct msi_msg msg; + + ret = msi_compose_msg(dev, irq, &msg); + if (ret < 0) + return ret; + + set_irq_msi(irq, msidesc); + write_msi_msg(irq, &msg); + +#ifdef CONFIG_INTR_REMAP + if (irq_remapped(irq)) { + struct irq_desc *desc = irq_to_desc(irq); + /* + * irq migration in process context + */ + desc->status |= IRQ_MOVE_PCNTXT; + set_irq_chip_and_handler_name(irq, &msi_ir_chip, handle_edge_irq, "edge"); + } else +#endif + set_irq_chip_and_handler_name(irq, &msi_chip, handle_edge_irq, "edge"); + + dev_printk(KERN_DEBUG, &dev->dev, "irq %d for MSI/MSI-X\n", irq); + + return 0; +} + +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) +{ + unsigned int irq; + int ret, sub_handle; + struct msi_desc *msidesc; + unsigned int irq_want; + +#ifdef CONFIG_INTR_REMAP + struct intel_iommu *iommu = 0; + int index = 0; +#endif + + irq_want = nr_irqs_gsi; + sub_handle = 0; + list_for_each_entry(msidesc, &dev->msi_list, list) { + irq = create_irq_nr(irq_want); + if (irq == 0) + return -1; + irq_want = irq + 1; +#ifdef CONFIG_INTR_REMAP + if (!intr_remapping_enabled) + goto no_ir; + + if (!sub_handle) { + /* + * allocate the consecutive block of IRTE's + * for 'nvec' + */ + index = msi_alloc_irte(dev, irq, nvec); + if (index < 0) { + ret = index; + goto error; + } + } else { + iommu = map_dev_to_ir(dev); + if (!iommu) { + ret = -ENOENT; + goto error; + } + /* + * setup the mapping between the irq and the IRTE + * base index, the sub_handle pointing to the + * appropriate interrupt remap table entry. + */ + set_irte_irq(irq, iommu, index, sub_handle); + } +no_ir: +#endif + ret = setup_msi_irq(dev, msidesc, irq); + if (ret < 0) + goto error; + sub_handle++; + } + return 0; + +error: + destroy_irq(irq); + return ret; +} + +void arch_teardown_msi_irq(unsigned int irq) +{ + destroy_irq(irq); +} + +#ifdef CONFIG_DMAR +#ifdef CONFIG_SMP +static void dmar_msi_set_affinity(unsigned int irq, const struct cpumask *mask) +{ + struct irq_desc *desc = irq_to_desc(irq); + struct irq_cfg *cfg; + struct msi_msg msg; + unsigned int dest; + + dest = set_desc_affinity(desc, mask); + if (dest == BAD_APICID) + return; + + cfg = desc->chip_data; + + dmar_msi_read(irq, &msg); + + msg.data &= ~MSI_DATA_VECTOR_MASK; + msg.data |= MSI_DATA_VECTOR(cfg->vector); + msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK; + msg.address_lo |= MSI_ADDR_DEST_ID(dest); + + dmar_msi_write(irq, &msg); +} + +#endif /* CONFIG_SMP */ + +struct irq_chip dmar_msi_type = { + .name = "DMAR_MSI", + .unmask = dmar_msi_unmask, + .mask = dmar_msi_mask, + .ack = ack_apic_edge, +#ifdef CONFIG_SMP + .set_affinity = dmar_msi_set_affinity, +#endif + .retrigger = ioapic_retrigger_irq, +}; + +int arch_setup_dmar_msi(unsigned int irq) +{ + int ret; + struct msi_msg msg; + + ret = msi_compose_msg(NULL, irq, &msg); + if (ret < 0) + return ret; + dmar_msi_write(irq, &msg); + set_irq_chip_and_handler_name(irq, &dmar_msi_type, handle_edge_irq, + "edge"); + return 0; +} +#endif + +#ifdef CONFIG_HPET_TIMER + +#ifdef CONFIG_SMP +static void hpet_msi_set_affinity(unsigned int irq, const struct cpumask *mask) +{ + struct irq_desc *desc = irq_to_desc(irq); + struct irq_cfg *cfg; + struct msi_msg msg; + unsigned int dest; + + dest = set_desc_affinity(desc, mask); + if (dest == BAD_APICID) + return; + + cfg = desc->chip_data; + + hpet_msi_read(irq, &msg); + + msg.data &= ~MSI_DATA_VECTOR_MASK; + msg.data |= MSI_DATA_VECTOR(cfg->vector); + msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK; + msg.address_lo |= MSI_ADDR_DEST_ID(dest); + + hpet_msi_write(irq, &msg); +} + +#endif /* CONFIG_SMP */ + +struct irq_chip hpet_msi_type = { + .name = "HPET_MSI", + .unmask = hpet_msi_unmask, + .mask = hpet_msi_mask, + .ack = ack_apic_edge, +#ifdef CONFIG_SMP + .set_affinity = hpet_msi_set_affinity, +#endif + .retrigger = ioapic_retrigger_irq, +}; + +int arch_setup_hpet_msi(unsigned int irq) +{ + int ret; + struct msi_msg msg; + + ret = msi_compose_msg(NULL, irq, &msg); + if (ret < 0) + return ret; + + hpet_msi_write(irq, &msg); + set_irq_chip_and_handler_name(irq, &hpet_msi_type, handle_edge_irq, + "edge"); + + return 0; +} +#endif + +#endif /* CONFIG_PCI_MSI */ +/* + * Hypertransport interrupt support + */ +#ifdef CONFIG_HT_IRQ + +#ifdef CONFIG_SMP + +static void target_ht_irq(unsigned int irq, unsigned int dest, u8 vector) +{ + struct ht_irq_msg msg; + fetch_ht_irq_msg(irq, &msg); + + msg.address_lo &= ~(HT_IRQ_LOW_VECTOR_MASK | HT_IRQ_LOW_DEST_ID_MASK); + msg.address_hi &= ~(HT_IRQ_HIGH_DEST_ID_MASK); + + msg.address_lo |= HT_IRQ_LOW_VECTOR(vector) | HT_IRQ_LOW_DEST_ID(dest); + msg.address_hi |= HT_IRQ_HIGH_DEST_ID(dest); + + write_ht_irq_msg(irq, &msg); +} + +static void set_ht_irq_affinity(unsigned int irq, const struct cpumask *mask) +{ + struct irq_desc *desc = irq_to_desc(irq); + struct irq_cfg *cfg; + unsigned int dest; + + dest = set_desc_affinity(desc, mask); + if (dest == BAD_APICID) + return; + + cfg = desc->chip_data; + + target_ht_irq(irq, dest, cfg->vector); +} + +#endif + +static struct irq_chip ht_irq_chip = { + .name = "PCI-HT", + .mask = mask_ht_irq, + .unmask = unmask_ht_irq, + .ack = ack_apic_edge, +#ifdef CONFIG_SMP + .set_affinity = set_ht_irq_affinity, +#endif + .retrigger = ioapic_retrigger_irq, +}; + +int arch_setup_ht_irq(unsigned int irq, struct pci_dev *dev) +{ + struct irq_cfg *cfg; + int err; + + if (disable_apic) + return -ENXIO; + + cfg = irq_cfg(irq); + err = assign_irq_vector(irq, cfg, apic->target_cpus()); + if (!err) { + struct ht_irq_msg msg; + unsigned dest; + + dest = apic->cpu_mask_to_apicid_and(cfg->domain, + apic->target_cpus()); + + msg.address_hi = HT_IRQ_HIGH_DEST_ID(dest); + + msg.address_lo = + HT_IRQ_LOW_BASE | + HT_IRQ_LOW_DEST_ID(dest) | + HT_IRQ_LOW_VECTOR(cfg->vector) | + ((apic->irq_dest_mode == 0) ? + HT_IRQ_LOW_DM_PHYSICAL : + HT_IRQ_LOW_DM_LOGICAL) | + HT_IRQ_LOW_RQEOI_EDGE | + ((apic->irq_delivery_mode != dest_LowestPrio) ? + HT_IRQ_LOW_MT_FIXED : + HT_IRQ_LOW_MT_ARBITRATED) | + HT_IRQ_LOW_IRQ_MASKED; + + write_ht_irq_msg(irq, &msg); + + set_irq_chip_and_handler_name(irq, &ht_irq_chip, + handle_edge_irq, "edge"); + + dev_printk(KERN_DEBUG, &dev->dev, "irq %d for HT\n", irq); + } + return err; +} +#endif /* CONFIG_HT_IRQ */ + +#ifdef CONFIG_X86_UV +/* + * Re-target the irq to the specified CPU and enable the specified MMR located + * on the specified blade to allow the sending of MSIs to the specified CPU. + */ +int arch_enable_uv_irq(char *irq_name, unsigned int irq, int cpu, int mmr_blade, + unsigned long mmr_offset) +{ + const struct cpumask *eligible_cpu = cpumask_of(cpu); + struct irq_cfg *cfg; + int mmr_pnode; + unsigned long mmr_value; + struct uv_IO_APIC_route_entry *entry; + unsigned long flags; + int err; + + cfg = irq_cfg(irq); + + err = assign_irq_vector(irq, cfg, eligible_cpu); + if (err != 0) + return err; + + spin_lock_irqsave(&vector_lock, flags); + set_irq_chip_and_handler_name(irq, &uv_irq_chip, handle_percpu_irq, + irq_name); + spin_unlock_irqrestore(&vector_lock, flags); + + mmr_value = 0; + entry = (struct uv_IO_APIC_route_entry *)&mmr_value; + BUG_ON(sizeof(struct uv_IO_APIC_route_entry) != sizeof(unsigned long)); + + entry->vector = cfg->vector; + entry->delivery_mode = apic->irq_delivery_mode; + entry->dest_mode = apic->irq_dest_mode; + entry->polarity = 0; + entry->trigger = 0; + entry->mask = 0; + entry->dest = apic->cpu_mask_to_apicid(eligible_cpu); + + mmr_pnode = uv_blade_to_pnode(mmr_blade); + uv_write_global_mmr64(mmr_pnode, mmr_offset, mmr_value); + + return irq; +} + +/* + * Disable the specified MMR located on the specified blade so that MSIs are + * longer allowed to be sent. + */ +void arch_disable_uv_irq(int mmr_blade, unsigned long mmr_offset) +{ + unsigned long mmr_value; + struct uv_IO_APIC_route_entry *entry; + int mmr_pnode; + + mmr_value = 0; + entry = (struct uv_IO_APIC_route_entry *)&mmr_value; + BUG_ON(sizeof(struct uv_IO_APIC_route_entry) != sizeof(unsigned long)); + + entry->mask = 1; + + mmr_pnode = uv_blade_to_pnode(mmr_blade); + uv_write_global_mmr64(mmr_pnode, mmr_offset, mmr_value); +} +#endif /* CONFIG_X86_64 */ + +int __init io_apic_get_redir_entries (int ioapic) +{ + union IO_APIC_reg_01 reg_01; + unsigned long flags; + + spin_lock_irqsave(&ioapic_lock, flags); + reg_01.raw = io_apic_read(ioapic, 1); + spin_unlock_irqrestore(&ioapic_lock, flags); + + return reg_01.bits.entries; +} + +void __init probe_nr_irqs_gsi(void) +{ + int nr = 0; + + nr = acpi_probe_gsi(); + if (nr > nr_irqs_gsi) { + nr_irqs_gsi = nr; + } else { + /* for acpi=off or acpi is not compiled in */ + int idx; + + nr = 0; + for (idx = 0; idx < nr_ioapics; idx++) + nr += io_apic_get_redir_entries(idx) + 1; + + if (nr > nr_irqs_gsi) + nr_irqs_gsi = nr; + } + + printk(KERN_DEBUG "nr_irqs_gsi: %d\n", nr_irqs_gsi); +} + +#ifdef CONFIG_SPARSE_IRQ +int __init arch_probe_nr_irqs(void) +{ + int nr; + + if (nr_irqs > (NR_VECTORS * nr_cpu_ids)) + nr_irqs = NR_VECTORS * nr_cpu_ids; + + nr = nr_irqs_gsi + 8 * nr_cpu_ids; +#if defined(CONFIG_PCI_MSI) || defined(CONFIG_HT_IRQ) + /* + * for MSI and HT dyn irq + */ + nr += nr_irqs_gsi * 16; +#endif + if (nr < nr_irqs) + nr_irqs = nr; + + return 0; +} +#endif + +/* -------------------------------------------------------------------------- + ACPI-based IOAPIC Configuration + -------------------------------------------------------------------------- */ + +#ifdef CONFIG_ACPI + +#ifdef CONFIG_X86_32 +int __init io_apic_get_unique_id(int ioapic, int apic_id) +{ + union IO_APIC_reg_00 reg_00; + static physid_mask_t apic_id_map = PHYSID_MASK_NONE; + physid_mask_t tmp; + unsigned long flags; + int i = 0; + + /* + * The P4 platform supports up to 256 APIC IDs on two separate APIC + * buses (one for LAPICs, one for IOAPICs), where predecessors only + * supports up to 16 on one shared APIC bus. + * + * TBD: Expand LAPIC/IOAPIC support on P4-class systems to take full + * advantage of new APIC bus architecture. + */ + + if (physids_empty(apic_id_map)) + apic_id_map = apic->ioapic_phys_id_map(phys_cpu_present_map); + + spin_lock_irqsave(&ioapic_lock, flags); + reg_00.raw = io_apic_read(ioapic, 0); + spin_unlock_irqrestore(&ioapic_lock, flags); + + if (apic_id >= get_physical_broadcast()) { + printk(KERN_WARNING "IOAPIC[%d]: Invalid apic_id %d, trying " + "%d\n", ioapic, apic_id, reg_00.bits.ID); + apic_id = reg_00.bits.ID; + } + + /* + * Every APIC in a system must have a unique ID or we get lots of nice + * 'stuck on smp_invalidate_needed IPI wait' messages. + */ + if (apic->check_apicid_used(apic_id_map, apic_id)) { + + for (i = 0; i < get_physical_broadcast(); i++) { + if (!apic->check_apicid_used(apic_id_map, i)) + break; + } + + if (i == get_physical_broadcast()) + panic("Max apic_id exceeded!\n"); + + printk(KERN_WARNING "IOAPIC[%d]: apic_id %d already used, " + "trying %d\n", ioapic, apic_id, i); + + apic_id = i; + } + + tmp = apic->apicid_to_cpu_present(apic_id); + physids_or(apic_id_map, apic_id_map, tmp); + + if (reg_00.bits.ID != apic_id) { + reg_00.bits.ID = apic_id; + + spin_lock_irqsave(&ioapic_lock, flags); + io_apic_write(ioapic, 0, reg_00.raw); + reg_00.raw = io_apic_read(ioapic, 0); + spin_unlock_irqrestore(&ioapic_lock, flags); + + /* Sanity check */ + if (reg_00.bits.ID != apic_id) { + printk("IOAPIC[%d]: Unable to change apic_id!\n", ioapic); + return -1; + } + } + + apic_printk(APIC_VERBOSE, KERN_INFO + "IOAPIC[%d]: Assigned apic_id %d\n", ioapic, apic_id); + + return apic_id; +} + +int __init io_apic_get_version(int ioapic) +{ + union IO_APIC_reg_01 reg_01; + unsigned long flags; + + spin_lock_irqsave(&ioapic_lock, flags); + reg_01.raw = io_apic_read(ioapic, 1); + spin_unlock_irqrestore(&ioapic_lock, flags); + + return reg_01.bits.version; +} +#endif + +int io_apic_set_pci_routing (int ioapic, int pin, int irq, int triggering, int polarity) +{ + struct irq_desc *desc; + struct irq_cfg *cfg; + int cpu = boot_cpu_id; + + if (!IO_APIC_IRQ(irq)) { + apic_printk(APIC_QUIET,KERN_ERR "IOAPIC[%d]: Invalid reference to IRQ 0\n", + ioapic); + return -EINVAL; + } + + desc = irq_to_desc_alloc_cpu(irq, cpu); + if (!desc) { + printk(KERN_INFO "can not get irq_desc %d\n", irq); + return 0; + } + + /* + * IRQs < 16 are already in the irq_2_pin[] map + */ + if (irq >= NR_IRQS_LEGACY) { + cfg = desc->chip_data; + add_pin_to_irq_cpu(cfg, cpu, ioapic, pin); + } + + setup_IO_APIC_irq(ioapic, pin, irq, desc, triggering, polarity); + + return 0; +} + + +int acpi_get_override_irq(int bus_irq, int *trigger, int *polarity) +{ + int i; + + if (skip_ioapic_setup) + return -1; + + for (i = 0; i < mp_irq_entries; i++) + if (mp_irqs[i].irqtype == mp_INT && + mp_irqs[i].srcbusirq == bus_irq) + break; + if (i >= mp_irq_entries) + return -1; + + *trigger = irq_trigger(i); + *polarity = irq_polarity(i); + return 0; +} + +#endif /* CONFIG_ACPI */ + +/* + * This function currently is only a helper for the i386 smp boot process where + * we need to reprogram the ioredtbls to cater for the cpus which have come online + * so mask in all cases should simply be apic->target_cpus() + */ +#ifdef CONFIG_SMP +void __init setup_ioapic_dest(void) +{ + int pin, ioapic, irq, irq_entry; + struct irq_desc *desc; + struct irq_cfg *cfg; + const struct cpumask *mask; + + if (skip_ioapic_setup == 1) + return; + + for (ioapic = 0; ioapic < nr_ioapics; ioapic++) { + for (pin = 0; pin < nr_ioapic_registers[ioapic]; pin++) { + irq_entry = find_irq_entry(ioapic, pin, mp_INT); + if (irq_entry == -1) + continue; + irq = pin_2_irq(irq_entry, ioapic, pin); + + /* setup_IO_APIC_irqs could fail to get vector for some device + * when you have too many devices, because at that time only boot + * cpu is online. + */ + desc = irq_to_desc(irq); + cfg = desc->chip_data; + if (!cfg->vector) { + setup_IO_APIC_irq(ioapic, pin, irq, desc, + irq_trigger(irq_entry), + irq_polarity(irq_entry)); + continue; + + } + + /* + * Honour affinities which have been set in early boot + */ + if (desc->status & + (IRQ_NO_BALANCING | IRQ_AFFINITY_SET)) + mask = desc->affinity; + else + mask = apic->target_cpus(); + +#ifdef CONFIG_INTR_REMAP + if (intr_remapping_enabled) + set_ir_ioapic_affinity_irq_desc(desc, mask); + else +#endif + set_ioapic_affinity_irq_desc(desc, mask); + } + + } +} +#endif + +#define IOAPIC_RESOURCE_NAME_SIZE 11 + +static struct resource *ioapic_resources; + +static struct resource * __init ioapic_setup_resources(void) +{ + unsigned long n; + struct resource *res; + char *mem; + int i; + + if (nr_ioapics <= 0) + return NULL; + + n = IOAPIC_RESOURCE_NAME_SIZE + sizeof(struct resource); + n *= nr_ioapics; + + mem = alloc_bootmem(n); + res = (void *)mem; + + if (mem != NULL) { + mem += sizeof(struct resource) * nr_ioapics; + + for (i = 0; i < nr_ioapics; i++) { + res[i].name = mem; + res[i].flags = IORESOURCE_MEM | IORESOURCE_BUSY; + sprintf(mem, "IOAPIC %u", i); + mem += IOAPIC_RESOURCE_NAME_SIZE; + } + } + + ioapic_resources = res; + + return res; +} + +void __init ioapic_init_mappings(void) +{ + unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0; + struct resource *ioapic_res; + int i; + + ioapic_res = ioapic_setup_resources(); + for (i = 0; i < nr_ioapics; i++) { + if (smp_found_config) { + ioapic_phys = mp_ioapics[i].apicaddr; +#ifdef CONFIG_X86_32 + if (!ioapic_phys) { + printk(KERN_ERR + "WARNING: bogus zero IO-APIC " + "address found in MPTABLE, " + "disabling IO/APIC support!\n"); + smp_found_config = 0; + skip_ioapic_setup = 1; + goto fake_ioapic_page; + } +#endif + } else { +#ifdef CONFIG_X86_32 +fake_ioapic_page: +#endif + ioapic_phys = (unsigned long) + alloc_bootmem_pages(PAGE_SIZE); + ioapic_phys = __pa(ioapic_phys); + } + set_fixmap_nocache(idx, ioapic_phys); + apic_printk(APIC_VERBOSE, + "mapped IOAPIC to %08lx (%08lx)\n", + __fix_to_virt(idx), ioapic_phys); + idx++; + + if (ioapic_res != NULL) { + ioapic_res->start = ioapic_phys; + ioapic_res->end = ioapic_phys + (4 * 1024) - 1; + ioapic_res++; + } + } +} + +static int __init ioapic_insert_resources(void) +{ + int i; + struct resource *r = ioapic_resources; + + if (!r) { + printk(KERN_ERR + "IO APIC resources could be not be allocated.\n"); + return -1; + } + + for (i = 0; i < nr_ioapics; i++) { + insert_resource(&iomem_resource, r); + r++; + } + + return 0; +} + +/* Insert the IO APIC resources after PCI initialization has occured to handle + * IO APICS that are mapped in on a BAR in PCI space. */ +late_initcall(ioapic_insert_resources); Index: linux-2.6-tip/arch/x86/kernel/apic/ipi.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/apic/ipi.c @@ -0,0 +1,164 @@ +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +void default_send_IPI_mask_sequence_phys(const struct cpumask *mask, int vector) +{ + unsigned long query_cpu; + unsigned long flags; + + /* + * Hack. The clustered APIC addressing mode doesn't allow us to send + * to an arbitrary mask, so I do a unicast to each CPU instead. + * - mbligh + */ + local_irq_save(flags); + for_each_cpu(query_cpu, mask) { + __default_send_IPI_dest_field(per_cpu(x86_cpu_to_apicid, + query_cpu), vector, APIC_DEST_PHYSICAL); + } + local_irq_restore(flags); +} + +void default_send_IPI_mask_allbutself_phys(const struct cpumask *mask, + int vector) +{ + unsigned int this_cpu = smp_processor_id(); + unsigned int query_cpu; + unsigned long flags; + + /* See Hack comment above */ + + local_irq_save(flags); + for_each_cpu(query_cpu, mask) { + if (query_cpu == this_cpu) + continue; + __default_send_IPI_dest_field(per_cpu(x86_cpu_to_apicid, + query_cpu), vector, APIC_DEST_PHYSICAL); + } + local_irq_restore(flags); +} + +void default_send_IPI_mask_sequence_logical(const struct cpumask *mask, + int vector) +{ + unsigned long flags; + unsigned int query_cpu; + + /* + * Hack. The clustered APIC addressing mode doesn't allow us to send + * to an arbitrary mask, so I do a unicasts to each CPU instead. This + * should be modified to do 1 message per cluster ID - mbligh + */ + + local_irq_save(flags); + for_each_cpu(query_cpu, mask) + __default_send_IPI_dest_field( + apic->cpu_to_logical_apicid(query_cpu), vector, + apic->dest_logical); + local_irq_restore(flags); +} + +void default_send_IPI_mask_allbutself_logical(const struct cpumask *mask, + int vector) +{ + unsigned long flags; + unsigned int query_cpu; + unsigned int this_cpu = smp_processor_id(); + + /* See Hack comment above */ + + local_irq_save(flags); + for_each_cpu(query_cpu, mask) { + if (query_cpu == this_cpu) + continue; + __default_send_IPI_dest_field( + apic->cpu_to_logical_apicid(query_cpu), vector, + apic->dest_logical); + } + local_irq_restore(flags); +} + +#ifdef CONFIG_X86_32 + +/* + * This is only used on smaller machines. + */ +void default_send_IPI_mask_logical(const struct cpumask *cpumask, int vector) +{ + unsigned long mask = cpumask_bits(cpumask)[0]; + unsigned long flags; + + local_irq_save(flags); + WARN_ON(mask & ~cpumask_bits(cpu_online_mask)[0]); + __default_send_IPI_dest_field(mask, vector, apic->dest_logical); + local_irq_restore(flags); +} + +void default_send_IPI_allbutself(int vector) +{ + /* + * if there are no other CPUs in the system then we get an APIC send + * error if we try to broadcast, thus avoid sending IPIs in this case. + */ + if (!(num_online_cpus() > 1)) + return; + + __default_local_send_IPI_allbutself(vector); +} + +void default_send_IPI_all(int vector) +{ + __default_local_send_IPI_all(vector); +} + +void default_send_IPI_self(int vector) +{ + __default_send_IPI_shortcut(APIC_DEST_SELF, vector, apic->dest_logical); +} + +/* must come after the send_IPI functions above for inlining */ +static int convert_apicid_to_cpu(int apic_id) +{ + int i; + + for_each_possible_cpu(i) { + if (per_cpu(x86_cpu_to_apicid, i) == apic_id) + return i; + } + return -1; +} + +int safe_smp_processor_id(void) +{ + int apicid, cpuid; + + if (!boot_cpu_has(X86_FEATURE_APIC)) + return 0; + + apicid = hard_smp_processor_id(); + if (apicid == BAD_APICID) + return 0; + + cpuid = convert_apicid_to_cpu(apicid); + + return cpuid >= 0 ? cpuid : 0; +} +#endif Index: linux-2.6-tip/arch/x86/kernel/apic/nmi.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/apic/nmi.c @@ -0,0 +1,566 @@ +/* + * NMI watchdog support on APIC systems + * + * Started by Ingo Molnar + * + * Fixes: + * Mikael Pettersson : AMD K7 support for local APIC NMI watchdog. + * Mikael Pettersson : Power Management for local APIC NMI watchdog. + * Mikael Pettersson : Pentium 4 support for local APIC NMI watchdog. + * Pavel Machek and + * Mikael Pettersson : PM converted to driver model. Disable/enable API. + */ + +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +#include + +#include + +int unknown_nmi_panic; +int nmi_watchdog_enabled; + +static cpumask_t backtrace_mask = CPU_MASK_NONE; + +/* nmi_active: + * >0: the lapic NMI watchdog is active, but can be disabled + * <0: the lapic NMI watchdog has not been set up, and cannot + * be enabled + * 0: the lapic NMI watchdog is disabled, but can be enabled + */ +atomic_t nmi_active = ATOMIC_INIT(0); /* oprofile uses this */ +EXPORT_SYMBOL(nmi_active); + +unsigned int nmi_watchdog = NMI_NONE; +EXPORT_SYMBOL(nmi_watchdog); + +static int panic_on_timeout; + +static unsigned int nmi_hz = HZ; +static DEFINE_PER_CPU(short, wd_enabled); +static int endflag __initdata; + +static inline unsigned int get_nmi_count(int cpu) +{ + return per_cpu(irq_stat, cpu).__nmi_count; +} + +static inline int mce_in_progress(void) +{ +#if defined(CONFIG_X86_64) && defined(CONFIG_X86_MCE) + return atomic_read(&mce_entry) > 0; +#endif + return 0; +} + +/* + * Take the local apic timer and PIT/HPET into account. We don't + * know which one is active, when we have highres/dyntick on + */ +static inline unsigned int get_timer_irqs(int cpu) +{ + return per_cpu(irq_stat, cpu).apic_timer_irqs + + per_cpu(irq_stat, cpu).irq0_irqs; +} + +#ifdef CONFIG_SMP +/* + * The performance counters used by NMI_LOCAL_APIC don't trigger when + * the CPU is idle. To make sure the NMI watchdog really ticks on all + * CPUs during the test make them busy. + */ +static __init void nmi_cpu_busy(void *data) +{ +#ifndef CONFIG_PREEMPT_RT + local_irq_enable_in_hardirq(); +#endif + /* + * Intentionally don't use cpu_relax here. This is + * to make sure that the performance counter really ticks, + * even if there is a simulator or similar that catches the + * pause instruction. On a real HT machine this is fine because + * all other CPUs are busy with "useless" delay loops and don't + * care if they get somewhat less cycles. + */ + while (endflag == 0) + mb(); +} +#endif + +static void report_broken_nmi(int cpu, int *prev_nmi_count) +{ + printk(KERN_CONT "\n"); + + printk(KERN_WARNING + "WARNING: CPU#%d: NMI appears to be stuck (%d->%d)!\n", + cpu, prev_nmi_count[cpu], get_nmi_count(cpu)); + + printk(KERN_WARNING + "Please report this to bugzilla.kernel.org,\n"); + printk(KERN_WARNING + "and attach the output of the 'dmesg' command.\n"); + + per_cpu(wd_enabled, cpu) = 0; + atomic_dec(&nmi_active); +} + +static void __acpi_nmi_disable(void *__unused) +{ + apic_write(APIC_LVT0, APIC_DM_NMI | APIC_LVT_MASKED); +} + +int __init check_nmi_watchdog(void) +{ + unsigned int *prev_nmi_count; + int cpu; + + if (!nmi_watchdog_active() || !atomic_read(&nmi_active)) + return 0; + + prev_nmi_count = kmalloc(nr_cpu_ids * sizeof(int), GFP_KERNEL); + if (!prev_nmi_count) + goto error; + + printk(KERN_INFO "Testing NMI watchdog ... "); + +#ifdef CONFIG_SMP + if (nmi_watchdog == NMI_LOCAL_APIC) + smp_call_function(nmi_cpu_busy, (void *)&endflag, 0); +#endif + + for_each_possible_cpu(cpu) + prev_nmi_count[cpu] = get_nmi_count(cpu); + local_irq_enable(); + mdelay((20 * 1000) / nmi_hz); /* wait 20 ticks */ + + for_each_online_cpu(cpu) { + if (!per_cpu(wd_enabled, cpu)) + continue; + if (get_nmi_count(cpu) - prev_nmi_count[cpu] <= 5) + report_broken_nmi(cpu, prev_nmi_count); + } + endflag = 1; + if (!atomic_read(&nmi_active)) { + kfree(prev_nmi_count); + atomic_set(&nmi_active, -1); + goto error; + } + printk("OK.\n"); + + /* + * now that we know it works we can reduce NMI frequency to + * something more reasonable; makes a difference in some configs + */ + if (nmi_watchdog == NMI_LOCAL_APIC) + nmi_hz = lapic_adjust_nmi_hz(1); + + kfree(prev_nmi_count); + return 0; +error: + if (nmi_watchdog == NMI_IO_APIC) { + if (!timer_through_8259) + disable_8259A_irq(0); + on_each_cpu(__acpi_nmi_disable, NULL, 1); + } + +#ifdef CONFIG_X86_32 + timer_ack = 0; +#endif + return -1; +} + +static int __init setup_nmi_watchdog(char *str) +{ + unsigned int nmi; + + if (!strncmp(str, "panic", 5)) { + panic_on_timeout = 1; + str = strchr(str, ','); + if (!str) + return 1; + ++str; + } + + if (!strncmp(str, "lapic", 5)) + nmi_watchdog = NMI_LOCAL_APIC; + else if (!strncmp(str, "ioapic", 6)) + nmi_watchdog = NMI_IO_APIC; + else { + get_option(&str, &nmi); + if (nmi >= NMI_INVALID) + return 0; + nmi_watchdog = nmi; + } + + return 1; +} +__setup("nmi_watchdog=", setup_nmi_watchdog); + +/* + * Suspend/resume support + */ +#ifdef CONFIG_PM + +static int nmi_pm_active; /* nmi_active before suspend */ + +static int lapic_nmi_suspend(struct sys_device *dev, pm_message_t state) +{ + /* only CPU0 goes here, other CPUs should be offline */ + nmi_pm_active = atomic_read(&nmi_active); + stop_apic_nmi_watchdog(NULL); + BUG_ON(atomic_read(&nmi_active) != 0); + return 0; +} + +static int lapic_nmi_resume(struct sys_device *dev) +{ + /* only CPU0 goes here, other CPUs should be offline */ + if (nmi_pm_active > 0) { + setup_apic_nmi_watchdog(NULL); + touch_nmi_watchdog(); + } + return 0; +} + +static struct sysdev_class nmi_sysclass = { + .name = "lapic_nmi", + .resume = lapic_nmi_resume, + .suspend = lapic_nmi_suspend, +}; + +static struct sys_device device_lapic_nmi = { + .id = 0, + .cls = &nmi_sysclass, +}; + +static int __init init_lapic_nmi_sysfs(void) +{ + int error; + + /* + * should really be a BUG_ON but b/c this is an + * init call, it just doesn't work. -dcz + */ + if (nmi_watchdog != NMI_LOCAL_APIC) + return 0; + + if (atomic_read(&nmi_active) < 0) + return 0; + + error = sysdev_class_register(&nmi_sysclass); + if (!error) + error = sysdev_register(&device_lapic_nmi); + return error; +} + +/* must come after the local APIC's device_initcall() */ +late_initcall(init_lapic_nmi_sysfs); + +#endif /* CONFIG_PM */ + +static void __acpi_nmi_enable(void *__unused) +{ + apic_write(APIC_LVT0, APIC_DM_NMI); +} + +/* + * Enable timer based NMIs on all CPUs: + */ +void acpi_nmi_enable(void) +{ + if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC) + on_each_cpu(__acpi_nmi_enable, NULL, 1); +} + +/* + * Disable timer based NMIs on all CPUs: + */ +void acpi_nmi_disable(void) +{ + if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC) + on_each_cpu(__acpi_nmi_disable, NULL, 1); +} + +/* + * This function is called as soon the LAPIC NMI watchdog driver has everything + * in place and it's ready to check if the NMIs belong to the NMI watchdog + */ +void cpu_nmi_set_wd_enabled(void) +{ + __get_cpu_var(wd_enabled) = 1; +} + +void setup_apic_nmi_watchdog(void *unused) +{ + if (__get_cpu_var(wd_enabled)) + return; + + /* cheap hack to support suspend/resume */ + /* if cpu0 is not active neither should the other cpus */ + if (smp_processor_id() != 0 && atomic_read(&nmi_active) <= 0) + return; + + switch (nmi_watchdog) { + case NMI_LOCAL_APIC: + if (lapic_watchdog_init(nmi_hz) < 0) { + __get_cpu_var(wd_enabled) = 0; + return; + } + /* FALL THROUGH */ + case NMI_IO_APIC: + __get_cpu_var(wd_enabled) = 1; + atomic_inc(&nmi_active); + } +} + +void stop_apic_nmi_watchdog(void *unused) +{ + /* only support LOCAL and IO APICs for now */ + if (!nmi_watchdog_active()) + return; + if (__get_cpu_var(wd_enabled) == 0) + return; + if (nmi_watchdog == NMI_LOCAL_APIC) + lapic_watchdog_stop(); + else + __acpi_nmi_disable(NULL); + __get_cpu_var(wd_enabled) = 0; + atomic_dec(&nmi_active); +} + +/* + * the best way to detect whether a CPU has a 'hard lockup' problem + * is to check it's local APIC timer IRQ counts. If they are not + * changing then that CPU has some problem. + * + * as these watchdog NMI IRQs are generated on every CPU, we only + * have to check the current processor. + * + * since NMIs don't listen to _any_ locks, we have to be extremely + * careful not to rely on unsafe variables. The printk might lock + * up though, so we have to break up any console locks first ... + * [when there will be more tty-related locks, break them up here too!] + */ + +static DEFINE_PER_CPU(unsigned, last_irq_sum); +static DEFINE_PER_CPU(local_t, alert_counter); +static DEFINE_PER_CPU(int, nmi_touch); + +void touch_nmi_watchdog(void) +{ + if (nmi_watchdog_active()) { + unsigned cpu; + + /* + * Tell other CPUs to reset their alert counters. We cannot + * do it ourselves because the alert count increase is not + * atomic. + */ + for_each_present_cpu(cpu) { + if (per_cpu(nmi_touch, cpu) != 1) + per_cpu(nmi_touch, cpu) = 1; + } + } + + /* + * Tickle the softlockup detector too: + */ + touch_softlockup_watchdog(); +} +EXPORT_SYMBOL(touch_nmi_watchdog); + +notrace __kprobes int +nmi_watchdog_tick(struct pt_regs *regs, unsigned reason) +{ + /* + * Since current_thread_info()-> is always on the stack, and we + * always switch the stack NMI-atomically, it's safe to use + * smp_processor_id(). + */ + unsigned int sum; + int touched = 0; + int cpu = smp_processor_id(); + int rc = 0; + + /* check for other users first */ + if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT) + == NOTIFY_STOP) { + rc = 1; + touched = 1; + } + + sum = get_timer_irqs(cpu); + + if (__get_cpu_var(nmi_touch)) { + __get_cpu_var(nmi_touch) = 0; + touched = 1; + } + + if (cpu_isset(cpu, backtrace_mask)) { + static DEFINE_RAW_SPINLOCK(lock); /* Serialise the printks */ + + spin_lock(&lock); + printk(KERN_WARNING "NMI backtrace for cpu %d\n", cpu); + dump_stack(); + spin_unlock(&lock); + cpu_clear(cpu, backtrace_mask); + } + + /* Could check oops_in_progress here too, but it's safer not to */ + if (mce_in_progress()) + touched = 1; + + /* if the none of the timers isn't firing, this cpu isn't doing much */ + if (!touched && __get_cpu_var(last_irq_sum) == sum) { + /* + * Ayiee, looks like this CPU is stuck ... + * wait a few IRQs (5 seconds) before doing the oops ... + */ + local_inc(&__get_cpu_var(alert_counter)); + if (local_read(&__get_cpu_var(alert_counter)) == 5 * nmi_hz) + /* + * die_nmi will return ONLY if NOTIFY_STOP happens.. + */ + die_nmi("BUG: NMI Watchdog detected LOCKUP", + regs, panic_on_timeout); + } else { + __get_cpu_var(last_irq_sum) = sum; + local_set(&__get_cpu_var(alert_counter), 0); + } + + /* see if the nmi watchdog went off */ + if (!__get_cpu_var(wd_enabled)) + return rc; + switch (nmi_watchdog) { + case NMI_LOCAL_APIC: + rc |= lapic_wd_event(nmi_hz); + break; + case NMI_IO_APIC: + /* + * don't know how to accurately check for this. + * just assume it was a watchdog timer interrupt + * This matches the old behaviour. + */ + rc = 1; + break; + } + return rc; +} + +#ifdef CONFIG_SYSCTL + +static void enable_ioapic_nmi_watchdog_single(void *unused) +{ + __get_cpu_var(wd_enabled) = 1; + atomic_inc(&nmi_active); + __acpi_nmi_enable(NULL); +} + +static void enable_ioapic_nmi_watchdog(void) +{ + on_each_cpu(enable_ioapic_nmi_watchdog_single, NULL, 1); + touch_nmi_watchdog(); +} + +static void disable_ioapic_nmi_watchdog(void) +{ + on_each_cpu(stop_apic_nmi_watchdog, NULL, 1); +} + +static int __init setup_unknown_nmi_panic(char *str) +{ + unknown_nmi_panic = 1; + return 1; +} +__setup("unknown_nmi_panic", setup_unknown_nmi_panic); + +static int unknown_nmi_panic_callback(struct pt_regs *regs, int cpu) +{ + unsigned char reason = get_nmi_reason(); + char buf[64]; + + sprintf(buf, "NMI received for unknown reason %02x\n", reason); + die_nmi(buf, regs, 1); /* Always panic here */ + return 0; +} + +/* + * proc handler for /proc/sys/kernel/nmi + */ +int proc_nmi_enabled(struct ctl_table *table, int write, struct file *file, + void __user *buffer, size_t *length, loff_t *ppos) +{ + int old_state; + + nmi_watchdog_enabled = (atomic_read(&nmi_active) > 0) ? 1 : 0; + old_state = nmi_watchdog_enabled; + proc_dointvec(table, write, file, buffer, length, ppos); + if (!!old_state == !!nmi_watchdog_enabled) + return 0; + + if (atomic_read(&nmi_active) < 0 || !nmi_watchdog_active()) { + printk(KERN_WARNING + "NMI watchdog is permanently disabled\n"); + return -EIO; + } + + if (nmi_watchdog == NMI_LOCAL_APIC) { + if (nmi_watchdog_enabled) + enable_lapic_nmi_watchdog(); + else + disable_lapic_nmi_watchdog(); + } else if (nmi_watchdog == NMI_IO_APIC) { + if (nmi_watchdog_enabled) + enable_ioapic_nmi_watchdog(); + else + disable_ioapic_nmi_watchdog(); + } else { + printk(KERN_WARNING + "NMI watchdog doesn't know what hardware to touch\n"); + return -EIO; + } + return 0; +} + +#endif /* CONFIG_SYSCTL */ + +int do_nmi_callback(struct pt_regs *regs, int cpu) +{ +#ifdef CONFIG_SYSCTL + if (unknown_nmi_panic) + return unknown_nmi_panic_callback(regs, cpu); +#endif + return 0; +} + +void __trigger_all_cpu_backtrace(void) +{ + int i; + + backtrace_mask = cpu_online_map; + /* Wait for up to 10 seconds for all CPUs to do the backtrace */ + for (i = 0; i < 10 * 1000; i++) { + if (cpus_empty(backtrace_mask)) + break; + mdelay(1); + } +} Index: linux-2.6-tip/arch/x86/kernel/apic/numaq_32.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/apic/numaq_32.c @@ -0,0 +1,565 @@ +/* + * Written by: Patricia Gaughen, IBM Corporation + * + * Copyright (C) 2002, IBM Corp. + * Copyright (C) 2009, Red Hat, Inc., Ingo Molnar + * + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or + * NON INFRINGEMENT. See the GNU General Public License for more + * details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + * + * Send feedback to + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +#define MB_TO_PAGES(addr) ((addr) << (20 - PAGE_SHIFT)) + +int found_numaq; + +/* + * Have to match translation table entries to main table entries by counter + * hence the mpc_record variable .... can't see a less disgusting way of + * doing this .... + */ +struct mpc_trans { + unsigned char mpc_type; + unsigned char trans_len; + unsigned char trans_type; + unsigned char trans_quad; + unsigned char trans_global; + unsigned char trans_local; + unsigned short trans_reserved; +}; + +/* x86_quirks member */ +static int mpc_record; + +static __cpuinitdata struct mpc_trans *translation_table[MAX_MPC_ENTRY]; + +int mp_bus_id_to_node[MAX_MP_BUSSES]; +int mp_bus_id_to_local[MAX_MP_BUSSES]; +int quad_local_to_mp_bus_id[NR_CPUS/4][4]; + + +static inline void numaq_register_node(int node, struct sys_cfg_data *scd) +{ + struct eachquadmem *eq = scd->eq + node; + + node_set_online(node); + + /* Convert to pages */ + node_start_pfn[node] = + MB_TO_PAGES(eq->hi_shrd_mem_start - eq->priv_mem_size); + + node_end_pfn[node] = + MB_TO_PAGES(eq->hi_shrd_mem_start + eq->hi_shrd_mem_size); + + e820_register_active_regions(node, node_start_pfn[node], + node_end_pfn[node]); + + memory_present(node, node_start_pfn[node], node_end_pfn[node]); + + node_remap_size[node] = node_memmap_size_bytes(node, + node_start_pfn[node], + node_end_pfn[node]); +} + +/* + * Function: smp_dump_qct() + * + * Description: gets memory layout from the quad config table. This + * function also updates node_online_map with the nodes (quads) present. + */ +static void __init smp_dump_qct(void) +{ + struct sys_cfg_data *scd; + int node; + + scd = (void *)__va(SYS_CFG_DATA_PRIV_ADDR); + + nodes_clear(node_online_map); + for_each_node(node) { + if (scd->quads_present31_0 & (1 << node)) + numaq_register_node(node, scd); + } +} + +void __cpuinit numaq_tsc_disable(void) +{ + if (!found_numaq) + return; + + if (num_online_nodes() > 1) { + printk(KERN_DEBUG "NUMAQ: disabling TSC\n"); + setup_clear_cpu_cap(X86_FEATURE_TSC); + } +} + +static int __init numaq_pre_time_init(void) +{ + numaq_tsc_disable(); + return 0; +} + +static inline int generate_logical_apicid(int quad, int phys_apicid) +{ + return (quad << 4) + (phys_apicid ? phys_apicid << 1 : 1); +} + +/* x86_quirks member */ +static int mpc_apic_id(struct mpc_cpu *m) +{ + int quad = translation_table[mpc_record]->trans_quad; + int logical_apicid = generate_logical_apicid(quad, m->apicid); + + printk(KERN_DEBUG + "Processor #%d %u:%u APIC version %d (quad %d, apic %d)\n", + m->apicid, (m->cpufeature & CPU_FAMILY_MASK) >> 8, + (m->cpufeature & CPU_MODEL_MASK) >> 4, + m->apicver, quad, logical_apicid); + + return logical_apicid; +} + +/* x86_quirks member */ +static void mpc_oem_bus_info(struct mpc_bus *m, char *name) +{ + int quad = translation_table[mpc_record]->trans_quad; + int local = translation_table[mpc_record]->trans_local; + + mp_bus_id_to_node[m->busid] = quad; + mp_bus_id_to_local[m->busid] = local; + + printk(KERN_INFO "Bus #%d is %s (node %d)\n", m->busid, name, quad); +} + +/* x86_quirks member */ +static void mpc_oem_pci_bus(struct mpc_bus *m) +{ + int quad = translation_table[mpc_record]->trans_quad; + int local = translation_table[mpc_record]->trans_local; + + quad_local_to_mp_bus_id[quad][local] = m->busid; +} + +static void __init MP_translation_info(struct mpc_trans *m) +{ + printk(KERN_INFO + "Translation: record %d, type %d, quad %d, global %d, local %d\n", + mpc_record, m->trans_type, m->trans_quad, m->trans_global, + m->trans_local); + + if (mpc_record >= MAX_MPC_ENTRY) + printk(KERN_ERR "MAX_MPC_ENTRY exceeded!\n"); + else + translation_table[mpc_record] = m; /* stash this for later */ + + if (m->trans_quad < MAX_NUMNODES && !node_online(m->trans_quad)) + node_set_online(m->trans_quad); +} + +static int __init mpf_checksum(unsigned char *mp, int len) +{ + int sum = 0; + + while (len--) + sum += *mp++; + + return sum & 0xFF; +} + +/* + * Read/parse the MPC oem tables + */ +static void __init + smp_read_mpc_oem(struct mpc_oemtable *oemtable, unsigned short oemsize) +{ + int count = sizeof(*oemtable); /* the header size */ + unsigned char *oemptr = ((unsigned char *)oemtable) + count; + + mpc_record = 0; + printk(KERN_INFO + "Found an OEM MPC table at %8p - parsing it ... \n", oemtable); + + if (memcmp(oemtable->signature, MPC_OEM_SIGNATURE, 4)) { + printk(KERN_WARNING + "SMP mpc oemtable: bad signature [%c%c%c%c]!\n", + oemtable->signature[0], oemtable->signature[1], + oemtable->signature[2], oemtable->signature[3]); + return; + } + + if (mpf_checksum((unsigned char *)oemtable, oemtable->length)) { + printk(KERN_WARNING "SMP oem mptable: checksum error!\n"); + return; + } + + while (count < oemtable->length) { + switch (*oemptr) { + case MP_TRANSLATION: + { + struct mpc_trans *m = (void *)oemptr; + + MP_translation_info(m); + oemptr += sizeof(*m); + count += sizeof(*m); + ++mpc_record; + break; + } + default: + printk(KERN_WARNING + "Unrecognised OEM table entry type! - %d\n", + (int)*oemptr); + return; + } + } +} + +static int __init numaq_setup_ioapic_ids(void) +{ + /* so can skip it */ + return 1; +} + +static int __init numaq_update_apic(void) +{ + apic->wakeup_cpu = wakeup_secondary_cpu_via_nmi; + + return 0; +} + +static struct x86_quirks numaq_x86_quirks __initdata = { + .arch_pre_time_init = numaq_pre_time_init, + .arch_time_init = NULL, + .arch_pre_intr_init = NULL, + .arch_memory_setup = NULL, + .arch_intr_init = NULL, + .arch_trap_init = NULL, + .mach_get_smp_config = NULL, + .mach_find_smp_config = NULL, + .mpc_record = &mpc_record, + .mpc_apic_id = mpc_apic_id, + .mpc_oem_bus_info = mpc_oem_bus_info, + .mpc_oem_pci_bus = mpc_oem_pci_bus, + .smp_read_mpc_oem = smp_read_mpc_oem, + .setup_ioapic_ids = numaq_setup_ioapic_ids, + .update_apic = numaq_update_apic, +}; + +static __init void early_check_numaq(void) +{ + /* + * Find possible boot-time SMP configuration: + */ + early_find_smp_config(); + + /* + * get boot-time SMP configuration: + */ + if (smp_found_config) + early_get_smp_config(); + + if (found_numaq) + x86_quirks = &numaq_x86_quirks; +} + +int __init get_memcfg_numaq(void) +{ + early_check_numaq(); + if (!found_numaq) + return 0; + smp_dump_qct(); + + return 1; +} + +#define NUMAQ_APIC_DFR_VALUE (APIC_DFR_CLUSTER) + +static inline unsigned int numaq_get_apic_id(unsigned long x) +{ + return (x >> 24) & 0x0F; +} + +static inline void numaq_send_IPI_mask(const struct cpumask *mask, int vector) +{ + default_send_IPI_mask_sequence_logical(mask, vector); +} + +static inline void numaq_send_IPI_allbutself(int vector) +{ + default_send_IPI_mask_allbutself_logical(cpu_online_mask, vector); +} + +static inline void numaq_send_IPI_all(int vector) +{ + numaq_send_IPI_mask(cpu_online_mask, vector); +} + +#define NUMAQ_TRAMPOLINE_PHYS_LOW (0x8) +#define NUMAQ_TRAMPOLINE_PHYS_HIGH (0xa) + +/* + * Because we use NMIs rather than the INIT-STARTUP sequence to + * bootstrap the CPUs, the APIC may be in a weird state. Kick it: + */ +static inline void numaq_smp_callin_clear_local_apic(void) +{ + clear_local_APIC(); +} + +static inline const cpumask_t *numaq_target_cpus(void) +{ + return &CPU_MASK_ALL; +} + +static inline unsigned long +numaq_check_apicid_used(physid_mask_t bitmap, int apicid) +{ + return physid_isset(apicid, bitmap); +} + +static inline unsigned long numaq_check_apicid_present(int bit) +{ + return physid_isset(bit, phys_cpu_present_map); +} + +static inline int numaq_apic_id_registered(void) +{ + return 1; +} + +static inline void numaq_init_apic_ldr(void) +{ + /* Already done in NUMA-Q firmware */ +} + +static inline void numaq_setup_apic_routing(void) +{ + printk(KERN_INFO + "Enabling APIC mode: NUMA-Q. Using %d I/O APICs\n", + nr_ioapics); +} + +/* + * Skip adding the timer int on secondary nodes, which causes + * a small but painful rift in the time-space continuum. + */ +static inline int numaq_multi_timer_check(int apic, int irq) +{ + return apic != 0 && irq == 0; +} + +static inline physid_mask_t numaq_ioapic_phys_id_map(physid_mask_t phys_map) +{ + /* We don't have a good way to do this yet - hack */ + return physids_promote(0xFUL); +} + +static inline int numaq_cpu_to_logical_apicid(int cpu) +{ + if (cpu >= nr_cpu_ids) + return BAD_APICID; + return cpu_2_logical_apicid[cpu]; +} + +/* + * Supporting over 60 cpus on NUMA-Q requires a locality-dependent + * cpu to APIC ID relation to properly interact with the intelligent + * mode of the cluster controller. + */ +static inline int numaq_cpu_present_to_apicid(int mps_cpu) +{ + if (mps_cpu < 60) + return ((mps_cpu >> 2) << 4) | (1 << (mps_cpu & 0x3)); + else + return BAD_APICID; +} + +static inline int numaq_apicid_to_node(int logical_apicid) +{ + return logical_apicid >> 4; +} + +static inline physid_mask_t numaq_apicid_to_cpu_present(int logical_apicid) +{ + int node = numaq_apicid_to_node(logical_apicid); + int cpu = __ffs(logical_apicid & 0xf); + + return physid_mask_of_physid(cpu + 4*node); +} + +/* Where the IO area was mapped on multiquad, always 0 otherwise */ +void *xquad_portio; + +static inline int numaq_check_phys_apicid_present(int boot_cpu_physical_apicid) +{ + return 1; +} + +/* + * We use physical apicids here, not logical, so just return the default + * physical broadcast to stop people from breaking us + */ +static inline unsigned int numaq_cpu_mask_to_apicid(const cpumask_t *cpumask) +{ + return 0x0F; +} + +static inline unsigned int +numaq_cpu_mask_to_apicid_and(const struct cpumask *cpumask, + const struct cpumask *andmask) +{ + return 0x0F; +} + +/* No NUMA-Q box has a HT CPU, but it can't hurt to use the default code. */ +static inline int numaq_phys_pkg_id(int cpuid_apic, int index_msb) +{ + return cpuid_apic >> index_msb; +} + +static int +numaq_mps_oem_check(struct mpc_table *mpc, char *oem, char *productid) +{ + if (strncmp(oem, "IBM NUMA", 8)) + printk(KERN_ERR "Warning! Not a NUMA-Q system!\n"); + else + found_numaq = 1; + + return found_numaq; +} + +static int probe_numaq(void) +{ + /* already know from get_memcfg_numaq() */ + return found_numaq; +} + +static void numaq_vector_allocation_domain(int cpu, cpumask_t *retmask) +{ + /* Careful. Some cpus do not strictly honor the set of cpus + * specified in the interrupt destination when using lowest + * priority interrupt delivery mode. + * + * In particular there was a hyperthreading cpu observed to + * deliver interrupts to the wrong hyperthread when only one + * hyperthread was specified in the interrupt desitination. + */ + *retmask = (cpumask_t){ { [0] = APIC_ALL_CPUS, } }; +} + +static void numaq_setup_portio_remap(void) +{ + int num_quads = num_online_nodes(); + + if (num_quads <= 1) + return; + + printk(KERN_INFO + "Remapping cross-quad port I/O for %d quads\n", num_quads); + + xquad_portio = ioremap(XQUAD_PORTIO_BASE, num_quads*XQUAD_PORTIO_QUAD); + + printk(KERN_INFO + "xquad_portio vaddr 0x%08lx, len %08lx\n", + (u_long) xquad_portio, (u_long) num_quads*XQUAD_PORTIO_QUAD); +} + +struct apic apic_numaq = { + + .name = "NUMAQ", + .probe = probe_numaq, + .acpi_madt_oem_check = NULL, + .apic_id_registered = numaq_apic_id_registered, + + .irq_delivery_mode = dest_LowestPrio, + /* physical delivery on LOCAL quad: */ + .irq_dest_mode = 0, + + .target_cpus = numaq_target_cpus, + .disable_esr = 1, + .dest_logical = APIC_DEST_LOGICAL, + .check_apicid_used = numaq_check_apicid_used, + .check_apicid_present = numaq_check_apicid_present, + + .vector_allocation_domain = numaq_vector_allocation_domain, + .init_apic_ldr = numaq_init_apic_ldr, + + .ioapic_phys_id_map = numaq_ioapic_phys_id_map, + .setup_apic_routing = numaq_setup_apic_routing, + .multi_timer_check = numaq_multi_timer_check, + .apicid_to_node = numaq_apicid_to_node, + .cpu_to_logical_apicid = numaq_cpu_to_logical_apicid, + .cpu_present_to_apicid = numaq_cpu_present_to_apicid, + .apicid_to_cpu_present = numaq_apicid_to_cpu_present, + .setup_portio_remap = numaq_setup_portio_remap, + .check_phys_apicid_present = numaq_check_phys_apicid_present, + .enable_apic_mode = NULL, + .phys_pkg_id = numaq_phys_pkg_id, + .mps_oem_check = numaq_mps_oem_check, + + .get_apic_id = numaq_get_apic_id, + .set_apic_id = NULL, + .apic_id_mask = 0x0F << 24, + + .cpu_mask_to_apicid = numaq_cpu_mask_to_apicid, + .cpu_mask_to_apicid_and = numaq_cpu_mask_to_apicid_and, + + .send_IPI_mask = numaq_send_IPI_mask, + .send_IPI_mask_allbutself = NULL, + .send_IPI_allbutself = numaq_send_IPI_allbutself, + .send_IPI_all = numaq_send_IPI_all, + .send_IPI_self = default_send_IPI_self, + + .wakeup_cpu = NULL, + .trampoline_phys_low = NUMAQ_TRAMPOLINE_PHYS_LOW, + .trampoline_phys_high = NUMAQ_TRAMPOLINE_PHYS_HIGH, + + /* We don't do anything here because we use NMI's to boot instead */ + .wait_for_init_deassert = NULL, + + .smp_callin_clear_local_apic = numaq_smp_callin_clear_local_apic, + .inquire_remote_apic = NULL, + + .read = native_apic_mem_read, + .write = native_apic_mem_write, + .icr_read = native_apic_icr_read, + .icr_write = native_apic_icr_write, + .wait_icr_idle = native_apic_wait_icr_idle, + .safe_wait_icr_idle = native_safe_apic_wait_icr_idle, +}; Index: linux-2.6-tip/arch/x86/kernel/apic/probe_32.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/apic/probe_32.c @@ -0,0 +1,295 @@ +/* + * Default generic APIC driver. This handles up to 8 CPUs. + * + * Copyright 2003 Andi Kleen, SuSE Labs. + * Subject to the GNU Public License, v.2 + * + * Generic x86 APIC driver probe layer. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_HOTPLUG_CPU +#define DEFAULT_SEND_IPI (1) +#else +#define DEFAULT_SEND_IPI (0) +#endif + +int no_broadcast = DEFAULT_SEND_IPI; + +static __init int no_ipi_broadcast(char *str) +{ + get_option(&str, &no_broadcast); + pr_info("Using %s mode\n", + no_broadcast ? "No IPI Broadcast" : "IPI Broadcast"); + return 1; +} +__setup("no_ipi_broadcast=", no_ipi_broadcast); + +static int __init print_ipi_mode(void) +{ + pr_info("Using IPI %s mode\n", + no_broadcast ? "No-Shortcut" : "Shortcut"); + return 0; +} +late_initcall(print_ipi_mode); + +void default_setup_apic_routing(void) +{ +#ifdef CONFIG_X86_IO_APIC + printk(KERN_INFO + "Enabling APIC mode: Flat. Using %d I/O APICs\n", + nr_ioapics); +#endif +} + +static void default_vector_allocation_domain(int cpu, struct cpumask *retmask) +{ + /* + * Careful. Some cpus do not strictly honor the set of cpus + * specified in the interrupt destination when using lowest + * priority interrupt delivery mode. + * + * In particular there was a hyperthreading cpu observed to + * deliver interrupts to the wrong hyperthread when only one + * hyperthread was specified in the interrupt desitination. + */ + *retmask = (cpumask_t) { { [0] = APIC_ALL_CPUS } }; +} + +/* should be called last. */ +static int probe_default(void) +{ + return 1; +} + +struct apic apic_default = { + + .name = "default", + .probe = probe_default, + .acpi_madt_oem_check = NULL, + .apic_id_registered = default_apic_id_registered, + + .irq_delivery_mode = dest_LowestPrio, + /* logical delivery broadcast to all CPUs: */ + .irq_dest_mode = 1, + + .target_cpus = default_target_cpus, + .disable_esr = 0, + .dest_logical = APIC_DEST_LOGICAL, + .check_apicid_used = default_check_apicid_used, + .check_apicid_present = default_check_apicid_present, + + .vector_allocation_domain = default_vector_allocation_domain, + .init_apic_ldr = default_init_apic_ldr, + + .ioapic_phys_id_map = default_ioapic_phys_id_map, + .setup_apic_routing = default_setup_apic_routing, + .multi_timer_check = NULL, + .apicid_to_node = default_apicid_to_node, + .cpu_to_logical_apicid = default_cpu_to_logical_apicid, + .cpu_present_to_apicid = default_cpu_present_to_apicid, + .apicid_to_cpu_present = default_apicid_to_cpu_present, + .setup_portio_remap = NULL, + .check_phys_apicid_present = default_check_phys_apicid_present, + .enable_apic_mode = NULL, + .phys_pkg_id = default_phys_pkg_id, + .mps_oem_check = NULL, + + .get_apic_id = default_get_apic_id, + .set_apic_id = NULL, + .apic_id_mask = 0x0F << 24, + + .cpu_mask_to_apicid = default_cpu_mask_to_apicid, + .cpu_mask_to_apicid_and = default_cpu_mask_to_apicid_and, + + .send_IPI_mask = default_send_IPI_mask_logical, + .send_IPI_mask_allbutself = default_send_IPI_mask_allbutself_logical, + .send_IPI_allbutself = default_send_IPI_allbutself, + .send_IPI_all = default_send_IPI_all, + .send_IPI_self = default_send_IPI_self, + + .wakeup_cpu = NULL, + .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW, + .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH, + + .wait_for_init_deassert = default_wait_for_init_deassert, + + .smp_callin_clear_local_apic = NULL, + .inquire_remote_apic = default_inquire_remote_apic, + + .read = native_apic_mem_read, + .write = native_apic_mem_write, + .icr_read = native_apic_icr_read, + .icr_write = native_apic_icr_write, + .wait_icr_idle = native_apic_wait_icr_idle, + .safe_wait_icr_idle = native_safe_apic_wait_icr_idle, +}; + +extern struct apic apic_numaq; +extern struct apic apic_summit; +extern struct apic apic_bigsmp; +extern struct apic apic_es7000; +extern struct apic apic_default; + +struct apic *apic = &apic_default; +EXPORT_SYMBOL_GPL(apic); + +static struct apic *apic_probe[] __initdata = { +#ifdef CONFIG_X86_NUMAQ + &apic_numaq, +#endif +#ifdef CONFIG_X86_SUMMIT + &apic_summit, +#endif +#ifdef CONFIG_X86_BIGSMP + &apic_bigsmp, +#endif +#ifdef CONFIG_X86_ES7000 + &apic_es7000, +#endif + &apic_default, /* must be last */ + NULL, +}; + +static int cmdline_apic __initdata; +static int __init parse_apic(char *arg) +{ + int i; + + if (!arg) + return -EINVAL; + + for (i = 0; apic_probe[i]; i++) { + if (!strcmp(apic_probe[i]->name, arg)) { + apic = apic_probe[i]; + cmdline_apic = 1; + return 0; + } + } + + if (x86_quirks->update_apic) + x86_quirks->update_apic(); + + /* Parsed again by __setup for debug/verbose */ + return 0; +} +early_param("apic", parse_apic); + +void __init generic_bigsmp_probe(void) +{ +#ifdef CONFIG_X86_BIGSMP + /* + * This routine is used to switch to bigsmp mode when + * - There is no apic= option specified by the user + * - generic_apic_probe() has chosen apic_default as the sub_arch + * - we find more than 8 CPUs in acpi LAPIC listing with xAPIC support + */ + + if (!cmdline_apic && apic == &apic_default) { + if (apic_bigsmp.probe()) { + apic = &apic_bigsmp; + if (x86_quirks->update_apic) + x86_quirks->update_apic(); + printk(KERN_INFO "Overriding APIC driver with %s\n", + apic->name); + } + } +#endif +} + +void __init generic_apic_probe(void) +{ + if (!cmdline_apic) { + int i; + for (i = 0; apic_probe[i]; i++) { + if (apic_probe[i]->probe()) { + apic = apic_probe[i]; + break; + } + } + /* Not visible without early console */ + if (!apic_probe[i]) + panic("Didn't find an APIC driver"); + + if (x86_quirks->update_apic) + x86_quirks->update_apic(); + } + printk(KERN_INFO "Using APIC driver %s\n", apic->name); +} + +/* These functions can switch the APIC even after the initial ->probe() */ + +int __init +generic_mps_oem_check(struct mpc_table *mpc, char *oem, char *productid) +{ + int i; + + for (i = 0; apic_probe[i]; ++i) { + if (!apic_probe[i]->mps_oem_check) + continue; + if (!apic_probe[i]->mps_oem_check(mpc, oem, productid)) + continue; + + if (!cmdline_apic) { + apic = apic_probe[i]; + if (x86_quirks->update_apic) + x86_quirks->update_apic(); + printk(KERN_INFO "Switched to APIC driver `%s'.\n", + apic->name); + } + return 1; + } + return 0; +} + +int __init default_acpi_madt_oem_check(char *oem_id, char *oem_table_id) +{ + int i; + + for (i = 0; apic_probe[i]; ++i) { + if (!apic_probe[i]->acpi_madt_oem_check) + continue; + if (!apic_probe[i]->acpi_madt_oem_check(oem_id, oem_table_id)) + continue; + + if (!cmdline_apic) { + apic = apic_probe[i]; + if (x86_quirks->update_apic) + x86_quirks->update_apic(); + printk(KERN_INFO "Switched to APIC driver `%s'.\n", + apic->name); + } + return 1; + } + return 0; +} Index: linux-2.6-tip/arch/x86/kernel/apic/probe_64.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/apic/probe_64.c @@ -0,0 +1,96 @@ +/* + * Copyright 2004 James Cleverdon, IBM. + * Subject to the GNU Public License, v.2 + * + * Generic APIC sub-arch probe layer. + * + * Hacked for x86-64 by James Cleverdon from i386 architecture code by + * Martin Bligh, Andi Kleen, James Bottomley, John Stultz, and + * James Cleverdon. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +extern struct apic apic_flat; +extern struct apic apic_physflat; +extern struct apic apic_x2xpic_uv_x; +extern struct apic apic_x2apic_phys; +extern struct apic apic_x2apic_cluster; + +struct apic __read_mostly *apic = &apic_flat; +EXPORT_SYMBOL_GPL(apic); + +static struct apic *apic_probe[] __initdata = { +#ifdef CONFIG_X86_UV + &apic_x2apic_uv_x, +#endif +#ifdef CONFIG_X86_X2APIC + &apic_x2apic_phys, + &apic_x2apic_cluster, +#endif + &apic_physflat, + NULL, +}; + +/* + * Check the APIC IDs in bios_cpu_apicid and choose the APIC mode. + */ +void __init default_setup_apic_routing(void) +{ +#ifdef CONFIG_X86_X2APIC + if (x2apic && (apic != &apic_x2apic_phys && +#ifdef CONFIG_X86_UV + apic != &apic_x2apic_uv_x && +#endif + apic != &apic_x2apic_cluster)) { + if (x2apic_phys) + apic = &apic_x2apic_phys; + else + apic = &apic_x2apic_cluster; + printk(KERN_INFO "Setting APIC routing to %s\n", apic->name); + } +#endif + + if (apic == &apic_flat) { + if (max_physical_apicid >= 8) + apic = &apic_physflat; + printk(KERN_INFO "Setting APIC routing to %s\n", apic->name); + } + + if (x86_quirks->update_apic) + x86_quirks->update_apic(); +} + +/* Same for both flat and physical. */ + +void apic_send_IPI_self(int vector) +{ + __default_send_IPI_shortcut(APIC_DEST_SELF, vector, APIC_DEST_PHYSICAL); +} + +int __init default_acpi_madt_oem_check(char *oem_id, char *oem_table_id) +{ + int i; + + for (i = 0; apic_probe[i]; ++i) { + if (apic_probe[i]->acpi_madt_oem_check(oem_id, oem_table_id)) { + apic = apic_probe[i]; + printk(KERN_INFO "Setting APIC routing to %s.\n", + apic->name); + return 1; + } + } + return 0; +} Index: linux-2.6-tip/arch/x86/kernel/apic/summit_32.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/apic/summit_32.c @@ -0,0 +1,601 @@ +/* + * IBM Summit-Specific Code + * + * Written By: Matthew Dobson, IBM Corporation + * + * Copyright (c) 2003 IBM Corp. + * + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or (at + * your option) any later version. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or + * NON INFRINGEMENT. See the GNU General Public License for more + * details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + * + * Send feedback to + * + */ + +#include +#include +#include +#include + +/* + * APIC driver for the IBM "Summit" chipset. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static inline unsigned summit_get_apic_id(unsigned long x) +{ + return (x >> 24) & 0xFF; +} + +static inline void summit_send_IPI_mask(const cpumask_t *mask, int vector) +{ + default_send_IPI_mask_sequence_logical(mask, vector); +} + +static inline void summit_send_IPI_allbutself(int vector) +{ + cpumask_t mask = cpu_online_map; + cpu_clear(smp_processor_id(), mask); + + if (!cpus_empty(mask)) + summit_send_IPI_mask(&mask, vector); +} + +static inline void summit_send_IPI_all(int vector) +{ + summit_send_IPI_mask(&cpu_online_map, vector); +} + +#include + +extern int use_cyclone; + +#ifdef CONFIG_X86_SUMMIT_NUMA +extern void setup_summit(void); +#else +#define setup_summit() {} +#endif + +static inline int +summit_mps_oem_check(struct mpc_table *mpc, char *oem, char *productid) +{ + if (!strncmp(oem, "IBM ENSW", 8) && + (!strncmp(productid, "VIGIL SMP", 9) + || !strncmp(productid, "EXA", 3) + || !strncmp(productid, "RUTHLESS SMP", 12))){ + mark_tsc_unstable("Summit based system"); + use_cyclone = 1; /*enable cyclone-timer*/ + setup_summit(); + return 1; + } + return 0; +} + +/* Hook from generic ACPI tables.c */ +static inline int summit_acpi_madt_oem_check(char *oem_id, char *oem_table_id) +{ + if (!strncmp(oem_id, "IBM", 3) && + (!strncmp(oem_table_id, "SERVIGIL", 8) + || !strncmp(oem_table_id, "EXA", 3))){ + mark_tsc_unstable("Summit based system"); + use_cyclone = 1; /*enable cyclone-timer*/ + setup_summit(); + return 1; + } + return 0; +} + +struct rio_table_hdr { + unsigned char version; /* Version number of this data structure */ + /* Version 3 adds chassis_num & WP_index */ + unsigned char num_scal_dev; /* # of Scalability devices (Twisters for Vigil) */ + unsigned char num_rio_dev; /* # of RIO I/O devices (Cyclones and Winnipegs) */ +} __attribute__((packed)); + +struct scal_detail { + unsigned char node_id; /* Scalability Node ID */ + unsigned long CBAR; /* Address of 1MB register space */ + unsigned char port0node; /* Node ID port connected to: 0xFF=None */ + unsigned char port0port; /* Port num port connected to: 0,1,2, or 0xFF=None */ + unsigned char port1node; /* Node ID port connected to: 0xFF = None */ + unsigned char port1port; /* Port num port connected to: 0,1,2, or 0xFF=None */ + unsigned char port2node; /* Node ID port connected to: 0xFF = None */ + unsigned char port2port; /* Port num port connected to: 0,1,2, or 0xFF=None */ + unsigned char chassis_num; /* 1 based Chassis number (1 = boot node) */ +} __attribute__((packed)); + +struct rio_detail { + unsigned char node_id; /* RIO Node ID */ + unsigned long BBAR; /* Address of 1MB register space */ + unsigned char type; /* Type of device */ + unsigned char owner_id; /* For WPEG: Node ID of Cyclone that owns this WPEG*/ + /* For CYC: Node ID of Twister that owns this CYC */ + unsigned char port0node; /* Node ID port connected to: 0xFF=None */ + unsigned char port0port; /* Port num port connected to: 0,1,2, or 0xFF=None */ + unsigned char port1node; /* Node ID port connected to: 0xFF=None */ + unsigned char port1port; /* Port num port connected to: 0,1,2, or 0xFF=None */ + unsigned char first_slot; /* For WPEG: Lowest slot number below this WPEG */ + /* For CYC: 0 */ + unsigned char status; /* For WPEG: Bit 0 = 1 : the XAPIC is used */ + /* = 0 : the XAPIC is not used, ie:*/ + /* ints fwded to another XAPIC */ + /* Bits1:7 Reserved */ + /* For CYC: Bits0:7 Reserved */ + unsigned char WP_index; /* For WPEG: WPEG instance index - lower ones have */ + /* lower slot numbers/PCI bus numbers */ + /* For CYC: No meaning */ + unsigned char chassis_num; /* 1 based Chassis number */ + /* For LookOut WPEGs this field indicates the */ + /* Expansion Chassis #, enumerated from Boot */ + /* Node WPEG external port, then Boot Node CYC */ + /* external port, then Next Vigil chassis WPEG */ + /* external port, etc. */ + /* Shared Lookouts have only 1 chassis number (the */ + /* first one assigned) */ +} __attribute__((packed)); + + +typedef enum { + CompatTwister = 0, /* Compatibility Twister */ + AltTwister = 1, /* Alternate Twister of internal 8-way */ + CompatCyclone = 2, /* Compatibility Cyclone */ + AltCyclone = 3, /* Alternate Cyclone of internal 8-way */ + CompatWPEG = 4, /* Compatibility WPEG */ + AltWPEG = 5, /* Second Planar WPEG */ + LookOutAWPEG = 6, /* LookOut WPEG */ + LookOutBWPEG = 7, /* LookOut WPEG */ +} node_type; + +static inline int is_WPEG(struct rio_detail *rio){ + return (rio->type == CompatWPEG || rio->type == AltWPEG || + rio->type == LookOutAWPEG || rio->type == LookOutBWPEG); +} + + +/* In clustered mode, the high nibble of APIC ID is a cluster number. + * The low nibble is a 4-bit bitmap. */ +#define XAPIC_DEST_CPUS_SHIFT 4 +#define XAPIC_DEST_CPUS_MASK ((1u << XAPIC_DEST_CPUS_SHIFT) - 1) +#define XAPIC_DEST_CLUSTER_MASK (XAPIC_DEST_CPUS_MASK << XAPIC_DEST_CPUS_SHIFT) + +#define SUMMIT_APIC_DFR_VALUE (APIC_DFR_CLUSTER) + +static inline const cpumask_t *summit_target_cpus(void) +{ + /* CPU_MASK_ALL (0xff) has undefined behaviour with + * dest_LowestPrio mode logical clustered apic interrupt routing + * Just start on cpu 0. IRQ balancing will spread load + */ + return &cpumask_of_cpu(0); +} + +static inline unsigned long +summit_check_apicid_used(physid_mask_t bitmap, int apicid) +{ + return 0; +} + +/* we don't use the phys_cpu_present_map to indicate apicid presence */ +static inline unsigned long summit_check_apicid_present(int bit) +{ + return 1; +} + +static inline void summit_init_apic_ldr(void) +{ + unsigned long val, id; + int count = 0; + u8 my_id = (u8)hard_smp_processor_id(); + u8 my_cluster = APIC_CLUSTER(my_id); +#ifdef CONFIG_SMP + u8 lid; + int i; + + /* Create logical APIC IDs by counting CPUs already in cluster. */ + for (count = 0, i = nr_cpu_ids; --i >= 0; ) { + lid = cpu_2_logical_apicid[i]; + if (lid != BAD_APICID && APIC_CLUSTER(lid) == my_cluster) + ++count; + } +#endif + /* We only have a 4 wide bitmap in cluster mode. If a deranged + * BIOS puts 5 CPUs in one APIC cluster, we're hosed. */ + BUG_ON(count >= XAPIC_DEST_CPUS_SHIFT); + id = my_cluster | (1UL << count); + apic_write(APIC_DFR, SUMMIT_APIC_DFR_VALUE); + val = apic_read(APIC_LDR) & ~APIC_LDR_MASK; + val |= SET_APIC_LOGICAL_ID(id); + apic_write(APIC_LDR, val); +} + +static inline int summit_apic_id_registered(void) +{ + return 1; +} + +static inline void summit_setup_apic_routing(void) +{ + printk("Enabling APIC mode: Summit. Using %d I/O APICs\n", + nr_ioapics); +} + +static inline int summit_apicid_to_node(int logical_apicid) +{ +#ifdef CONFIG_SMP + return apicid_2_node[hard_smp_processor_id()]; +#else + return 0; +#endif +} + +/* Mapping from cpu number to logical apicid */ +static inline int summit_cpu_to_logical_apicid(int cpu) +{ +#ifdef CONFIG_SMP + if (cpu >= nr_cpu_ids) + return BAD_APICID; + return cpu_2_logical_apicid[cpu]; +#else + return logical_smp_processor_id(); +#endif +} + +static inline int summit_cpu_present_to_apicid(int mps_cpu) +{ + if (mps_cpu < nr_cpu_ids) + return (int)per_cpu(x86_bios_cpu_apicid, mps_cpu); + else + return BAD_APICID; +} + +static inline physid_mask_t +summit_ioapic_phys_id_map(physid_mask_t phys_id_map) +{ + /* For clustered we don't have a good way to do this yet - hack */ + return physids_promote(0x0F); +} + +static inline physid_mask_t summit_apicid_to_cpu_present(int apicid) +{ + return physid_mask_of_physid(0); +} + +static inline void summit_setup_portio_remap(void) +{ +} + +static inline int summit_check_phys_apicid_present(int boot_cpu_physical_apicid) +{ + return 1; +} + +static inline unsigned int summit_cpu_mask_to_apicid(const cpumask_t *cpumask) +{ + int cpus_found = 0; + int num_bits_set; + int apicid; + int cpu; + + num_bits_set = cpus_weight(*cpumask); + /* Return id to all */ + if (num_bits_set >= nr_cpu_ids) + return 0xFF; + /* + * The cpus in the mask must all be on the apic cluster. If are not + * on the same apicid cluster return default value of target_cpus(): + */ + cpu = first_cpu(*cpumask); + apicid = summit_cpu_to_logical_apicid(cpu); + + while (cpus_found < num_bits_set) { + if (cpu_isset(cpu, *cpumask)) { + int new_apicid = summit_cpu_to_logical_apicid(cpu); + + if (APIC_CLUSTER(apicid) != APIC_CLUSTER(new_apicid)) { + printk ("%s: Not a valid mask!\n", __func__); + + return 0xFF; + } + apicid = apicid | new_apicid; + cpus_found++; + } + cpu++; + } + return apicid; +} + +static inline unsigned int +summit_cpu_mask_to_apicid_and(const struct cpumask *inmask, + const struct cpumask *andmask) +{ + int apicid = summit_cpu_to_logical_apicid(0); + cpumask_var_t cpumask; + + if (!alloc_cpumask_var(&cpumask, GFP_ATOMIC)) + return apicid; + + cpumask_and(cpumask, inmask, andmask); + cpumask_and(cpumask, cpumask, cpu_online_mask); + apicid = summit_cpu_mask_to_apicid(cpumask); + + free_cpumask_var(cpumask); + + return apicid; +} + +/* + * cpuid returns the value latched in the HW at reset, not the APIC ID + * register's value. For any box whose BIOS changes APIC IDs, like + * clustered APIC systems, we must use hard_smp_processor_id. + * + * See Intel's IA-32 SW Dev's Manual Vol2 under CPUID. + */ +static inline int summit_phys_pkg_id(int cpuid_apic, int index_msb) +{ + return hard_smp_processor_id() >> index_msb; +} + +static int probe_summit(void) +{ + /* probed later in mptable/ACPI hooks */ + return 0; +} + +static void summit_vector_allocation_domain(int cpu, cpumask_t *retmask) +{ + /* Careful. Some cpus do not strictly honor the set of cpus + * specified in the interrupt destination when using lowest + * priority interrupt delivery mode. + * + * In particular there was a hyperthreading cpu observed to + * deliver interrupts to the wrong hyperthread when only one + * hyperthread was specified in the interrupt desitination. + */ + *retmask = (cpumask_t){ { [0] = APIC_ALL_CPUS, } }; +} + +#ifdef CONFIG_X86_SUMMIT_NUMA +static struct rio_table_hdr *rio_table_hdr __initdata; +static struct scal_detail *scal_devs[MAX_NUMNODES] __initdata; +static struct rio_detail *rio_devs[MAX_NUMNODES*4] __initdata; + +#ifndef CONFIG_X86_NUMAQ +static int mp_bus_id_to_node[MAX_MP_BUSSES] __initdata; +#endif + +static int __init setup_pci_node_map_for_wpeg(int wpeg_num, int last_bus) +{ + int twister = 0, node = 0; + int i, bus, num_buses; + + for (i = 0; i < rio_table_hdr->num_rio_dev; i++) { + if (rio_devs[i]->node_id == rio_devs[wpeg_num]->owner_id) { + twister = rio_devs[i]->owner_id; + break; + } + } + if (i == rio_table_hdr->num_rio_dev) { + printk(KERN_ERR "%s: Couldn't find owner Cyclone for Winnipeg!\n", __func__); + return last_bus; + } + + for (i = 0; i < rio_table_hdr->num_scal_dev; i++) { + if (scal_devs[i]->node_id == twister) { + node = scal_devs[i]->node_id; + break; + } + } + if (i == rio_table_hdr->num_scal_dev) { + printk(KERN_ERR "%s: Couldn't find owner Twister for Cyclone!\n", __func__); + return last_bus; + } + + switch (rio_devs[wpeg_num]->type) { + case CompatWPEG: + /* + * The Compatibility Winnipeg controls the 2 legacy buses, + * the 66MHz PCI bus [2 slots] and the 2 "extra" buses in case + * a PCI-PCI bridge card is used in either slot: total 5 buses. + */ + num_buses = 5; + break; + case AltWPEG: + /* + * The Alternate Winnipeg controls the 2 133MHz buses [1 slot + * each], their 2 "extra" buses, the 100MHz bus [2 slots] and + * the "extra" buses for each of those slots: total 7 buses. + */ + num_buses = 7; + break; + case LookOutAWPEG: + case LookOutBWPEG: + /* + * A Lookout Winnipeg controls 3 100MHz buses [2 slots each] + * & the "extra" buses for each of those slots: total 9 buses. + */ + num_buses = 9; + break; + default: + printk(KERN_INFO "%s: Unsupported Winnipeg type!\n", __func__); + return last_bus; + } + + for (bus = last_bus; bus < last_bus + num_buses; bus++) + mp_bus_id_to_node[bus] = node; + return bus; +} + +static int __init build_detail_arrays(void) +{ + unsigned long ptr; + int i, scal_detail_size, rio_detail_size; + + if (rio_table_hdr->num_scal_dev > MAX_NUMNODES) { + printk(KERN_WARNING "%s: MAX_NUMNODES too low! Defined as %d, but system has %d nodes.\n", __func__, MAX_NUMNODES, rio_table_hdr->num_scal_dev); + return 0; + } + + switch (rio_table_hdr->version) { + default: + printk(KERN_WARNING "%s: Invalid Rio Grande Table Version: %d\n", __func__, rio_table_hdr->version); + return 0; + case 2: + scal_detail_size = 11; + rio_detail_size = 13; + break; + case 3: + scal_detail_size = 12; + rio_detail_size = 15; + break; + } + + ptr = (unsigned long)rio_table_hdr + 3; + for (i = 0; i < rio_table_hdr->num_scal_dev; i++, ptr += scal_detail_size) + scal_devs[i] = (struct scal_detail *)ptr; + + for (i = 0; i < rio_table_hdr->num_rio_dev; i++, ptr += rio_detail_size) + rio_devs[i] = (struct rio_detail *)ptr; + + return 1; +} + +void __init setup_summit(void) +{ + unsigned long ptr; + unsigned short offset; + int i, next_wpeg, next_bus = 0; + + /* The pointer to the EBDA is stored in the word @ phys 0x40E(40:0E) */ + ptr = get_bios_ebda(); + ptr = (unsigned long)phys_to_virt(ptr); + + rio_table_hdr = NULL; + offset = 0x180; + while (offset) { + /* The block id is stored in the 2nd word */ + if (*((unsigned short *)(ptr + offset + 2)) == 0x4752) { + /* set the pointer past the offset & block id */ + rio_table_hdr = (struct rio_table_hdr *)(ptr + offset + 4); + break; + } + /* The next offset is stored in the 1st word. 0 means no more */ + offset = *((unsigned short *)(ptr + offset)); + } + if (!rio_table_hdr) { + printk(KERN_ERR "%s: Unable to locate Rio Grande Table in EBDA - bailing!\n", __func__); + return; + } + + if (!build_detail_arrays()) + return; + + /* The first Winnipeg we're looking for has an index of 0 */ + next_wpeg = 0; + do { + for (i = 0; i < rio_table_hdr->num_rio_dev; i++) { + if (is_WPEG(rio_devs[i]) && rio_devs[i]->WP_index == next_wpeg) { + /* It's the Winnipeg we're looking for! */ + next_bus = setup_pci_node_map_for_wpeg(i, next_bus); + next_wpeg++; + break; + } + } + /* + * If we go through all Rio devices and don't find one with + * the next index, it means we've found all the Winnipegs, + * and thus all the PCI buses. + */ + if (i == rio_table_hdr->num_rio_dev) + next_wpeg = 0; + } while (next_wpeg != 0); +} +#endif + +struct apic apic_summit = { + + .name = "summit", + .probe = probe_summit, + .acpi_madt_oem_check = summit_acpi_madt_oem_check, + .apic_id_registered = summit_apic_id_registered, + + .irq_delivery_mode = dest_LowestPrio, + /* logical delivery broadcast to all CPUs: */ + .irq_dest_mode = 1, + + .target_cpus = summit_target_cpus, + .disable_esr = 1, + .dest_logical = APIC_DEST_LOGICAL, + .check_apicid_used = summit_check_apicid_used, + .check_apicid_present = summit_check_apicid_present, + + .vector_allocation_domain = summit_vector_allocation_domain, + .init_apic_ldr = summit_init_apic_ldr, + + .ioapic_phys_id_map = summit_ioapic_phys_id_map, + .setup_apic_routing = summit_setup_apic_routing, + .multi_timer_check = NULL, + .apicid_to_node = summit_apicid_to_node, + .cpu_to_logical_apicid = summit_cpu_to_logical_apicid, + .cpu_present_to_apicid = summit_cpu_present_to_apicid, + .apicid_to_cpu_present = summit_apicid_to_cpu_present, + .setup_portio_remap = NULL, + .check_phys_apicid_present = summit_check_phys_apicid_present, + .enable_apic_mode = NULL, + .phys_pkg_id = summit_phys_pkg_id, + .mps_oem_check = summit_mps_oem_check, + + .get_apic_id = summit_get_apic_id, + .set_apic_id = NULL, + .apic_id_mask = 0xFF << 24, + + .cpu_mask_to_apicid = summit_cpu_mask_to_apicid, + .cpu_mask_to_apicid_and = summit_cpu_mask_to_apicid_and, + + .send_IPI_mask = summit_send_IPI_mask, + .send_IPI_mask_allbutself = NULL, + .send_IPI_allbutself = summit_send_IPI_allbutself, + .send_IPI_all = summit_send_IPI_all, + .send_IPI_self = default_send_IPI_self, + + .wakeup_cpu = NULL, + .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW, + .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH, + + .wait_for_init_deassert = default_wait_for_init_deassert, + + .smp_callin_clear_local_apic = NULL, + .inquire_remote_apic = default_inquire_remote_apic, + + .read = native_apic_mem_read, + .write = native_apic_mem_write, + .icr_read = native_apic_icr_read, + .icr_write = native_apic_icr_write, + .wait_icr_idle = native_apic_wait_icr_idle, + .safe_wait_icr_idle = native_safe_apic_wait_icr_idle, +}; Index: linux-2.6-tip/arch/x86/kernel/apic/x2apic_cluster.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/apic/x2apic_cluster.c @@ -0,0 +1,240 @@ +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +DEFINE_PER_CPU(u32, x86_cpu_to_logical_apicid); + +static int x2apic_acpi_madt_oem_check(char *oem_id, char *oem_table_id) +{ + return x2apic_enabled(); +} + +/* Start with all IRQs pointing to boot CPU. IRQ balancing will shift them. */ + +static const struct cpumask *x2apic_target_cpus(void) +{ + return cpumask_of(0); +} + +/* + * for now each logical cpu is in its own vector allocation domain. + */ +static void x2apic_vector_allocation_domain(int cpu, struct cpumask *retmask) +{ + cpumask_clear(retmask); + cpumask_set_cpu(cpu, retmask); +} + +static void + __x2apic_send_IPI_dest(unsigned int apicid, int vector, unsigned int dest) +{ + unsigned long cfg; + + cfg = __prepare_ICR(0, vector, dest); + + /* + * send the IPI. + */ + native_x2apic_icr_write(cfg, apicid); +} + +/* + * for now, we send the IPI's one by one in the cpumask. + * TBD: Based on the cpu mask, we can send the IPI's to the cluster group + * at once. We have 16 cpu's in a cluster. This will minimize IPI register + * writes. + */ +static void x2apic_send_IPI_mask(const struct cpumask *mask, int vector) +{ + unsigned long query_cpu; + unsigned long flags; + + local_irq_save(flags); + for_each_cpu(query_cpu, mask) { + __x2apic_send_IPI_dest( + per_cpu(x86_cpu_to_logical_apicid, query_cpu), + vector, apic->dest_logical); + } + local_irq_restore(flags); +} + +static void + x2apic_send_IPI_mask_allbutself(const struct cpumask *mask, int vector) +{ + unsigned long this_cpu = smp_processor_id(); + unsigned long query_cpu; + unsigned long flags; + + local_irq_save(flags); + for_each_cpu(query_cpu, mask) { + if (query_cpu == this_cpu) + continue; + __x2apic_send_IPI_dest( + per_cpu(x86_cpu_to_logical_apicid, query_cpu), + vector, apic->dest_logical); + } + local_irq_restore(flags); +} + +static void x2apic_send_IPI_allbutself(int vector) +{ + unsigned long this_cpu = smp_processor_id(); + unsigned long query_cpu; + unsigned long flags; + + local_irq_save(flags); + for_each_online_cpu(query_cpu) { + if (query_cpu == this_cpu) + continue; + __x2apic_send_IPI_dest( + per_cpu(x86_cpu_to_logical_apicid, query_cpu), + vector, apic->dest_logical); + } + local_irq_restore(flags); +} + +static void x2apic_send_IPI_all(int vector) +{ + x2apic_send_IPI_mask(cpu_online_mask, vector); +} + +static int x2apic_apic_id_registered(void) +{ + return 1; +} + +static unsigned int x2apic_cpu_mask_to_apicid(const struct cpumask *cpumask) +{ + /* + * We're using fixed IRQ delivery, can only return one logical APIC ID. + * May as well be the first. + */ + int cpu = cpumask_first(cpumask); + + if ((unsigned)cpu < nr_cpu_ids) + return per_cpu(x86_cpu_to_logical_apicid, cpu); + else + return BAD_APICID; +} + +static unsigned int +x2apic_cpu_mask_to_apicid_and(const struct cpumask *cpumask, + const struct cpumask *andmask) +{ + int cpu; + + /* + * We're using fixed IRQ delivery, can only return one logical APIC ID. + * May as well be the first. + */ + for_each_cpu_and(cpu, cpumask, andmask) { + if (cpumask_test_cpu(cpu, cpu_online_mask)) + break; + } + + if (cpu < nr_cpu_ids) + return per_cpu(x86_cpu_to_logical_apicid, cpu); + + return BAD_APICID; +} + +static unsigned int x2apic_cluster_phys_get_apic_id(unsigned long x) +{ + unsigned int id; + + id = x; + return id; +} + +static unsigned long set_apic_id(unsigned int id) +{ + unsigned long x; + + x = id; + return x; +} + +static int x2apic_cluster_phys_pkg_id(int initial_apicid, int index_msb) +{ + return current_cpu_data.initial_apicid >> index_msb; +} + +static void x2apic_send_IPI_self(int vector) +{ + apic_write(APIC_SELF_IPI, vector); +} + +static void init_x2apic_ldr(void) +{ + int cpu = smp_processor_id(); + + per_cpu(x86_cpu_to_logical_apicid, cpu) = apic_read(APIC_LDR); +} + +struct apic apic_x2apic_cluster = { + + .name = "cluster x2apic", + .probe = NULL, + .acpi_madt_oem_check = x2apic_acpi_madt_oem_check, + .apic_id_registered = x2apic_apic_id_registered, + + .irq_delivery_mode = dest_LowestPrio, + .irq_dest_mode = 1, /* logical */ + + .target_cpus = x2apic_target_cpus, + .disable_esr = 0, + .dest_logical = APIC_DEST_LOGICAL, + .check_apicid_used = NULL, + .check_apicid_present = NULL, + + .vector_allocation_domain = x2apic_vector_allocation_domain, + .init_apic_ldr = init_x2apic_ldr, + + .ioapic_phys_id_map = NULL, + .setup_apic_routing = NULL, + .multi_timer_check = NULL, + .apicid_to_node = NULL, + .cpu_to_logical_apicid = NULL, + .cpu_present_to_apicid = default_cpu_present_to_apicid, + .apicid_to_cpu_present = NULL, + .setup_portio_remap = NULL, + .check_phys_apicid_present = default_check_phys_apicid_present, + .enable_apic_mode = NULL, + .phys_pkg_id = x2apic_cluster_phys_pkg_id, + .mps_oem_check = NULL, + + .get_apic_id = x2apic_cluster_phys_get_apic_id, + .set_apic_id = set_apic_id, + .apic_id_mask = 0xFFFFFFFFu, + + .cpu_mask_to_apicid = x2apic_cpu_mask_to_apicid, + .cpu_mask_to_apicid_and = x2apic_cpu_mask_to_apicid_and, + + .send_IPI_mask = x2apic_send_IPI_mask, + .send_IPI_mask_allbutself = x2apic_send_IPI_mask_allbutself, + .send_IPI_allbutself = x2apic_send_IPI_allbutself, + .send_IPI_all = x2apic_send_IPI_all, + .send_IPI_self = x2apic_send_IPI_self, + + .wakeup_cpu = NULL, + .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW, + .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH, + .wait_for_init_deassert = NULL, + .smp_callin_clear_local_apic = NULL, + .inquire_remote_apic = NULL, + + .read = native_apic_msr_read, + .write = native_apic_msr_write, + .icr_read = native_x2apic_icr_read, + .icr_write = native_x2apic_icr_write, + .wait_icr_idle = native_x2apic_wait_icr_idle, + .safe_wait_icr_idle = native_safe_x2apic_wait_icr_idle, +}; Index: linux-2.6-tip/arch/x86/kernel/apic/x2apic_phys.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/apic/x2apic_phys.c @@ -0,0 +1,229 @@ +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +int x2apic_phys; + +static int set_x2apic_phys_mode(char *arg) +{ + x2apic_phys = 1; + return 0; +} +early_param("x2apic_phys", set_x2apic_phys_mode); + +static int x2apic_acpi_madt_oem_check(char *oem_id, char *oem_table_id) +{ + if (x2apic_phys) + return x2apic_enabled(); + else + return 0; +} + +/* Start with all IRQs pointing to boot CPU. IRQ balancing will shift them. */ + +static const struct cpumask *x2apic_target_cpus(void) +{ + return cpumask_of(0); +} + +static void x2apic_vector_allocation_domain(int cpu, struct cpumask *retmask) +{ + cpumask_clear(retmask); + cpumask_set_cpu(cpu, retmask); +} + +static void __x2apic_send_IPI_dest(unsigned int apicid, int vector, + unsigned int dest) +{ + unsigned long cfg; + + cfg = __prepare_ICR(0, vector, dest); + + /* + * send the IPI. + */ + native_x2apic_icr_write(cfg, apicid); +} + +static void x2apic_send_IPI_mask(const struct cpumask *mask, int vector) +{ + unsigned long query_cpu; + unsigned long flags; + + local_irq_save(flags); + for_each_cpu(query_cpu, mask) { + __x2apic_send_IPI_dest(per_cpu(x86_cpu_to_apicid, query_cpu), + vector, APIC_DEST_PHYSICAL); + } + local_irq_restore(flags); +} + +static void + x2apic_send_IPI_mask_allbutself(const struct cpumask *mask, int vector) +{ + unsigned long this_cpu = smp_processor_id(); + unsigned long query_cpu; + unsigned long flags; + + local_irq_save(flags); + for_each_cpu(query_cpu, mask) { + if (query_cpu != this_cpu) + __x2apic_send_IPI_dest( + per_cpu(x86_cpu_to_apicid, query_cpu), + vector, APIC_DEST_PHYSICAL); + } + local_irq_restore(flags); +} + +static void x2apic_send_IPI_allbutself(int vector) +{ + unsigned long this_cpu = smp_processor_id(); + unsigned long query_cpu; + unsigned long flags; + + local_irq_save(flags); + for_each_online_cpu(query_cpu) { + if (query_cpu == this_cpu) + continue; + __x2apic_send_IPI_dest(per_cpu(x86_cpu_to_apicid, query_cpu), + vector, APIC_DEST_PHYSICAL); + } + local_irq_restore(flags); +} + +static void x2apic_send_IPI_all(int vector) +{ + x2apic_send_IPI_mask(cpu_online_mask, vector); +} + +static int x2apic_apic_id_registered(void) +{ + return 1; +} + +static unsigned int x2apic_cpu_mask_to_apicid(const struct cpumask *cpumask) +{ + /* + * We're using fixed IRQ delivery, can only return one phys APIC ID. + * May as well be the first. + */ + int cpu = cpumask_first(cpumask); + + if ((unsigned)cpu < nr_cpu_ids) + return per_cpu(x86_cpu_to_apicid, cpu); + else + return BAD_APICID; +} + +static unsigned int +x2apic_cpu_mask_to_apicid_and(const struct cpumask *cpumask, + const struct cpumask *andmask) +{ + int cpu; + + /* + * We're using fixed IRQ delivery, can only return one phys APIC ID. + * May as well be the first. + */ + for_each_cpu_and(cpu, cpumask, andmask) { + if (cpumask_test_cpu(cpu, cpu_online_mask)) + break; + } + + if (cpu < nr_cpu_ids) + return per_cpu(x86_cpu_to_apicid, cpu); + + return BAD_APICID; +} + +static unsigned int x2apic_phys_get_apic_id(unsigned long x) +{ + return x; +} + +static unsigned long set_apic_id(unsigned int id) +{ + return id; +} + +static int x2apic_phys_pkg_id(int initial_apicid, int index_msb) +{ + return current_cpu_data.initial_apicid >> index_msb; +} + +static void x2apic_send_IPI_self(int vector) +{ + apic_write(APIC_SELF_IPI, vector); +} + +static void init_x2apic_ldr(void) +{ +} + +struct apic apic_x2apic_phys = { + + .name = "physical x2apic", + .probe = NULL, + .acpi_madt_oem_check = x2apic_acpi_madt_oem_check, + .apic_id_registered = x2apic_apic_id_registered, + + .irq_delivery_mode = dest_Fixed, + .irq_dest_mode = 0, /* physical */ + + .target_cpus = x2apic_target_cpus, + .disable_esr = 0, + .dest_logical = 0, + .check_apicid_used = NULL, + .check_apicid_present = NULL, + + .vector_allocation_domain = x2apic_vector_allocation_domain, + .init_apic_ldr = init_x2apic_ldr, + + .ioapic_phys_id_map = NULL, + .setup_apic_routing = NULL, + .multi_timer_check = NULL, + .apicid_to_node = NULL, + .cpu_to_logical_apicid = NULL, + .cpu_present_to_apicid = default_cpu_present_to_apicid, + .apicid_to_cpu_present = NULL, + .setup_portio_remap = NULL, + .check_phys_apicid_present = default_check_phys_apicid_present, + .enable_apic_mode = NULL, + .phys_pkg_id = x2apic_phys_pkg_id, + .mps_oem_check = NULL, + + .get_apic_id = x2apic_phys_get_apic_id, + .set_apic_id = set_apic_id, + .apic_id_mask = 0xFFFFFFFFu, + + .cpu_mask_to_apicid = x2apic_cpu_mask_to_apicid, + .cpu_mask_to_apicid_and = x2apic_cpu_mask_to_apicid_and, + + .send_IPI_mask = x2apic_send_IPI_mask, + .send_IPI_mask_allbutself = x2apic_send_IPI_mask_allbutself, + .send_IPI_allbutself = x2apic_send_IPI_allbutself, + .send_IPI_all = x2apic_send_IPI_all, + .send_IPI_self = x2apic_send_IPI_self, + + .wakeup_cpu = NULL, + .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW, + .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH, + .wait_for_init_deassert = NULL, + .smp_callin_clear_local_apic = NULL, + .inquire_remote_apic = NULL, + + .read = native_apic_msr_read, + .write = native_apic_msr_write, + .icr_read = native_x2apic_icr_read, + .icr_write = native_x2apic_icr_write, + .wait_icr_idle = native_x2apic_wait_icr_idle, + .safe_wait_icr_idle = native_safe_x2apic_wait_icr_idle, +}; Index: linux-2.6-tip/arch/x86/kernel/apic/x2apic_uv_x.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/apic/x2apic_uv_x.c @@ -0,0 +1,644 @@ +/* + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file "COPYING" in the main directory of this archive + * for more details. + * + * SGI UV APIC functions (note: not an Intel compatible APIC) + * + * Copyright (C) 2007-2008 Silicon Graphics, Inc. All rights reserved. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +DEFINE_PER_CPU(int, x2apic_extra_bits); + +static enum uv_system_type uv_system_type; + +static int uv_acpi_madt_oem_check(char *oem_id, char *oem_table_id) +{ + if (!strcmp(oem_id, "SGI")) { + if (!strcmp(oem_table_id, "UVL")) + uv_system_type = UV_LEGACY_APIC; + else if (!strcmp(oem_table_id, "UVX")) + uv_system_type = UV_X2APIC; + else if (!strcmp(oem_table_id, "UVH")) { + uv_system_type = UV_NON_UNIQUE_APIC; + return 1; + } + } + return 0; +} + +enum uv_system_type get_uv_system_type(void) +{ + return uv_system_type; +} + +int is_uv_system(void) +{ + return uv_system_type != UV_NONE; +} +EXPORT_SYMBOL_GPL(is_uv_system); + +DEFINE_PER_CPU(struct uv_hub_info_s, __uv_hub_info); +EXPORT_PER_CPU_SYMBOL_GPL(__uv_hub_info); + +struct uv_blade_info *uv_blade_info; +EXPORT_SYMBOL_GPL(uv_blade_info); + +short *uv_node_to_blade; +EXPORT_SYMBOL_GPL(uv_node_to_blade); + +short *uv_cpu_to_blade; +EXPORT_SYMBOL_GPL(uv_cpu_to_blade); + +short uv_possible_blades; +EXPORT_SYMBOL_GPL(uv_possible_blades); + +unsigned long sn_rtc_cycles_per_second; +EXPORT_SYMBOL(sn_rtc_cycles_per_second); + +/* Start with all IRQs pointing to boot CPU. IRQ balancing will shift them. */ + +static const struct cpumask *uv_target_cpus(void) +{ + return cpumask_of(0); +} + +static void uv_vector_allocation_domain(int cpu, struct cpumask *retmask) +{ + cpumask_clear(retmask); + cpumask_set_cpu(cpu, retmask); +} + +int uv_wakeup_secondary(int phys_apicid, unsigned int start_rip) +{ + unsigned long val; + int pnode; + + pnode = uv_apicid_to_pnode(phys_apicid); + val = (1UL << UVH_IPI_INT_SEND_SHFT) | + (phys_apicid << UVH_IPI_INT_APIC_ID_SHFT) | + (((long)start_rip << UVH_IPI_INT_VECTOR_SHFT) >> 12) | + APIC_DM_INIT; + uv_write_global_mmr64(pnode, UVH_IPI_INT, val); + mdelay(10); + + val = (1UL << UVH_IPI_INT_SEND_SHFT) | + (phys_apicid << UVH_IPI_INT_APIC_ID_SHFT) | + (((long)start_rip << UVH_IPI_INT_VECTOR_SHFT) >> 12) | + APIC_DM_STARTUP; + uv_write_global_mmr64(pnode, UVH_IPI_INT, val); + return 0; +} + +static void uv_send_IPI_one(int cpu, int vector) +{ + unsigned long val, apicid; + int pnode; + + apicid = per_cpu(x86_cpu_to_apicid, cpu); + pnode = uv_apicid_to_pnode(apicid); + + val = (1UL << UVH_IPI_INT_SEND_SHFT) | + (apicid << UVH_IPI_INT_APIC_ID_SHFT) | + (vector << UVH_IPI_INT_VECTOR_SHFT); + + uv_write_global_mmr64(pnode, UVH_IPI_INT, val); +} + +static void uv_send_IPI_mask(const struct cpumask *mask, int vector) +{ + unsigned int cpu; + + for_each_cpu(cpu, mask) + uv_send_IPI_one(cpu, vector); +} + +static void uv_send_IPI_mask_allbutself(const struct cpumask *mask, int vector) +{ + unsigned int this_cpu = smp_processor_id(); + unsigned int cpu; + + for_each_cpu(cpu, mask) { + if (cpu != this_cpu) + uv_send_IPI_one(cpu, vector); + } +} + +static void uv_send_IPI_allbutself(int vector) +{ + unsigned int this_cpu = smp_processor_id(); + unsigned int cpu; + + for_each_online_cpu(cpu) { + if (cpu != this_cpu) + uv_send_IPI_one(cpu, vector); + } +} + +static void uv_send_IPI_all(int vector) +{ + uv_send_IPI_mask(cpu_online_mask, vector); +} + +static int uv_apic_id_registered(void) +{ + return 1; +} + +static void uv_init_apic_ldr(void) +{ +} + +static unsigned int uv_cpu_mask_to_apicid(const struct cpumask *cpumask) +{ + /* + * We're using fixed IRQ delivery, can only return one phys APIC ID. + * May as well be the first. + */ + int cpu = cpumask_first(cpumask); + + if ((unsigned)cpu < nr_cpu_ids) + return per_cpu(x86_cpu_to_apicid, cpu); + else + return BAD_APICID; +} + +static unsigned int +uv_cpu_mask_to_apicid_and(const struct cpumask *cpumask, + const struct cpumask *andmask) +{ + int cpu; + + /* + * We're using fixed IRQ delivery, can only return one phys APIC ID. + * May as well be the first. + */ + for_each_cpu_and(cpu, cpumask, andmask) { + if (cpumask_test_cpu(cpu, cpu_online_mask)) + break; + } + if (cpu < nr_cpu_ids) + return per_cpu(x86_cpu_to_apicid, cpu); + + return BAD_APICID; +} + +static unsigned int x2apic_get_apic_id(unsigned long x) +{ + unsigned int id; + + WARN_ON(preemptible() && num_online_cpus() > 1); + id = x | __get_cpu_var(x2apic_extra_bits); + + return id; +} + +static unsigned long set_apic_id(unsigned int id) +{ + unsigned long x; + + /* maskout x2apic_extra_bits ? */ + x = id; + return x; +} + +static unsigned int uv_read_apic_id(void) +{ + + return x2apic_get_apic_id(apic_read(APIC_ID)); +} + +static int uv_phys_pkg_id(int initial_apicid, int index_msb) +{ + return uv_read_apic_id() >> index_msb; +} + +static void uv_send_IPI_self(int vector) +{ + apic_write(APIC_SELF_IPI, vector); +} + +struct apic apic_x2apic_uv_x = { + + .name = "UV large system", + .probe = NULL, + .acpi_madt_oem_check = uv_acpi_madt_oem_check, + .apic_id_registered = uv_apic_id_registered, + + .irq_delivery_mode = dest_Fixed, + .irq_dest_mode = 1, /* logical */ + + .target_cpus = uv_target_cpus, + .disable_esr = 0, + .dest_logical = APIC_DEST_LOGICAL, + .check_apicid_used = NULL, + .check_apicid_present = NULL, + + .vector_allocation_domain = uv_vector_allocation_domain, + .init_apic_ldr = uv_init_apic_ldr, + + .ioapic_phys_id_map = NULL, + .setup_apic_routing = NULL, + .multi_timer_check = NULL, + .apicid_to_node = NULL, + .cpu_to_logical_apicid = NULL, + .cpu_present_to_apicid = default_cpu_present_to_apicid, + .apicid_to_cpu_present = NULL, + .setup_portio_remap = NULL, + .check_phys_apicid_present = default_check_phys_apicid_present, + .enable_apic_mode = NULL, + .phys_pkg_id = uv_phys_pkg_id, + .mps_oem_check = NULL, + + .get_apic_id = x2apic_get_apic_id, + .set_apic_id = set_apic_id, + .apic_id_mask = 0xFFFFFFFFu, + + .cpu_mask_to_apicid = uv_cpu_mask_to_apicid, + .cpu_mask_to_apicid_and = uv_cpu_mask_to_apicid_and, + + .send_IPI_mask = uv_send_IPI_mask, + .send_IPI_mask_allbutself = uv_send_IPI_mask_allbutself, + .send_IPI_allbutself = uv_send_IPI_allbutself, + .send_IPI_all = uv_send_IPI_all, + .send_IPI_self = uv_send_IPI_self, + + .wakeup_cpu = NULL, + .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW, + .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH, + .wait_for_init_deassert = NULL, + .smp_callin_clear_local_apic = NULL, + .inquire_remote_apic = NULL, + + .read = native_apic_msr_read, + .write = native_apic_msr_write, + .icr_read = native_x2apic_icr_read, + .icr_write = native_x2apic_icr_write, + .wait_icr_idle = native_x2apic_wait_icr_idle, + .safe_wait_icr_idle = native_safe_x2apic_wait_icr_idle, +}; + +static __cpuinit void set_x2apic_extra_bits(int pnode) +{ + __get_cpu_var(x2apic_extra_bits) = (pnode << 6); +} + +/* + * Called on boot cpu. + */ +static __init int boot_pnode_to_blade(int pnode) +{ + int blade; + + for (blade = 0; blade < uv_num_possible_blades(); blade++) + if (pnode == uv_blade_info[blade].pnode) + return blade; + + panic("x2apic_uv: bad pnode!"); +} + +struct redir_addr { + unsigned long redirect; + unsigned long alias; +}; + +#define DEST_SHIFT UVH_RH_GAM_ALIAS210_REDIRECT_CONFIG_0_MMR_DEST_BASE_SHFT + +static __initdata struct redir_addr redir_addrs[] = { + {UVH_RH_GAM_ALIAS210_REDIRECT_CONFIG_0_MMR, UVH_SI_ALIAS0_OVERLAY_CONFIG}, + {UVH_RH_GAM_ALIAS210_REDIRECT_CONFIG_1_MMR, UVH_SI_ALIAS1_OVERLAY_CONFIG}, + {UVH_RH_GAM_ALIAS210_REDIRECT_CONFIG_2_MMR, UVH_SI_ALIAS2_OVERLAY_CONFIG}, +}; + +static __init void get_lowmem_redirect(unsigned long *base, unsigned long *size) +{ + union uvh_si_alias0_overlay_config_u alias; + union uvh_rh_gam_alias210_redirect_config_2_mmr_u redirect; + int i; + + for (i = 0; i < ARRAY_SIZE(redir_addrs); i++) { + alias.v = uv_read_local_mmr(redir_addrs[i].alias); + if (alias.s.base == 0) { + *size = (1UL << alias.s.m_alias); + redirect.v = uv_read_local_mmr(redir_addrs[i].redirect); + *base = (unsigned long)redirect.s.dest_base << DEST_SHIFT; + return; + } + } + panic("get_lowmem_redirect: no match!"); +} + +static __init void map_low_mmrs(void) +{ + init_extra_mapping_uc(UV_GLOBAL_MMR32_BASE, UV_GLOBAL_MMR32_SIZE); + init_extra_mapping_uc(UV_LOCAL_MMR_BASE, UV_LOCAL_MMR_SIZE); +} + +enum map_type {map_wb, map_uc}; + +static __init void map_high(char *id, unsigned long base, int shift, + int max_pnode, enum map_type map_type) +{ + unsigned long bytes, paddr; + + paddr = base << shift; + bytes = (1UL << shift) * (max_pnode + 1); + printk(KERN_INFO "UV: Map %s_HI 0x%lx - 0x%lx\n", id, paddr, + paddr + bytes); + if (map_type == map_uc) + init_extra_mapping_uc(paddr, bytes); + else + init_extra_mapping_wb(paddr, bytes); + +} +static __init void map_gru_high(int max_pnode) +{ + union uvh_rh_gam_gru_overlay_config_mmr_u gru; + int shift = UVH_RH_GAM_GRU_OVERLAY_CONFIG_MMR_BASE_SHFT; + + gru.v = uv_read_local_mmr(UVH_RH_GAM_GRU_OVERLAY_CONFIG_MMR); + if (gru.s.enable) + map_high("GRU", gru.s.base, shift, max_pnode, map_wb); +} + +static __init void map_config_high(int max_pnode) +{ + union uvh_rh_gam_cfg_overlay_config_mmr_u cfg; + int shift = UVH_RH_GAM_CFG_OVERLAY_CONFIG_MMR_BASE_SHFT; + + cfg.v = uv_read_local_mmr(UVH_RH_GAM_CFG_OVERLAY_CONFIG_MMR); + if (cfg.s.enable) + map_high("CONFIG", cfg.s.base, shift, max_pnode, map_uc); +} + +static __init void map_mmr_high(int max_pnode) +{ + union uvh_rh_gam_mmr_overlay_config_mmr_u mmr; + int shift = UVH_RH_GAM_MMR_OVERLAY_CONFIG_MMR_BASE_SHFT; + + mmr.v = uv_read_local_mmr(UVH_RH_GAM_MMR_OVERLAY_CONFIG_MMR); + if (mmr.s.enable) + map_high("MMR", mmr.s.base, shift, max_pnode, map_uc); +} + +static __init void map_mmioh_high(int max_pnode) +{ + union uvh_rh_gam_mmioh_overlay_config_mmr_u mmioh; + int shift = UVH_RH_GAM_MMIOH_OVERLAY_CONFIG_MMR_BASE_SHFT; + + mmioh.v = uv_read_local_mmr(UVH_RH_GAM_MMIOH_OVERLAY_CONFIG_MMR); + if (mmioh.s.enable) + map_high("MMIOH", mmioh.s.base, shift, max_pnode, map_uc); +} + +static __init void uv_rtc_init(void) +{ + long status; + u64 ticks_per_sec; + + status = uv_bios_freq_base(BIOS_FREQ_BASE_REALTIME_CLOCK, + &ticks_per_sec); + if (status != BIOS_STATUS_SUCCESS || ticks_per_sec < 100000) { + printk(KERN_WARNING + "unable to determine platform RTC clock frequency, " + "guessing.\n"); + /* BIOS gives wrong value for clock freq. so guess */ + sn_rtc_cycles_per_second = 1000000000000UL / 30000UL; + } else + sn_rtc_cycles_per_second = ticks_per_sec; +} + +/* + * percpu heartbeat timer + */ +static void uv_heartbeat(unsigned long ignored) +{ + struct timer_list *timer = &uv_hub_info->scir.timer; + unsigned char bits = uv_hub_info->scir.state; + + /* flip heartbeat bit */ + bits ^= SCIR_CPU_HEARTBEAT; + + /* is this cpu idle? */ + if (idle_cpu(raw_smp_processor_id())) + bits &= ~SCIR_CPU_ACTIVITY; + else + bits |= SCIR_CPU_ACTIVITY; + + /* update system controller interface reg */ + uv_set_scir_bits(bits); + + /* enable next timer period */ + mod_timer(timer, jiffies + SCIR_CPU_HB_INTERVAL); +} + +static void __cpuinit uv_heartbeat_enable(int cpu) +{ + if (!uv_cpu_hub_info(cpu)->scir.enabled) { + struct timer_list *timer = &uv_cpu_hub_info(cpu)->scir.timer; + + uv_set_cpu_scir_bits(cpu, SCIR_CPU_HEARTBEAT|SCIR_CPU_ACTIVITY); + setup_timer(timer, uv_heartbeat, cpu); + timer->expires = jiffies + SCIR_CPU_HB_INTERVAL; + add_timer_on(timer, cpu); + uv_cpu_hub_info(cpu)->scir.enabled = 1; + } + + /* check boot cpu */ + if (!uv_cpu_hub_info(0)->scir.enabled) + uv_heartbeat_enable(0); +} + +#ifdef CONFIG_HOTPLUG_CPU +static void __cpuinit uv_heartbeat_disable(int cpu) +{ + if (uv_cpu_hub_info(cpu)->scir.enabled) { + uv_cpu_hub_info(cpu)->scir.enabled = 0; + del_timer(&uv_cpu_hub_info(cpu)->scir.timer); + } + uv_set_cpu_scir_bits(cpu, 0xff); +} + +/* + * cpu hotplug notifier + */ +static __cpuinit int uv_scir_cpu_notify(struct notifier_block *self, + unsigned long action, void *hcpu) +{ + long cpu = (long)hcpu; + + switch (action) { + case CPU_ONLINE: + uv_heartbeat_enable(cpu); + break; + case CPU_DOWN_PREPARE: + uv_heartbeat_disable(cpu); + break; + default: + break; + } + return NOTIFY_OK; +} + +static __init void uv_scir_register_cpu_notifier(void) +{ + hotcpu_notifier(uv_scir_cpu_notify, 0); +} + +#else /* !CONFIG_HOTPLUG_CPU */ + +static __init void uv_scir_register_cpu_notifier(void) +{ +} + +static __init int uv_init_heartbeat(void) +{ + int cpu; + + if (is_uv_system()) + for_each_online_cpu(cpu) + uv_heartbeat_enable(cpu); + return 0; +} + +late_initcall(uv_init_heartbeat); + +#endif /* !CONFIG_HOTPLUG_CPU */ + +/* + * Called on each cpu to initialize the per_cpu UV data area. + * ZZZ hotplug not supported yet + */ +void __cpuinit uv_cpu_init(void) +{ + /* CPU 0 initilization will be done via uv_system_init. */ + if (!uv_blade_info) + return; + + uv_blade_info[uv_numa_blade_id()].nr_online_cpus++; + + if (get_uv_system_type() == UV_NON_UNIQUE_APIC) + set_x2apic_extra_bits(uv_hub_info->pnode); +} + + +void __init uv_system_init(void) +{ + union uvh_si_addr_map_config_u m_n_config; + union uvh_node_id_u node_id; + unsigned long gnode_upper, lowmem_redir_base, lowmem_redir_size; + int bytes, nid, cpu, lcpu, pnode, blade, i, j, m_val, n_val; + int max_pnode = 0; + unsigned long mmr_base, present; + + map_low_mmrs(); + + m_n_config.v = uv_read_local_mmr(UVH_SI_ADDR_MAP_CONFIG); + m_val = m_n_config.s.m_skt; + n_val = m_n_config.s.n_skt; + mmr_base = + uv_read_local_mmr(UVH_RH_GAM_MMR_OVERLAY_CONFIG_MMR) & + ~UV_MMR_ENABLE; + printk(KERN_DEBUG "UV: global MMR base 0x%lx\n", mmr_base); + + for(i = 0; i < UVH_NODE_PRESENT_TABLE_DEPTH; i++) + uv_possible_blades += + hweight64(uv_read_local_mmr( UVH_NODE_PRESENT_TABLE + i * 8)); + printk(KERN_DEBUG "UV: Found %d blades\n", uv_num_possible_blades()); + + bytes = sizeof(struct uv_blade_info) * uv_num_possible_blades(); + uv_blade_info = kmalloc(bytes, GFP_KERNEL); + + get_lowmem_redirect(&lowmem_redir_base, &lowmem_redir_size); + + bytes = sizeof(uv_node_to_blade[0]) * num_possible_nodes(); + uv_node_to_blade = kmalloc(bytes, GFP_KERNEL); + memset(uv_node_to_blade, 255, bytes); + + bytes = sizeof(uv_cpu_to_blade[0]) * num_possible_cpus(); + uv_cpu_to_blade = kmalloc(bytes, GFP_KERNEL); + memset(uv_cpu_to_blade, 255, bytes); + + blade = 0; + for (i = 0; i < UVH_NODE_PRESENT_TABLE_DEPTH; i++) { + present = uv_read_local_mmr(UVH_NODE_PRESENT_TABLE + i * 8); + for (j = 0; j < 64; j++) { + if (!test_bit(j, &present)) + continue; + uv_blade_info[blade].pnode = (i * 64 + j); + uv_blade_info[blade].nr_possible_cpus = 0; + uv_blade_info[blade].nr_online_cpus = 0; + blade++; + } + } + + node_id.v = uv_read_local_mmr(UVH_NODE_ID); + gnode_upper = (((unsigned long)node_id.s.node_id) & + ~((1 << n_val) - 1)) << m_val; + + uv_bios_init(); + uv_bios_get_sn_info(0, &uv_type, &sn_partition_id, + &sn_coherency_id, &sn_region_size); + uv_rtc_init(); + + for_each_present_cpu(cpu) { + nid = cpu_to_node(cpu); + pnode = uv_apicid_to_pnode(per_cpu(x86_cpu_to_apicid, cpu)); + blade = boot_pnode_to_blade(pnode); + lcpu = uv_blade_info[blade].nr_possible_cpus; + uv_blade_info[blade].nr_possible_cpus++; + + uv_cpu_hub_info(cpu)->lowmem_remap_base = lowmem_redir_base; + uv_cpu_hub_info(cpu)->lowmem_remap_top = lowmem_redir_size; + uv_cpu_hub_info(cpu)->m_val = m_val; + uv_cpu_hub_info(cpu)->n_val = m_val; + uv_cpu_hub_info(cpu)->numa_blade_id = blade; + uv_cpu_hub_info(cpu)->blade_processor_id = lcpu; + uv_cpu_hub_info(cpu)->pnode = pnode; + uv_cpu_hub_info(cpu)->pnode_mask = (1 << n_val) - 1; + uv_cpu_hub_info(cpu)->gpa_mask = (1 << (m_val + n_val)) - 1; + uv_cpu_hub_info(cpu)->gnode_upper = gnode_upper; + uv_cpu_hub_info(cpu)->global_mmr_base = mmr_base; + uv_cpu_hub_info(cpu)->coherency_domain_number = sn_coherency_id; + uv_cpu_hub_info(cpu)->scir.offset = SCIR_LOCAL_MMR_BASE + lcpu; + uv_node_to_blade[nid] = blade; + uv_cpu_to_blade[cpu] = blade; + max_pnode = max(pnode, max_pnode); + + printk(KERN_DEBUG "UV: cpu %d, apicid 0x%x, pnode %d, nid %d, " + "lcpu %d, blade %d\n", + cpu, per_cpu(x86_cpu_to_apicid, cpu), pnode, nid, + lcpu, blade); + } + + map_gru_high(max_pnode); + map_mmr_high(max_pnode); + map_config_high(max_pnode); + map_mmioh_high(max_pnode); + + uv_cpu_init(); + uv_scir_register_cpu_notifier(); + proc_mkdir("sgi_uv", NULL); +} Index: linux-2.6-tip/arch/x86/kernel/apm_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/apm_32.c +++ linux-2.6-tip/arch/x86/kernel/apm_32.c @@ -301,7 +301,7 @@ extern int (*console_blank_hook)(int); */ #define APM_ZERO_SEGS -#include "apm.h" +#include /* * Define to re-initialize the interrupt 0 timer to 100 Hz after a suspend. Index: linux-2.6-tip/arch/x86/kernel/asm-offsets_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/asm-offsets_32.c +++ linux-2.6-tip/arch/x86/kernel/asm-offsets_32.c @@ -75,6 +75,7 @@ void foo(void) OFFSET(PT_DS, pt_regs, ds); OFFSET(PT_ES, pt_regs, es); OFFSET(PT_FS, pt_regs, fs); + OFFSET(PT_GS, pt_regs, gs); OFFSET(PT_ORIG_EAX, pt_regs, orig_ax); OFFSET(PT_EIP, pt_regs, ip); OFFSET(PT_CS, pt_regs, cs); Index: linux-2.6-tip/arch/x86/kernel/asm-offsets_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/asm-offsets_64.c +++ linux-2.6-tip/arch/x86/kernel/asm-offsets_64.c @@ -11,7 +11,6 @@ #include #include #include -#include #include #include #include @@ -48,16 +47,6 @@ int main(void) #endif BLANK(); #undef ENTRY -#define ENTRY(entry) DEFINE(pda_ ## entry, offsetof(struct x8664_pda, entry)) - ENTRY(kernelstack); - ENTRY(oldrsp); - ENTRY(pcurrent); - ENTRY(irqcount); - ENTRY(cpunumber); - ENTRY(irqstackptr); - ENTRY(data_offset); - BLANK(); -#undef ENTRY #ifdef CONFIG_PARAVIRT BLANK(); OFFSET(PARAVIRT_enabled, pv_info, paravirt_enabled); Index: linux-2.6-tip/arch/x86/kernel/cpu/Makefile =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/cpu/Makefile +++ linux-2.6-tip/arch/x86/kernel/cpu/Makefile @@ -1,5 +1,5 @@ # -# Makefile for x86-compatible CPU details and quirks +# Makefile for x86-compatible CPU details, features and quirks # # Don't trace early stages of a secondary CPU boot @@ -22,11 +22,13 @@ obj-$(CONFIG_CPU_SUP_CENTAUR_64) += cent obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o obj-$(CONFIG_CPU_SUP_UMC_32) += umc.o -obj-$(CONFIG_X86_MCE) += mcheck/ -obj-$(CONFIG_MTRR) += mtrr/ -obj-$(CONFIG_CPU_FREQ) += cpufreq/ +obj-$(CONFIG_PERF_COUNTERS) += perf_counter.o -obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o +obj-$(CONFIG_X86_MCE) += mcheck/ +obj-$(CONFIG_MTRR) += mtrr/ +obj-$(CONFIG_CPU_FREQ) += cpufreq/ + +obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o quiet_cmd_mkcapflags = MKCAP $@ cmd_mkcapflags = $(PERL) $(srctree)/$(src)/mkcapflags.pl $< $@ Index: linux-2.6-tip/arch/x86/kernel/cpu/addon_cpuid_features.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/cpu/addon_cpuid_features.c +++ linux-2.6-tip/arch/x86/kernel/cpu/addon_cpuid_features.c @@ -7,7 +7,7 @@ #include #include -#include +#include struct cpuid_bit { u16 feature; @@ -69,7 +69,7 @@ void __cpuinit init_scattered_cpuid_feat */ void __cpuinit detect_extended_topology(struct cpuinfo_x86 *c) { -#ifdef CONFIG_X86_SMP +#ifdef CONFIG_SMP unsigned int eax, ebx, ecx, edx, sub_index; unsigned int ht_mask_width, core_plus_mask_width; unsigned int core_select_mask, core_level_siblings; @@ -116,22 +116,14 @@ void __cpuinit detect_extended_topology( core_select_mask = (~(-1 << core_plus_mask_width)) >> ht_mask_width; -#ifdef CONFIG_X86_32 - c->cpu_core_id = phys_pkg_id(c->initial_apicid, ht_mask_width) + c->cpu_core_id = apic->phys_pkg_id(c->initial_apicid, ht_mask_width) & core_select_mask; - c->phys_proc_id = phys_pkg_id(c->initial_apicid, core_plus_mask_width); + c->phys_proc_id = apic->phys_pkg_id(c->initial_apicid, core_plus_mask_width); /* * Reinit the apicid, now that we have extended initial_apicid. */ - c->apicid = phys_pkg_id(c->initial_apicid, 0); -#else - c->cpu_core_id = phys_pkg_id(ht_mask_width) & core_select_mask; - c->phys_proc_id = phys_pkg_id(core_plus_mask_width); - /* - * Reinit the apicid, now that we have extended initial_apicid. - */ - c->apicid = phys_pkg_id(0); -#endif + c->apicid = apic->phys_pkg_id(c->initial_apicid, 0); + c->x86_max_cores = (core_level_siblings / smp_num_siblings); @@ -143,37 +135,3 @@ void __cpuinit detect_extended_topology( return; #endif } - -#ifdef CONFIG_X86_PAT -void __cpuinit validate_pat_support(struct cpuinfo_x86 *c) -{ - if (!cpu_has_pat) - pat_disable("PAT not supported by CPU."); - - switch (c->x86_vendor) { - case X86_VENDOR_INTEL: - /* - * There is a known erratum on Pentium III and Core Solo - * and Core Duo CPUs. - * " Page with PAT set to WC while associated MTRR is UC - * may consolidate to UC " - * Because of this erratum, it is better to stick with - * setting WC in MTRR rather than using PAT on these CPUs. - * - * Enable PAT WC only on P4, Core 2 or later CPUs. - */ - if (c->x86 > 0x6 || (c->x86 == 6 && c->x86_model >= 15)) - return; - - pat_disable("PAT WC disabled due to known CPU erratum."); - return; - - case X86_VENDOR_AMD: - case X86_VENDOR_CENTAUR: - case X86_VENDOR_TRANSMETA: - return; - } - - pat_disable("PAT disabled. Not yet verified on this CPU type."); -} -#endif Index: linux-2.6-tip/arch/x86/kernel/cpu/amd.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/cpu/amd.c +++ linux-2.6-tip/arch/x86/kernel/cpu/amd.c @@ -12,8 +12,6 @@ # include #endif -#include - #include "cpu.h" #ifdef CONFIG_X86_32 Index: linux-2.6-tip/arch/x86/kernel/cpu/common.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/cpu/common.c +++ linux-2.6-tip/arch/x86/kernel/cpu/common.c @@ -17,18 +17,19 @@ #include #include #include +#include #include #include #include #include -#ifdef CONFIG_X86_LOCAL_APIC -#include +#include +#include #include -#include -#include + +#ifdef CONFIG_X86_LOCAL_APIC +#include #endif -#include #include #include #include @@ -37,6 +38,7 @@ #include #include #include +#include #include "cpu.h" @@ -50,6 +52,15 @@ cpumask_var_t cpu_initialized_mask; /* representing cpus for which sibling maps can be computed */ cpumask_var_t cpu_sibling_setup_mask; +/* correctly size the local cpu masks */ +void __init setup_cpu_local_masks(void) +{ + alloc_bootmem_cpumask_var(&cpu_initialized_mask); + alloc_bootmem_cpumask_var(&cpu_callin_mask); + alloc_bootmem_cpumask_var(&cpu_callout_mask); + alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask); +} + #else /* CONFIG_X86_32 */ cpumask_t cpu_callin_map; @@ -62,23 +73,23 @@ cpumask_t cpu_sibling_setup_map; static struct cpu_dev *this_cpu __cpuinitdata; +DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = { #ifdef CONFIG_X86_64 -/* We need valid kernel segments for data and code in long mode too - * IRET will check the segment types kkeil 2000/10/28 - * Also sysret mandates a special GDT layout - */ -/* The TLS descriptors are currently at a different place compared to i386. - Hopefully nobody expects them at a fixed place (Wine?) */ -DEFINE_PER_CPU(struct gdt_page, gdt_page) = { .gdt = { + /* + * We need valid kernel segments for data and code in long mode too + * IRET will check the segment types kkeil 2000/10/28 + * Also sysret mandates a special GDT layout + * + * The TLS descriptors are currently at a different place compared to i386. + * Hopefully nobody expects them at a fixed place (Wine?) + */ [GDT_ENTRY_KERNEL32_CS] = { { { 0x0000ffff, 0x00cf9b00 } } }, [GDT_ENTRY_KERNEL_CS] = { { { 0x0000ffff, 0x00af9b00 } } }, [GDT_ENTRY_KERNEL_DS] = { { { 0x0000ffff, 0x00cf9300 } } }, [GDT_ENTRY_DEFAULT_USER32_CS] = { { { 0x0000ffff, 0x00cffb00 } } }, [GDT_ENTRY_DEFAULT_USER_DS] = { { { 0x0000ffff, 0x00cff300 } } }, [GDT_ENTRY_DEFAULT_USER_CS] = { { { 0x0000ffff, 0x00affb00 } } }, -} }; #else -DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = { [GDT_ENTRY_KERNEL_CS] = { { { 0x0000ffff, 0x00cf9a00 } } }, [GDT_ENTRY_KERNEL_DS] = { { { 0x0000ffff, 0x00cf9200 } } }, [GDT_ENTRY_DEFAULT_USER_CS] = { { { 0x0000ffff, 0x00cffa00 } } }, @@ -110,9 +121,10 @@ DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_p [GDT_ENTRY_APMBIOS_BASE+2] = { { { 0x0000ffff, 0x00409200 } } }, [GDT_ENTRY_ESPFIX_SS] = { { { 0x00000000, 0x00c09200 } } }, - [GDT_ENTRY_PERCPU] = { { { 0x00000000, 0x00000000 } } }, -} }; + [GDT_ENTRY_PERCPU] = { { { 0x0000ffff, 0x00cf9200 } } }, + GDT_STACK_CANARY_INIT #endif +} }; EXPORT_PER_CPU_SYMBOL_GPL(gdt_page); #ifdef CONFIG_X86_32 @@ -213,6 +225,49 @@ static inline void squash_the_stupid_ser #endif /* + * Some CPU features depend on higher CPUID levels, which may not always + * be available due to CPUID level capping or broken virtualization + * software. Add those features to this table to auto-disable them. + */ +struct cpuid_dependent_feature { + u32 feature; + u32 level; +}; +static const struct cpuid_dependent_feature __cpuinitconst +cpuid_dependent_features[] = { + { X86_FEATURE_MWAIT, 0x00000005 }, + { X86_FEATURE_DCA, 0x00000009 }, + { X86_FEATURE_XSAVE, 0x0000000d }, + { 0, 0 } +}; + +static void __cpuinit filter_cpuid_features(struct cpuinfo_x86 *c, bool warn) +{ + const struct cpuid_dependent_feature *df; + for (df = cpuid_dependent_features; df->feature; df++) { + /* + * Note: cpuid_level is set to -1 if unavailable, but + * extended_extended_level is set to 0 if unavailable + * and the legitimate extended levels are all negative + * when signed; hence the weird messing around with + * signs here... + */ + if (cpu_has(c, df->feature) && + ((s32)df->level < 0 ? + (u32)df->level > (u32)c->extended_cpuid_level : + (s32)df->level > (s32)c->cpuid_level)) { + clear_cpu_cap(c, df->feature); + if (warn) + printk(KERN_WARNING + "CPU: CPU feature %s disabled " + "due to lack of CPUID level 0x%x\n", + x86_cap_flags[df->feature], + df->level); + } + } +} + +/* * Naming convention should be: [()] * This table only is used unless init_() below doesn't set it; * in particular, if CPUID levels 0x80000002..4 are supported, this isn't used @@ -242,18 +297,29 @@ static char __cpuinit *table_lookup_mode __u32 cleared_cpu_caps[NCAPINTS] __cpuinitdata; +void load_percpu_segment(int cpu) +{ +#ifdef CONFIG_X86_32 + loadsegment(fs, __KERNEL_PERCPU); +#else + loadsegment(gs, 0); + wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu)); +#endif + load_stack_canary_segment(); +} + /* Current gdt points %fs at the "master" per-cpu area: after this, * it's on the real one. */ -void switch_to_new_gdt(void) +void switch_to_new_gdt(int cpu) { struct desc_ptr gdt_descr; - gdt_descr.address = (long)get_cpu_gdt_table(smp_processor_id()); + gdt_descr.address = (long)get_cpu_gdt_table(cpu); gdt_descr.size = GDT_SIZE - 1; load_gdt(&gdt_descr); -#ifdef CONFIG_X86_32 - asm("mov %0, %%fs" : : "r" (__KERNEL_PERCPU) : "memory"); -#endif + /* Reload the per-cpu base */ + + load_percpu_segment(cpu); } static struct cpu_dev *cpu_devs[X86_VENDOR_NUM] = {}; @@ -383,11 +449,7 @@ void __cpuinit detect_ht(struct cpuinfo_ } index_msb = get_count_order(smp_num_siblings); -#ifdef CONFIG_X86_64 - c->phys_proc_id = phys_pkg_id(index_msb); -#else - c->phys_proc_id = phys_pkg_id(c->initial_apicid, index_msb); -#endif + c->phys_proc_id = apic->phys_pkg_id(c->initial_apicid, index_msb); smp_num_siblings = smp_num_siblings / c->x86_max_cores; @@ -395,13 +457,8 @@ void __cpuinit detect_ht(struct cpuinfo_ core_bits = get_count_order(c->x86_max_cores); -#ifdef CONFIG_X86_64 - c->cpu_core_id = phys_pkg_id(index_msb) & + c->cpu_core_id = apic->phys_pkg_id(c->initial_apicid, index_msb) & ((1 << core_bits) - 1); -#else - c->cpu_core_id = phys_pkg_id(c->initial_apicid, index_msb) & - ((1 << core_bits) - 1); -#endif } out: @@ -570,11 +627,10 @@ static void __init early_identify_cpu(st if (this_cpu->c_early_init) this_cpu->c_early_init(c); - validate_pat_support(c); - #ifdef CONFIG_SMP c->cpu_index = boot_cpu_id; #endif + filter_cpuid_features(c, false); } void __init early_cpu_init(void) @@ -637,7 +693,7 @@ static void __cpuinit generic_identify(s c->initial_apicid = (cpuid_ebx(1) >> 24) & 0xFF; #ifdef CONFIG_X86_32 # ifdef CONFIG_X86_HT - c->apicid = phys_pkg_id(c->initial_apicid, 0); + c->apicid = apic->phys_pkg_id(c->initial_apicid, 0); # else c->apicid = c->initial_apicid; # endif @@ -684,7 +740,7 @@ static void __cpuinit identify_cpu(struc this_cpu->c_identify(c); #ifdef CONFIG_X86_64 - c->apicid = phys_pkg_id(0); + c->apicid = apic->phys_pkg_id(c->initial_apicid, 0); #endif /* @@ -708,6 +764,9 @@ static void __cpuinit identify_cpu(struc * we do "generic changes." */ + /* Filter out anything that depends on CPUID levels we don't have */ + filter_cpuid_features(c, true); + /* If the model name is still unset, do table lookup. */ if (!c->x86_model_id[0]) { char *p; @@ -772,6 +831,7 @@ void __init identify_boot_cpu(void) #else vgetcpu_set_mode(); #endif + init_hw_perf_counters(); } void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c) @@ -877,54 +937,22 @@ static __init int setup_disablecpuid(cha __setup("clearcpuid=", setup_disablecpuid); #ifdef CONFIG_X86_64 -struct x8664_pda **_cpu_pda __read_mostly; -EXPORT_SYMBOL(_cpu_pda); - struct desc_ptr idt_descr = { 256 * 16 - 1, (unsigned long) idt_table }; -static char boot_cpu_stack[IRQSTACKSIZE] __page_aligned_bss; - -void __cpuinit pda_init(int cpu) -{ - struct x8664_pda *pda = cpu_pda(cpu); - - /* Setup up data that may be needed in __get_free_pages early */ - loadsegment(fs, 0); - loadsegment(gs, 0); - /* Memory clobbers used to order PDA accessed */ - mb(); - wrmsrl(MSR_GS_BASE, pda); - mb(); - - pda->cpunumber = cpu; - pda->irqcount = -1; - pda->kernelstack = (unsigned long)stack_thread_info() - - PDA_STACKOFFSET + THREAD_SIZE; - pda->active_mm = &init_mm; - pda->mmu_state = 0; - - if (cpu == 0) { - /* others are initialized in smpboot.c */ - pda->pcurrent = &init_task; - pda->irqstackptr = boot_cpu_stack; - pda->irqstackptr += IRQSTACKSIZE - 64; - } else { - if (!pda->irqstackptr) { - pda->irqstackptr = (char *) - __get_free_pages(GFP_ATOMIC, IRQSTACK_ORDER); - if (!pda->irqstackptr) - panic("cannot allocate irqstack for cpu %d", - cpu); - pda->irqstackptr += IRQSTACKSIZE - 64; - } - - if (pda->nodenumber == 0 && cpu_to_node(cpu) != NUMA_NO_NODE) - pda->nodenumber = cpu_to_node(cpu); - } -} - -static char boot_exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + - DEBUG_STKSZ] __page_aligned_bss; +DEFINE_PER_CPU_FIRST(union irq_stack_union, + irq_stack_union) __aligned(PAGE_SIZE); +DEFINE_PER_CPU(char *, irq_stack_ptr) = + init_per_cpu_var(irq_stack_union.irq_stack) + IRQ_STACK_SIZE - 64; + +DEFINE_PER_CPU(unsigned long, kernel_stack) = + (unsigned long)&init_thread_union - KERNEL_STACK_OFFSET + THREAD_SIZE; +EXPORT_PER_CPU_SYMBOL(kernel_stack); + +DEFINE_PER_CPU(unsigned int, irq_count) = -1; + +static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks + [(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]) + __aligned(PAGE_SIZE); extern asmlinkage void ignore_sysret(void); @@ -957,16 +985,21 @@ unsigned long kernel_eflags; */ DEFINE_PER_CPU(struct orig_ist, orig_ist); -#else +#else /* x86_64 */ -/* Make sure %fs is initialized properly in idle threads */ +#ifdef CONFIG_CC_STACKPROTECTOR +DEFINE_PER_CPU(unsigned long, stack_canary); +#endif + +/* Make sure %fs and %gs are initialized properly in idle threads */ struct pt_regs * __cpuinit idle_regs(struct pt_regs *regs) { memset(regs, 0, sizeof(struct pt_regs)); regs->fs = __KERNEL_PERCPU; + regs->gs = __KERNEL_STACK_CANARY; return regs; } -#endif +#endif /* x86_64 */ /* * cpu_init() initializes state that is per-CPU. Some data is already @@ -982,15 +1015,14 @@ void __cpuinit cpu_init(void) struct tss_struct *t = &per_cpu(init_tss, cpu); struct orig_ist *orig_ist = &per_cpu(orig_ist, cpu); unsigned long v; - char *estacks = NULL; struct task_struct *me; int i; - /* CPU 0 is initialised in head64.c */ - if (cpu != 0) - pda_init(cpu); - else - estacks = boot_exception_stacks; +#ifdef CONFIG_NUMA + if (cpu != 0 && percpu_read(node_number) == 0 && + cpu_to_node(cpu) != NUMA_NO_NODE) + percpu_write(node_number, cpu_to_node(cpu)); +#endif me = current; @@ -1006,7 +1038,9 @@ void __cpuinit cpu_init(void) * and set up the GDT descriptor: */ - switch_to_new_gdt(); + switch_to_new_gdt(cpu); + loadsegment(fs, 0); + load_idt((const struct desc_ptr *)&idt_descr); memset(me->thread.tls_array, 0, GDT_ENTRY_TLS_ENTRIES * 8); @@ -1017,25 +1051,22 @@ void __cpuinit cpu_init(void) barrier(); check_efer(); - if (cpu != 0 && x2apic) + if (cpu != 0) enable_x2apic(); /* * set up and load the per-CPU TSS */ if (!orig_ist->ist[0]) { - static const unsigned int order[N_EXCEPTION_STACKS] = { - [0 ... N_EXCEPTION_STACKS - 1] = EXCEPTION_STACK_ORDER, - [DEBUG_STACK - 1] = DEBUG_STACK_ORDER + static const unsigned int sizes[N_EXCEPTION_STACKS] = { + [0 ... N_EXCEPTION_STACKS - 1] = EXCEPTION_STKSZ, +#if DEBUG_STACK > 0 + [DEBUG_STACK - 1] = DEBUG_STKSZ +#endif }; + char *estacks = per_cpu(exception_stacks, cpu); for (v = 0; v < N_EXCEPTION_STACKS; v++) { - if (cpu) { - estacks = (char *)__get_free_pages(GFP_ATOMIC, order[v]); - if (!estacks) - panic("Cannot allocate exception " - "stack %ld %d\n", v, cpu); - } - estacks += PAGE_SIZE << order[v]; + estacks += sizes[v]; orig_ist->ist[v] = t->x86_tss.ist[v] = (unsigned long)estacks; } @@ -1069,22 +1100,19 @@ void __cpuinit cpu_init(void) */ if (kgdb_connected && arch_kgdb_ops.correct_hw_break) arch_kgdb_ops.correct_hw_break(); - else { + else #endif - /* - * Clear all 6 debug registers: - */ - - set_debugreg(0UL, 0); - set_debugreg(0UL, 1); - set_debugreg(0UL, 2); - set_debugreg(0UL, 3); - set_debugreg(0UL, 6); - set_debugreg(0UL, 7); -#ifdef CONFIG_KGDB - /* If the kgdb is connected no debug regs should be altered. */ + { + /* + * Clear all 6 debug registers: + */ + set_debugreg(0UL, 0); + set_debugreg(0UL, 1); + set_debugreg(0UL, 2); + set_debugreg(0UL, 3); + set_debugreg(0UL, 6); + set_debugreg(0UL, 7); } -#endif fpu_init(); @@ -1114,7 +1142,7 @@ void __cpuinit cpu_init(void) clear_in_cr4(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE); load_idt(&idt_descr); - switch_to_new_gdt(); + switch_to_new_gdt(cpu); /* * Set up and load the per-CPU TSS and LDT @@ -1135,9 +1163,6 @@ void __cpuinit cpu_init(void) __set_tss_desc(cpu, GDT_ENTRY_DOUBLEFAULT_TSS, &doublefault_tss); #endif - /* Clear %gs. */ - asm volatile ("mov %0, %%gs" : : "r" (0)); - /* Clear all 6 debug registers: */ set_debugreg(0, 0); set_debugreg(0, 1); Index: linux-2.6-tip/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c +++ linux-2.6-tip/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c @@ -33,7 +33,7 @@ #include #include #include -#include +#include #include #include @@ -70,6 +70,8 @@ struct acpi_cpufreq_data { static DEFINE_PER_CPU(struct acpi_cpufreq_data *, drv_data); +DEFINE_TRACE(power_mark); + /* acpi_perf_data is a pointer to percpu data. */ static struct acpi_processor_performance *acpi_perf_data; Index: linux-2.6-tip/arch/x86/kernel/cpu/cpufreq/e_powersaver.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/cpu/cpufreq/e_powersaver.c +++ linux-2.6-tip/arch/x86/kernel/cpu/cpufreq/e_powersaver.c @@ -204,12 +204,12 @@ static int eps_cpu_init(struct cpufreq_p } /* Enable Enhanced PowerSaver */ rdmsrl(MSR_IA32_MISC_ENABLE, val); - if (!(val & 1 << 16)) { - val |= 1 << 16; + if (!(val & MSR_IA32_MISC_ENABLE_ENHANCED_SPEEDSTEP)) { + val |= MSR_IA32_MISC_ENABLE_ENHANCED_SPEEDSTEP; wrmsrl(MSR_IA32_MISC_ENABLE, val); /* Can be locked at 0 */ rdmsrl(MSR_IA32_MISC_ENABLE, val); - if (!(val & 1 << 16)) { + if (!(val & MSR_IA32_MISC_ENABLE_ENHANCED_SPEEDSTEP)) { printk(KERN_INFO "eps: Can't enable Enhanced PowerSaver\n"); return -ENODEV; } Index: linux-2.6-tip/arch/x86/kernel/cpu/cpufreq/speedstep-centrino.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/cpu/cpufreq/speedstep-centrino.c +++ linux-2.6-tip/arch/x86/kernel/cpu/cpufreq/speedstep-centrino.c @@ -390,14 +390,14 @@ static int centrino_cpu_init(struct cpuf enable it if not. */ rdmsr(MSR_IA32_MISC_ENABLE, l, h); - if (!(l & (1<<16))) { - l |= (1<<16); + if (!(l & MSR_IA32_MISC_ENABLE_ENHANCED_SPEEDSTEP)) { + l |= MSR_IA32_MISC_ENABLE_ENHANCED_SPEEDSTEP; dprintk("trying to enable Enhanced SpeedStep (%x)\n", l); wrmsr(MSR_IA32_MISC_ENABLE, l, h); /* check to see if it stuck */ rdmsr(MSR_IA32_MISC_ENABLE, l, h); - if (!(l & (1<<16))) { + if (!(l & MSR_IA32_MISC_ENABLE_ENHANCED_SPEEDSTEP)) { printk(KERN_INFO PFX "couldn't enable Enhanced SpeedStep\n"); return -ENODEV; Index: linux-2.6-tip/arch/x86/kernel/cpu/intel.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/cpu/intel.c +++ linux-2.6-tip/arch/x86/kernel/cpu/intel.c @@ -24,7 +24,6 @@ #ifdef CONFIG_X86_LOCAL_APIC #include #include -#include #endif static void __cpuinit early_init_intel(struct cpuinfo_x86 *c) @@ -63,6 +62,41 @@ static void __cpuinit early_init_intel(s set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC); } + /* + * There is a known erratum on Pentium III and Core Solo + * and Core Duo CPUs. + * " Page with PAT set to WC while associated MTRR is UC + * may consolidate to UC " + * Because of this erratum, it is better to stick with + * setting WC in MTRR rather than using PAT on these CPUs. + * + * Enable PAT WC only on P4, Core 2 or later CPUs. + */ + if (c->x86 == 6 && c->x86_model < 15) + clear_cpu_cap(c, X86_FEATURE_PAT); + +#ifdef CONFIG_KMEMCHECK + /* + * P4s have a "fast strings" feature which causes single- + * stepping REP instructions to only generate a #DB on + * cache-line boundaries. + * + * Ingo Molnar reported a Pentium D (model 6) and a Xeon + * (model 2) with the same problem. + */ + if (c->x86 == 15) { + u64 misc_enable; + + rdmsrl(MSR_IA32_MISC_ENABLE, misc_enable); + + if (misc_enable & MSR_IA32_MISC_ENABLE_FAST_STRING) { + printk(KERN_INFO "kmemcheck: Disabling fast string operations\n"); + + misc_enable &= ~MSR_IA32_MISC_ENABLE_FAST_STRING; + wrmsrl(MSR_IA32_MISC_ENABLE, misc_enable); + } + } +#endif } #ifdef CONFIG_X86_32 @@ -135,10 +169,10 @@ static void __cpuinit intel_workarounds( */ if ((c->x86 == 15) && (c->x86_model == 1) && (c->x86_mask == 1)) { rdmsr(MSR_IA32_MISC_ENABLE, lo, hi); - if ((lo & (1<<9)) == 0) { + if ((lo & MSR_IA32_MISC_ENABLE_PREFETCH_DISABLE) == 0) { printk (KERN_INFO "CPU: C0 stepping P4 Xeon detected.\n"); printk (KERN_INFO "CPU: Disabling hardware prefetching (Errata 037)\n"); - lo |= (1<<9); /* Disable hw prefetching */ + lo |= MSR_IA32_MISC_ENABLE_PREFETCH_DISABLE; wrmsr (MSR_IA32_MISC_ENABLE, lo, hi); } } Index: linux-2.6-tip/arch/x86/kernel/cpu/intel_cacheinfo.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/cpu/intel_cacheinfo.c +++ linux-2.6-tip/arch/x86/kernel/cpu/intel_cacheinfo.c @@ -147,10 +147,19 @@ struct _cpuid4_info { union _cpuid4_leaf_ecx ecx; unsigned long size; unsigned long can_disable; - cpumask_t shared_cpu_map; /* future?: only cpus/node is needed */ + DECLARE_BITMAP(shared_cpu_map, NR_CPUS); }; -#ifdef CONFIG_PCI +/* subset of above _cpuid4_info w/o shared_cpu_map */ +struct _cpuid4_info_regs { + union _cpuid4_leaf_eax eax; + union _cpuid4_leaf_ebx ebx; + union _cpuid4_leaf_ecx ecx; + unsigned long size; + unsigned long can_disable; +}; + +#if defined(CONFIG_PCI) && defined(CONFIG_SYSFS) static struct pci_device_id k8_nb_id[] = { { PCI_DEVICE(PCI_VENDOR_ID_AMD, 0x1103) }, { PCI_DEVICE(PCI_VENDOR_ID_AMD, 0x1203) }, @@ -278,7 +287,7 @@ amd_cpuid4(int leaf, union _cpuid4_leaf_ } static void __cpuinit -amd_check_l3_disable(int index, struct _cpuid4_info *this_leaf) +amd_check_l3_disable(int index, struct _cpuid4_info_regs *this_leaf) { if (index < 3) return; @@ -286,7 +295,8 @@ amd_check_l3_disable(int index, struct _ } static int -__cpuinit cpuid4_cache_lookup(int index, struct _cpuid4_info *this_leaf) +__cpuinit cpuid4_cache_lookup_regs(int index, + struct _cpuid4_info_regs *this_leaf) { union _cpuid4_leaf_eax eax; union _cpuid4_leaf_ebx ebx; @@ -353,11 +363,10 @@ unsigned int __cpuinit init_intel_cachei * parameters cpuid leaf to find the cache details */ for (i = 0; i < num_cache_leaves; i++) { - struct _cpuid4_info this_leaf; - + struct _cpuid4_info_regs this_leaf; int retval; - retval = cpuid4_cache_lookup(i, &this_leaf); + retval = cpuid4_cache_lookup_regs(i, &this_leaf); if (retval >= 0) { switch(this_leaf.eax.split.level) { case 1: @@ -490,6 +499,8 @@ unsigned int __cpuinit init_intel_cachei return l2; } +#ifdef CONFIG_SYSFS + /* pointer to _cpuid4_info array (for each cache leaf) */ static DEFINE_PER_CPU(struct _cpuid4_info *, cpuid4_info); #define CPUID4_INFO_IDX(x, y) (&((per_cpu(cpuid4_info, x))[y])) @@ -506,17 +517,20 @@ static void __cpuinit cache_shared_cpu_m num_threads_sharing = 1 + this_leaf->eax.split.num_threads_sharing; if (num_threads_sharing == 1) - cpu_set(cpu, this_leaf->shared_cpu_map); + cpumask_set_cpu(cpu, to_cpumask(this_leaf->shared_cpu_map)); else { index_msb = get_count_order(num_threads_sharing); for_each_online_cpu(i) { if (cpu_data(i).apicid >> index_msb == c->apicid >> index_msb) { - cpu_set(i, this_leaf->shared_cpu_map); + cpumask_set_cpu(i, + to_cpumask(this_leaf->shared_cpu_map)); if (i != cpu && per_cpu(cpuid4_info, i)) { - sibling_leaf = CPUID4_INFO_IDX(i, index); - cpu_set(cpu, sibling_leaf->shared_cpu_map); + sibling_leaf = + CPUID4_INFO_IDX(i, index); + cpumask_set_cpu(cpu, to_cpumask( + sibling_leaf->shared_cpu_map)); } } } @@ -528,9 +542,10 @@ static void __cpuinit cache_remove_share int sibling; this_leaf = CPUID4_INFO_IDX(cpu, index); - for_each_cpu_mask_nr(sibling, this_leaf->shared_cpu_map) { + for_each_cpu(sibling, to_cpumask(this_leaf->shared_cpu_map)) { sibling_leaf = CPUID4_INFO_IDX(sibling, index); - cpu_clear(cpu, sibling_leaf->shared_cpu_map); + cpumask_clear_cpu(cpu, + to_cpumask(sibling_leaf->shared_cpu_map)); } } #else @@ -549,6 +564,15 @@ static void __cpuinit free_cache_attribu per_cpu(cpuid4_info, cpu) = NULL; } +static int +__cpuinit cpuid4_cache_lookup(int index, struct _cpuid4_info *this_leaf) +{ + struct _cpuid4_info_regs *leaf_regs = + (struct _cpuid4_info_regs *)this_leaf; + + return cpuid4_cache_lookup_regs(index, leaf_regs); +} + static void __cpuinit get_cpu_leaves(void *_retval) { int j, *retval = _retval, cpu = smp_processor_id(); @@ -590,8 +614,6 @@ static int __cpuinit detect_cache_attrib return retval; } -#ifdef CONFIG_SYSFS - #include #include @@ -635,8 +657,9 @@ static ssize_t show_shared_cpu_map_func( int n = 0; if (len > 1) { - cpumask_t *mask = &this_leaf->shared_cpu_map; + const struct cpumask *mask; + mask = to_cpumask(this_leaf->shared_cpu_map); n = type? cpulist_scnprintf(buf, len-2, mask) : cpumask_scnprintf(buf, len-2, mask); @@ -699,7 +722,8 @@ static struct pci_dev *get_k8_northbridg static ssize_t show_cache_disable(struct _cpuid4_info *this_leaf, char *buf) { - int node = cpu_to_node(first_cpu(this_leaf->shared_cpu_map)); + const struct cpumask *mask = to_cpumask(this_leaf->shared_cpu_map); + int node = cpu_to_node(cpumask_first(mask)); struct pci_dev *dev = NULL; ssize_t ret = 0; int i; @@ -733,7 +757,8 @@ static ssize_t store_cache_disable(struct _cpuid4_info *this_leaf, const char *buf, size_t count) { - int node = cpu_to_node(first_cpu(this_leaf->shared_cpu_map)); + const struct cpumask *mask = to_cpumask(this_leaf->shared_cpu_map); + int node = cpu_to_node(cpumask_first(mask)); struct pci_dev *dev = NULL; unsigned int ret, index, val; @@ -878,7 +903,7 @@ err_out: return -ENOMEM; } -static cpumask_t cache_dev_map = CPU_MASK_NONE; +static DECLARE_BITMAP(cache_dev_map, NR_CPUS); /* Add/Remove cache interface for CPU device */ static int __cpuinit cache_add_dev(struct sys_device * sys_dev) @@ -918,7 +943,7 @@ static int __cpuinit cache_add_dev(struc } kobject_uevent(&(this_object->kobj), KOBJ_ADD); } - cpu_set(cpu, cache_dev_map); + cpumask_set_cpu(cpu, to_cpumask(cache_dev_map)); kobject_uevent(per_cpu(cache_kobject, cpu), KOBJ_ADD); return 0; @@ -931,9 +956,9 @@ static void __cpuinit cache_remove_dev(s if (per_cpu(cpuid4_info, cpu) == NULL) return; - if (!cpu_isset(cpu, cache_dev_map)) + if (!cpumask_test_cpu(cpu, to_cpumask(cache_dev_map))) return; - cpu_clear(cpu, cache_dev_map); + cpumask_clear_cpu(cpu, to_cpumask(cache_dev_map)); for (i = 0; i < num_cache_leaves; i++) kobject_put(&(INDEX_KOBJECT_PTR(cpu,i)->kobj)); Index: linux-2.6-tip/arch/x86/kernel/cpu/mcheck/mce_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/cpu/mcheck/mce_32.c +++ linux-2.6-tip/arch/x86/kernel/cpu/mcheck/mce_32.c @@ -60,20 +60,6 @@ void mcheck_init(struct cpuinfo_x86 *c) } } -static unsigned long old_cr4 __initdata; - -void __init stop_mce(void) -{ - old_cr4 = read_cr4(); - clear_in_cr4(X86_CR4_MCE); -} - -void __init restart_mce(void) -{ - if (old_cr4 & X86_CR4_MCE) - set_in_cr4(X86_CR4_MCE); -} - static int __init mcheck_disable(char *str) { mce_disabled = 1; Index: linux-2.6-tip/arch/x86/kernel/cpu/mcheck/mce_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/cpu/mcheck/mce_64.c +++ linux-2.6-tip/arch/x86/kernel/cpu/mcheck/mce_64.c @@ -151,6 +151,8 @@ static void mce_panic(char *msg, struct static int mce_available(struct cpuinfo_x86 *c) { + if (mce_dont_init) + return 0; return cpu_has(c, X86_FEATURE_MCE) && cpu_has(c, X86_FEATURE_MCA); } @@ -353,18 +355,17 @@ void mce_log_therm_throt_event(unsigned static int check_interval = 5 * 60; /* 5 minutes */ static int next_interval; /* in jiffies */ -static void mcheck_timer(struct work_struct *work); -static DECLARE_DELAYED_WORK(mcheck_work, mcheck_timer); +static void mcheck_timer(unsigned long); +static DEFINE_PER_CPU(struct timer_list, mce_timer); -static void mcheck_check_cpu(void *info) +static void mcheck_timer(unsigned long data) { + struct timer_list *t = &per_cpu(mce_timer, data); + + WARN_ON(smp_processor_id() != data); + if (mce_available(¤t_cpu_data)) do_machine_check(NULL, 0); -} - -static void mcheck_timer(struct work_struct *work) -{ - on_each_cpu(mcheck_check_cpu, NULL, 1); /* * Alert userspace if needed. If we logged an MCE, reduce the @@ -377,14 +378,21 @@ static void mcheck_timer(struct work_str (int)round_jiffies_relative(check_interval*HZ)); } - schedule_delayed_work(&mcheck_work, next_interval); + t->expires = jiffies + next_interval; + add_timer(t); } +static void mce_do_trigger(struct work_struct *work) +{ + call_usermodehelper(trigger, trigger_argv, NULL, UMH_NO_WAIT); +} + +static DECLARE_WORK(mce_trigger_work, mce_do_trigger); + /* - * This is only called from process context. This is where we do - * anything we need to alert userspace about new MCEs. This is called - * directly from the poller and also from entry.S and idle, thanks to - * TIF_MCE_NOTIFY. + * Notify the user(s) about new machine check events. + * Can be called from interrupt context, but not from machine check/NMI + * context. */ int mce_notify_user(void) { @@ -394,9 +402,14 @@ int mce_notify_user(void) unsigned long now = jiffies; wake_up_interruptible(&mce_wait); - if (trigger[0]) - call_usermodehelper(trigger, trigger_argv, NULL, - UMH_NO_WAIT); + + /* + * There is no risk of missing notifications because + * work_pending is always cleared before the function is + * executed. + */ + if (trigger[0] && !work_pending(&mce_trigger_work)) + schedule_work(&mce_trigger_work); if (time_after_eq(now, last_print + (check_interval*HZ))) { last_print = now; @@ -425,16 +438,11 @@ static struct notifier_block mce_idle_no static __init int periodic_mcheck_init(void) { - next_interval = check_interval * HZ; - if (next_interval) - schedule_delayed_work(&mcheck_work, - round_jiffies_relative(next_interval)); - idle_notifier_register(&mce_idle_notifier); - return 0; + idle_notifier_register(&mce_idle_notifier); + return 0; } __initcall(periodic_mcheck_init); - /* * Initialize Machine Checks for a CPU. */ @@ -504,6 +512,20 @@ static void mce_cpu_features(struct cpui } } +static void mce_init_timer(void) +{ + struct timer_list *t = &__get_cpu_var(mce_timer); + + /* data race harmless because everyone sets to the same value */ + if (!next_interval) + next_interval = check_interval * HZ; + if (!next_interval) + return; + setup_timer(t, mcheck_timer, smp_processor_id()); + t->expires = round_jiffies_relative(jiffies + next_interval); + add_timer(t); +} + /* * Called for each booted CPU to set up machine checks. * Must be called with preempt off. @@ -512,12 +534,12 @@ void __cpuinit mcheck_init(struct cpuinf { mce_cpu_quirks(c); - if (mce_dont_init || - !mce_available(c)) + if (!mce_available(c)) return; mce_init(NULL); mce_cpu_features(c); + mce_init_timer(); } /* @@ -573,7 +595,7 @@ static ssize_t mce_read(struct file *fil { unsigned long *cpu_tsc; static DEFINE_MUTEX(mce_read_mutex); - unsigned next; + unsigned prev, next; char __user *buf = ubuf; int i, err; @@ -592,25 +614,32 @@ static ssize_t mce_read(struct file *fil } err = 0; - for (i = 0; i < next; i++) { - unsigned long start = jiffies; - - while (!mcelog.entry[i].finished) { - if (time_after_eq(jiffies, start + 2)) { - memset(mcelog.entry + i,0, sizeof(struct mce)); - goto timeout; + prev = 0; + do { + for (i = prev; i < next; i++) { + unsigned long start = jiffies; + + while (!mcelog.entry[i].finished) { + if (time_after_eq(jiffies, start + 2)) { + memset(mcelog.entry + i, 0, + sizeof(struct mce)); + goto timeout; + } + cpu_relax(); } - cpu_relax(); + smp_rmb(); + err |= copy_to_user(buf, mcelog.entry + i, + sizeof(struct mce)); + buf += sizeof(struct mce); +timeout: + ; } - smp_rmb(); - err |= copy_to_user(buf, mcelog.entry + i, sizeof(struct mce)); - buf += sizeof(struct mce); - timeout: - ; - } - memset(mcelog.entry, 0, next * sizeof(struct mce)); - mcelog.next = 0; + memset(mcelog.entry + prev, 0, + (next - prev) * sizeof(struct mce)); + prev = next; + next = cmpxchg(&mcelog.next, prev, 0); + } while (next != prev); synchronize_sched(); @@ -680,20 +709,6 @@ static struct miscdevice mce_log_device &mce_chrdev_ops, }; -static unsigned long old_cr4 __initdata; - -void __init stop_mce(void) -{ - old_cr4 = read_cr4(); - clear_in_cr4(X86_CR4_MCE); -} - -void __init restart_mce(void) -{ - if (old_cr4 & X86_CR4_MCE) - set_in_cr4(X86_CR4_MCE); -} - /* * Old style boot options parsing. Only for compatibility. */ @@ -703,8 +718,7 @@ static int __init mcheck_disable(char *s return 1; } -/* mce=off disables machine check. Note you can re-enable it later - using sysfs. +/* mce=off disables machine check. mce=TOLERANCELEVEL (number, see above) mce=bootlog Log MCEs from before booting. Disabled by default on AMD. mce=nobootlog Don't log MCEs from before booting. */ @@ -728,6 +742,29 @@ __setup("mce=", mcheck_enable); * Sysfs support */ +/* + * Disable machine checks on suspend and shutdown. We can't really handle + * them later. + */ +static int mce_disable(void) +{ + int i; + + for (i = 0; i < banks; i++) + wrmsrl(MSR_IA32_MC0_CTL + i*4, 0); + return 0; +} + +static int mce_suspend(struct sys_device *dev, pm_message_t state) +{ + return mce_disable(); +} + +static int mce_shutdown(struct sys_device *dev) +{ + return mce_disable(); +} + /* On resume clear all MCE state. Don't want to see leftovers from the BIOS. Only one CPU is active at this time, the others get readded later using CPU hotplug. */ @@ -738,20 +775,24 @@ static int mce_resume(struct sys_device return 0; } +static void mce_cpu_restart(void *data) +{ + del_timer_sync(&__get_cpu_var(mce_timer)); + if (mce_available(¤t_cpu_data)) + mce_init(NULL); + mce_init_timer(); +} + /* Reinit MCEs after user configuration changes */ static void mce_restart(void) { - if (next_interval) - cancel_delayed_work(&mcheck_work); - /* Timer race is harmless here */ - on_each_cpu(mce_init, NULL, 1); next_interval = check_interval * HZ; - if (next_interval) - schedule_delayed_work(&mcheck_work, - round_jiffies_relative(next_interval)); + on_each_cpu(mce_cpu_restart, NULL, 1); } static struct sysdev_class mce_sysclass = { + .suspend = mce_suspend, + .shutdown = mce_shutdown, .resume = mce_resume, .name = "machinecheck", }; @@ -872,11 +913,33 @@ static __cpuinit void mce_remove_device( cpu_clear(cpu, mce_device_initialized); } +/* Make sure there are no machine checks on offlined CPUs. */ +static void __cpuexit mce_disable_cpu(void *h) +{ + int i; + + if (!mce_available(¤t_cpu_data)) + return; + for (i = 0; i < banks; i++) + wrmsrl(MSR_IA32_MC0_CTL + i*4, 0); +} + +static void __cpuexit mce_reenable_cpu(void *h) +{ + int i; + + if (!mce_available(¤t_cpu_data)) + return; + for (i = 0; i < banks; i++) + wrmsrl(MSR_IA32_MC0_CTL + i*4, bank[i]); +} + /* Get notified when a cpu comes on/off. Be hotplug friendly. */ static int __cpuinit mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) { unsigned int cpu = (unsigned long)hcpu; + struct timer_list *t = &per_cpu(mce_timer, cpu); switch (action) { case CPU_ONLINE: @@ -891,6 +954,17 @@ static int __cpuinit mce_cpu_callback(st threshold_cpu_callback(action, cpu); mce_remove_device(cpu); break; + case CPU_DOWN_PREPARE: + case CPU_DOWN_PREPARE_FROZEN: + del_timer_sync(t); + smp_call_function_single(cpu, mce_disable_cpu, NULL, 1); + break; + case CPU_DOWN_FAILED: + case CPU_DOWN_FAILED_FROZEN: + t->expires = round_jiffies_relative(jiffies + next_interval); + add_timer_on(t, cpu); + smp_call_function_single(cpu, mce_reenable_cpu, NULL, 1); + break; } return NOTIFY_OK; } Index: linux-2.6-tip/arch/x86/kernel/cpu/mcheck/mce_amd_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/cpu/mcheck/mce_amd_64.c +++ linux-2.6-tip/arch/x86/kernel/cpu/mcheck/mce_amd_64.c @@ -67,7 +67,7 @@ static struct threshold_block threshold_ struct threshold_bank { struct kobject *kobj; struct threshold_block *blocks; - cpumask_t cpus; + cpumask_var_t cpus; }; static DEFINE_PER_CPU(struct threshold_bank *, threshold_banks[NR_BANKS]); @@ -481,7 +481,7 @@ static __cpuinit int threshold_create_ba #ifdef CONFIG_SMP if (cpu_data(cpu).cpu_core_id && shared_bank[bank]) { /* symlink */ - i = first_cpu(per_cpu(cpu_core_map, cpu)); + i = cpumask_first(&per_cpu(cpu_core_map, cpu)); /* first core not up yet */ if (cpu_data(i).cpu_core_id) @@ -501,7 +501,7 @@ static __cpuinit int threshold_create_ba if (err) goto out; - b->cpus = per_cpu(cpu_core_map, cpu); + cpumask_copy(b->cpus, &per_cpu(cpu_core_map, cpu)); per_cpu(threshold_banks, cpu)[bank] = b; goto out; } @@ -512,15 +512,20 @@ static __cpuinit int threshold_create_ba err = -ENOMEM; goto out; } + if (!alloc_cpumask_var(&b->cpus, GFP_KERNEL)) { + kfree(b); + err = -ENOMEM; + goto out; + } b->kobj = kobject_create_and_add(name, &per_cpu(device_mce, cpu).kobj); if (!b->kobj) goto out_free; #ifndef CONFIG_SMP - b->cpus = CPU_MASK_ALL; + cpumask_setall(b->cpus); #else - b->cpus = per_cpu(cpu_core_map, cpu); + cpumask_copy(b->cpus, &per_cpu(cpu_core_map, cpu)); #endif per_cpu(threshold_banks, cpu)[bank] = b; @@ -529,7 +534,7 @@ static __cpuinit int threshold_create_ba if (err) goto out_free; - for_each_cpu_mask_nr(i, b->cpus) { + for_each_cpu(i, b->cpus) { if (i == cpu) continue; @@ -545,6 +550,7 @@ static __cpuinit int threshold_create_ba out_free: per_cpu(threshold_banks, cpu)[bank] = NULL; + free_cpumask_var(b->cpus); kfree(b); out: return err; @@ -619,7 +625,7 @@ static void threshold_remove_bank(unsign #endif /* remove all sibling symlinks before unregistering */ - for_each_cpu_mask_nr(i, b->cpus) { + for_each_cpu(i, b->cpus) { if (i == cpu) continue; @@ -632,6 +638,7 @@ static void threshold_remove_bank(unsign free_out: kobject_del(b->kobj); kobject_put(b->kobj); + free_cpumask_var(b->cpus); kfree(b); per_cpu(threshold_banks, cpu)[bank] = NULL; } Index: linux-2.6-tip/arch/x86/kernel/cpu/mcheck/mce_intel_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/cpu/mcheck/mce_intel_64.c +++ linux-2.6-tip/arch/x86/kernel/cpu/mcheck/mce_intel_64.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -48,13 +49,13 @@ static void intel_init_thermal(struct cp */ rdmsr(MSR_IA32_MISC_ENABLE, l, h); h = apic_read(APIC_LVTTHMR); - if ((l & (1 << 3)) && (h & APIC_DM_SMI)) { + if ((l & MSR_IA32_MISC_ENABLE_TM1) && (h & APIC_DM_SMI)) { printk(KERN_DEBUG "CPU%d: Thermal monitoring handled by SMI\n", cpu); return; } - if (cpu_has(c, X86_FEATURE_TM2) && (l & (1 << 13))) + if (cpu_has(c, X86_FEATURE_TM2) && (l & MSR_IA32_MISC_ENABLE_TM2)) tm2 = 1; if (h & APIC_VECTOR_MASK) { @@ -72,7 +73,7 @@ static void intel_init_thermal(struct cp wrmsr(MSR_IA32_THERM_INTERRUPT, l | 0x03, h); rdmsr(MSR_IA32_MISC_ENABLE, l, h); - wrmsr(MSR_IA32_MISC_ENABLE, l | (1 << 3), h); + wrmsr(MSR_IA32_MISC_ENABLE, l | MSR_IA32_MISC_ENABLE_TM1, h); l = apic_read(APIC_LVTTHMR); apic_write(APIC_LVTTHMR, l & ~APIC_LVT_MASKED); Index: linux-2.6-tip/arch/x86/kernel/cpu/mcheck/p4.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/cpu/mcheck/p4.c +++ linux-2.6-tip/arch/x86/kernel/cpu/mcheck/p4.c @@ -85,7 +85,7 @@ static void intel_init_thermal(struct cp */ rdmsr(MSR_IA32_MISC_ENABLE, l, h); h = apic_read(APIC_LVTTHMR); - if ((l & (1<<3)) && (h & APIC_DM_SMI)) { + if ((l & MSR_IA32_MISC_ENABLE_TM1) && (h & APIC_DM_SMI)) { printk(KERN_DEBUG "CPU%d: Thermal monitoring handled by SMI\n", cpu); return; /* -EBUSY */ @@ -111,7 +111,7 @@ static void intel_init_thermal(struct cp vendor_thermal_interrupt = intel_thermal_interrupt; rdmsr(MSR_IA32_MISC_ENABLE, l, h); - wrmsr(MSR_IA32_MISC_ENABLE, l | (1<<3), h); + wrmsr(MSR_IA32_MISC_ENABLE, l | MSR_IA32_MISC_ENABLE_TM1, h); l = apic_read(APIC_LVTTHMR); apic_write(APIC_LVTTHMR, l & ~APIC_LVT_MASKED); Index: linux-2.6-tip/arch/x86/kernel/cpu/perf_counter.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/cpu/perf_counter.c @@ -0,0 +1,733 @@ +/* + * Performance counter x86 architecture code + * + * Copyright(C) 2008 Thomas Gleixner + * Copyright(C) 2008 Red Hat, Inc., Ingo Molnar + * + * For licencing details see kernel-base/COPYING + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +static bool perf_counters_initialized __read_mostly; + +/* + * Number of (generic) HW counters: + */ +static int nr_counters_generic __read_mostly; +static u64 perf_counter_mask __read_mostly; +static u64 counter_value_mask __read_mostly; + +static int nr_counters_fixed __read_mostly; + +struct cpu_hw_counters { + struct perf_counter *counters[X86_PMC_IDX_MAX]; + unsigned long used[BITS_TO_LONGS(X86_PMC_IDX_MAX)]; + unsigned long interrupts; + u64 global_enable; +}; + +/* + * Intel PerfMon v3. Used on Core2 and later. + */ +static DEFINE_PER_CPU(struct cpu_hw_counters, cpu_hw_counters); + +static const int intel_perfmon_event_map[] = +{ + [PERF_COUNT_CPU_CYCLES] = 0x003c, + [PERF_COUNT_INSTRUCTIONS] = 0x00c0, + [PERF_COUNT_CACHE_REFERENCES] = 0x4f2e, + [PERF_COUNT_CACHE_MISSES] = 0x412e, + [PERF_COUNT_BRANCH_INSTRUCTIONS] = 0x00c4, + [PERF_COUNT_BRANCH_MISSES] = 0x00c5, + [PERF_COUNT_BUS_CYCLES] = 0x013c, +}; + +static const int max_intel_perfmon_events = ARRAY_SIZE(intel_perfmon_event_map); + +/* + * Propagate counter elapsed time into the generic counter. + * Can only be executed on the CPU where the counter is active. + * Returns the delta events processed. + */ +static void +x86_perf_counter_update(struct perf_counter *counter, + struct hw_perf_counter *hwc, int idx) +{ + u64 prev_raw_count, new_raw_count, delta; + + /* + * Careful: an NMI might modify the previous counter value. + * + * Our tactic to handle this is to first atomically read and + * exchange a new raw count - then add that new-prev delta + * count to the generic counter atomically: + */ +again: + prev_raw_count = atomic64_read(&hwc->prev_count); + rdmsrl(hwc->counter_base + idx, new_raw_count); + + if (atomic64_cmpxchg(&hwc->prev_count, prev_raw_count, + new_raw_count) != prev_raw_count) + goto again; + + /* + * Now we have the new raw value and have updated the prev + * timestamp already. We can now calculate the elapsed delta + * (counter-)time and add that to the generic counter. + * + * Careful, not all hw sign-extends above the physical width + * of the count, so we do that by clipping the delta to 32 bits: + */ + delta = (u64)(u32)((s32)new_raw_count - (s32)prev_raw_count); + + atomic64_add(delta, &counter->count); + atomic64_sub(delta, &hwc->period_left); +} + +/* + * Setup the hardware configuration for a given hw_event_type + */ +static int __hw_perf_counter_init(struct perf_counter *counter) +{ + struct perf_counter_hw_event *hw_event = &counter->hw_event; + struct hw_perf_counter *hwc = &counter->hw; + + if (unlikely(!perf_counters_initialized)) + return -EINVAL; + + /* + * Generate PMC IRQs: + * (keep 'enabled' bit clear for now) + */ + hwc->config = ARCH_PERFMON_EVENTSEL_INT; + + /* + * Count user and OS events unless requested not to. + */ + if (!hw_event->exclude_user) + hwc->config |= ARCH_PERFMON_EVENTSEL_USR; + if (!hw_event->exclude_kernel) + hwc->config |= ARCH_PERFMON_EVENTSEL_OS; + + /* + * If privileged enough, allow NMI events: + */ + hwc->nmi = 0; + if (capable(CAP_SYS_ADMIN) && hw_event->nmi) + hwc->nmi = 1; + + hwc->irq_period = hw_event->irq_period; + /* + * Intel PMCs cannot be accessed sanely above 32 bit width, + * so we install an artificial 1<<31 period regardless of + * the generic counter period: + */ + if ((s64)hwc->irq_period <= 0 || hwc->irq_period > 0x7FFFFFFF) + hwc->irq_period = 0x7FFFFFFF; + + atomic64_set(&hwc->period_left, hwc->irq_period); + + /* + * Raw event type provide the config in the event structure + */ + if (hw_event->raw) { + hwc->config |= hw_event->type; + } else { + if (hw_event->type >= max_intel_perfmon_events) + return -EINVAL; + /* + * The generic map: + */ + hwc->config |= intel_perfmon_event_map[hw_event->type]; + } + counter->wakeup_pending = 0; + + return 0; +} + +u64 hw_perf_save_disable(void) +{ + u64 ctrl; + + if (unlikely(!perf_counters_initialized)) + return 0; + + rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, ctrl); + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0); + + return ctrl; +} +EXPORT_SYMBOL_GPL(hw_perf_save_disable); + +void hw_perf_restore(u64 ctrl) +{ + if (unlikely(!perf_counters_initialized)) + return; + + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, ctrl); +} +EXPORT_SYMBOL_GPL(hw_perf_restore); + +static inline void +__pmc_fixed_disable(struct perf_counter *counter, + struct hw_perf_counter *hwc, unsigned int __idx) +{ + int idx = __idx - X86_PMC_IDX_FIXED; + u64 ctrl_val, mask; + int err; + + mask = 0xfULL << (idx * 4); + + rdmsrl(hwc->config_base, ctrl_val); + ctrl_val &= ~mask; + err = checking_wrmsrl(hwc->config_base, ctrl_val); +} + +static inline void +__pmc_generic_disable(struct perf_counter *counter, + struct hw_perf_counter *hwc, unsigned int idx) +{ + if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) + __pmc_fixed_disable(counter, hwc, idx); + else + wrmsr_safe(hwc->config_base + idx, hwc->config, 0); +} + +static DEFINE_PER_CPU(u64, prev_left[X86_PMC_IDX_MAX]); + +/* + * Set the next IRQ period, based on the hwc->period_left value. + * To be called with the counter disabled in hw: + */ +static void +__hw_perf_counter_set_period(struct perf_counter *counter, + struct hw_perf_counter *hwc, int idx) +{ + s64 left = atomic64_read(&hwc->period_left); + s32 period = hwc->irq_period; + int err; + + /* + * If we are way outside a reasoable range then just skip forward: + */ + if (unlikely(left <= -period)) { + left = period; + atomic64_set(&hwc->period_left, left); + } + + if (unlikely(left <= 0)) { + left += period; + atomic64_set(&hwc->period_left, left); + } + + per_cpu(prev_left[idx], smp_processor_id()) = left; + + /* + * The hw counter starts counting from this counter offset, + * mark it to be able to extra future deltas: + */ + atomic64_set(&hwc->prev_count, (u64)-left); + + err = checking_wrmsrl(hwc->counter_base + idx, + (u64)(-left) & counter_value_mask); +} + +static inline void +__pmc_fixed_enable(struct perf_counter *counter, + struct hw_perf_counter *hwc, unsigned int __idx) +{ + int idx = __idx - X86_PMC_IDX_FIXED; + u64 ctrl_val, bits, mask; + int err; + + /* + * Enable IRQ generation (0x8), + * and enable ring-3 counting (0x2) and ring-0 counting (0x1) + * if requested: + */ + bits = 0x8ULL; + if (hwc->config & ARCH_PERFMON_EVENTSEL_USR) + bits |= 0x2; + if (hwc->config & ARCH_PERFMON_EVENTSEL_OS) + bits |= 0x1; + bits <<= (idx * 4); + mask = 0xfULL << (idx * 4); + + rdmsrl(hwc->config_base, ctrl_val); + ctrl_val &= ~mask; + ctrl_val |= bits; + err = checking_wrmsrl(hwc->config_base, ctrl_val); +} + +static void +__pmc_generic_enable(struct perf_counter *counter, + struct hw_perf_counter *hwc, int idx) +{ + if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) + __pmc_fixed_enable(counter, hwc, idx); + else + wrmsr(hwc->config_base + idx, + hwc->config | ARCH_PERFMON_EVENTSEL0_ENABLE, 0); +} + +static int +fixed_mode_idx(struct perf_counter *counter, struct hw_perf_counter *hwc) +{ + unsigned int event; + + if (unlikely(hwc->nmi)) + return -1; + + event = hwc->config & ARCH_PERFMON_EVENT_MASK; + + if (unlikely(event == intel_perfmon_event_map[PERF_COUNT_INSTRUCTIONS])) + return X86_PMC_IDX_FIXED_INSTRUCTIONS; + if (unlikely(event == intel_perfmon_event_map[PERF_COUNT_CPU_CYCLES])) + return X86_PMC_IDX_FIXED_CPU_CYCLES; + if (unlikely(event == intel_perfmon_event_map[PERF_COUNT_BUS_CYCLES])) + return X86_PMC_IDX_FIXED_BUS_CYCLES; + + return -1; +} + +/* + * Find a PMC slot for the freshly enabled / scheduled in counter: + */ +static int pmc_generic_enable(struct perf_counter *counter) +{ + struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters); + struct hw_perf_counter *hwc = &counter->hw; + int idx; + + idx = fixed_mode_idx(counter, hwc); + if (idx >= 0) { + /* + * Try to get the fixed counter, if that is already taken + * then try to get a generic counter: + */ + if (test_and_set_bit(idx, cpuc->used)) + goto try_generic; + + hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL; + /* + * We set it so that counter_base + idx in wrmsr/rdmsr maps to + * MSR_ARCH_PERFMON_FIXED_CTR0 ... CTR2: + */ + hwc->counter_base = + MSR_ARCH_PERFMON_FIXED_CTR0 - X86_PMC_IDX_FIXED; + hwc->idx = idx; + } else { + idx = hwc->idx; + /* Try to get the previous generic counter again */ + if (test_and_set_bit(idx, cpuc->used)) { +try_generic: + idx = find_first_zero_bit(cpuc->used, nr_counters_generic); + if (idx == nr_counters_generic) + return -EAGAIN; + + set_bit(idx, cpuc->used); + hwc->idx = idx; + } + hwc->config_base = MSR_ARCH_PERFMON_EVENTSEL0; + hwc->counter_base = MSR_ARCH_PERFMON_PERFCTR0; + } + + perf_counters_lapic_init(hwc->nmi); + + __pmc_generic_disable(counter, hwc, idx); + + cpuc->counters[idx] = counter; + /* + * Make it visible before enabling the hw: + */ + smp_wmb(); + + __hw_perf_counter_set_period(counter, hwc, idx); + __pmc_generic_enable(counter, hwc, idx); + + return 0; +} + +void perf_counter_print_debug(void) +{ + u64 ctrl, status, overflow, pmc_ctrl, pmc_count, prev_left, fixed; + struct cpu_hw_counters *cpuc; + int cpu, idx; + + if (!nr_counters_generic) + return; + + local_irq_disable(); + + cpu = smp_processor_id(); + cpuc = &per_cpu(cpu_hw_counters, cpu); + + rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, ctrl); + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status); + rdmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, overflow); + rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, fixed); + + printk(KERN_INFO "\n"); + printk(KERN_INFO "CPU#%d: ctrl: %016llx\n", cpu, ctrl); + printk(KERN_INFO "CPU#%d: status: %016llx\n", cpu, status); + printk(KERN_INFO "CPU#%d: overflow: %016llx\n", cpu, overflow); + printk(KERN_INFO "CPU#%d: fixed: %016llx\n", cpu, fixed); + printk(KERN_INFO "CPU#%d: used: %016llx\n", cpu, *(u64 *)cpuc->used); + + for (idx = 0; idx < nr_counters_generic; idx++) { + rdmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + idx, pmc_ctrl); + rdmsrl(MSR_ARCH_PERFMON_PERFCTR0 + idx, pmc_count); + + prev_left = per_cpu(prev_left[idx], cpu); + + printk(KERN_INFO "CPU#%d: gen-PMC%d ctrl: %016llx\n", + cpu, idx, pmc_ctrl); + printk(KERN_INFO "CPU#%d: gen-PMC%d count: %016llx\n", + cpu, idx, pmc_count); + printk(KERN_INFO "CPU#%d: gen-PMC%d left: %016llx\n", + cpu, idx, prev_left); + } + for (idx = 0; idx < nr_counters_fixed; idx++) { + rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 + idx, pmc_count); + + printk(KERN_INFO "CPU#%d: fixed-PMC%d count: %016llx\n", + cpu, idx, pmc_count); + } + local_irq_enable(); +} + +static void pmc_generic_disable(struct perf_counter *counter) +{ + struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters); + struct hw_perf_counter *hwc = &counter->hw; + unsigned int idx = hwc->idx; + + __pmc_generic_disable(counter, hwc, idx); + + clear_bit(idx, cpuc->used); + cpuc->counters[idx] = NULL; + /* + * Make sure the cleared pointer becomes visible before we + * (potentially) free the counter: + */ + smp_wmb(); + + /* + * Drain the remaining delta count out of a counter + * that we are disabling: + */ + x86_perf_counter_update(counter, hwc, idx); +} + +static void perf_store_irq_data(struct perf_counter *counter, u64 data) +{ + struct perf_data *irqdata = counter->irqdata; + + if (irqdata->len > PERF_DATA_BUFLEN - sizeof(u64)) { + irqdata->overrun++; + } else { + u64 *p = (u64 *) &irqdata->data[irqdata->len]; + + *p = data; + irqdata->len += sizeof(u64); + } +} + +/* + * Save and restart an expired counter. Called by NMI contexts, + * so it has to be careful about preempting normal counter ops: + */ +static void perf_save_and_restart(struct perf_counter *counter) +{ + struct hw_perf_counter *hwc = &counter->hw; + int idx = hwc->idx; + + x86_perf_counter_update(counter, hwc, idx); + __hw_perf_counter_set_period(counter, hwc, idx); + + if (counter->state == PERF_COUNTER_STATE_ACTIVE) + __pmc_generic_enable(counter, hwc, idx); +} + +static void +perf_handle_group(struct perf_counter *sibling, u64 *status, u64 *overflown) +{ + struct perf_counter *counter, *group_leader = sibling->group_leader; + + /* + * Store sibling timestamps (if any): + */ + list_for_each_entry(counter, &group_leader->sibling_list, list_entry) { + + x86_perf_counter_update(counter, &counter->hw, counter->hw.idx); + perf_store_irq_data(sibling, counter->hw_event.type); + perf_store_irq_data(sibling, atomic64_read(&counter->count)); + } +} + +/* + * Maximum interrupt frequency of 100KHz per CPU + */ +#define PERFMON_MAX_INTERRUPTS 100000/HZ + +/* + * This handler is triggered by the local APIC, so the APIC IRQ handling + * rules apply: + */ +static void __smp_perf_counter_interrupt(struct pt_regs *regs, int nmi) +{ + int bit, cpu = smp_processor_id(); + u64 ack, status; + struct cpu_hw_counters *cpuc = &per_cpu(cpu_hw_counters, cpu); + + rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cpuc->global_enable); + + /* Disable counters globally */ + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0); + ack_APIC_irq(); + + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status); + if (!status) + goto out; + +again: + inc_irq_stat(apic_perf_irqs); + ack = status; + for_each_bit(bit, (unsigned long *)&status, X86_PMC_IDX_MAX) { + struct perf_counter *counter = cpuc->counters[bit]; + + clear_bit(bit, (unsigned long *) &status); + if (!counter) + continue; + + perf_save_and_restart(counter); + + switch (counter->hw_event.record_type) { + case PERF_RECORD_SIMPLE: + continue; + case PERF_RECORD_IRQ: + perf_store_irq_data(counter, instruction_pointer(regs)); + break; + case PERF_RECORD_GROUP: + perf_handle_group(counter, &status, &ack); + break; + } + /* + * From NMI context we cannot call into the scheduler to + * do a task wakeup - but we mark these generic as + * wakeup_pending and initate a wakeup callback: + */ + if (nmi) { + counter->wakeup_pending = 1; + set_tsk_thread_flag(current, TIF_PERF_COUNTERS); + } else { + wake_up(&counter->waitq); + } + } + + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, ack); + + /* + * Repeat if there is more work to be done: + */ + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status); + if (status) + goto again; +out: + /* + * Restore - do not reenable when global enable is off or throttled: + */ + if (++cpuc->interrupts < PERFMON_MAX_INTERRUPTS) + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cpuc->global_enable); +} + +void perf_counter_unthrottle(void) +{ + struct cpu_hw_counters *cpuc; + u64 global_enable; + + if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON)) + return; + + if (unlikely(!perf_counters_initialized)) + return; + + cpuc = &per_cpu(cpu_hw_counters, smp_processor_id()); + if (cpuc->interrupts >= PERFMON_MAX_INTERRUPTS) { + if (printk_ratelimit()) + printk(KERN_WARNING "PERFMON: max interrupts exceeded!\n"); + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cpuc->global_enable); + } + rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, global_enable); + if (unlikely(cpuc->global_enable && !global_enable)) + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cpuc->global_enable); + cpuc->interrupts = 0; +} + +void smp_perf_counter_interrupt(struct pt_regs *regs) +{ + irq_enter(); + apic_write(APIC_LVTPC, LOCAL_PERF_VECTOR); + __smp_perf_counter_interrupt(regs, 0); + + irq_exit(); +} + +/* + * This handler is triggered by NMI contexts: + */ +void perf_counter_notify(struct pt_regs *regs) +{ + struct cpu_hw_counters *cpuc; + unsigned long flags; + int bit, cpu; + + local_irq_save(flags); + cpu = smp_processor_id(); + cpuc = &per_cpu(cpu_hw_counters, cpu); + + for_each_bit(bit, cpuc->used, X86_PMC_IDX_MAX) { + struct perf_counter *counter = cpuc->counters[bit]; + + if (!counter) + continue; + + if (counter->wakeup_pending) { + counter->wakeup_pending = 0; + wake_up(&counter->waitq); + } + } + + local_irq_restore(flags); +} + +void perf_counters_lapic_init(int nmi) +{ + u32 apic_val; + + if (!perf_counters_initialized) + return; + /* + * Enable the performance counter vector in the APIC LVT: + */ + apic_val = apic_read(APIC_LVTERR); + + apic_write(APIC_LVTERR, apic_val | APIC_LVT_MASKED); + if (nmi) + apic_write(APIC_LVTPC, APIC_DM_NMI); + else + apic_write(APIC_LVTPC, LOCAL_PERF_VECTOR); + apic_write(APIC_LVTERR, apic_val); +} + +static int __kprobes +perf_counter_nmi_handler(struct notifier_block *self, + unsigned long cmd, void *__args) +{ + struct die_args *args = __args; + struct pt_regs *regs; + + if (likely(cmd != DIE_NMI_IPI)) + return NOTIFY_DONE; + + regs = args->regs; + + apic_write(APIC_LVTPC, APIC_DM_NMI); + __smp_perf_counter_interrupt(regs, 1); + + return NOTIFY_STOP; +} + +static __read_mostly struct notifier_block perf_counter_nmi_notifier = { + .notifier_call = perf_counter_nmi_handler, + .next = NULL, + .priority = 1 +}; + +void __init init_hw_perf_counters(void) +{ + union cpuid10_eax eax; + unsigned int ebx; + unsigned int unused; + union cpuid10_edx edx; + + if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON)) + return; + + /* + * Check whether the Architectural PerfMon supports + * Branch Misses Retired Event or not. + */ + cpuid(10, &eax.full, &ebx, &unused, &edx.full); + if (eax.split.mask_length <= ARCH_PERFMON_BRANCH_MISSES_RETIRED) + return; + + printk(KERN_INFO "Intel Performance Monitoring support detected.\n"); + + printk(KERN_INFO "... version: %d\n", eax.split.version_id); + printk(KERN_INFO "... num counters: %d\n", eax.split.num_counters); + nr_counters_generic = eax.split.num_counters; + if (nr_counters_generic > X86_PMC_MAX_GENERIC) { + nr_counters_generic = X86_PMC_MAX_GENERIC; + WARN(1, KERN_ERR "hw perf counters %d > max(%d), clipping!", + nr_counters_generic, X86_PMC_MAX_GENERIC); + } + perf_counter_mask = (1 << nr_counters_generic) - 1; + perf_max_counters = nr_counters_generic; + + printk(KERN_INFO "... bit width: %d\n", eax.split.bit_width); + counter_value_mask = (1ULL << eax.split.bit_width) - 1; + printk(KERN_INFO "... value mask: %016Lx\n", counter_value_mask); + + printk(KERN_INFO "... mask length: %d\n", eax.split.mask_length); + + nr_counters_fixed = edx.split.num_counters_fixed; + if (nr_counters_fixed > X86_PMC_MAX_FIXED) { + nr_counters_fixed = X86_PMC_MAX_FIXED; + WARN(1, KERN_ERR "hw perf counters fixed %d > max(%d), clipping!", + nr_counters_fixed, X86_PMC_MAX_FIXED); + } + printk(KERN_INFO "... fixed counters: %d\n", nr_counters_fixed); + + perf_counter_mask |= ((1LL << nr_counters_fixed)-1) << X86_PMC_IDX_FIXED; + + printk(KERN_INFO "... counter mask: %016Lx\n", perf_counter_mask); + perf_counters_initialized = true; + + perf_counters_lapic_init(0); + register_die_notifier(&perf_counter_nmi_notifier); +} + +static void pmc_generic_read(struct perf_counter *counter) +{ + x86_perf_counter_update(counter, &counter->hw, counter->hw.idx); +} + +static const struct hw_perf_counter_ops x86_perf_counter_ops = { + .enable = pmc_generic_enable, + .disable = pmc_generic_disable, + .read = pmc_generic_read, +}; + +const struct hw_perf_counter_ops * +hw_perf_counter_init(struct perf_counter *counter) +{ + int err; + + err = __hw_perf_counter_init(counter); + if (err) + return NULL; + + return &x86_perf_counter_ops; +} Index: linux-2.6-tip/arch/x86/kernel/cpu/perfctr-watchdog.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/cpu/perfctr-watchdog.c +++ linux-2.6-tip/arch/x86/kernel/cpu/perfctr-watchdog.c @@ -20,7 +20,7 @@ #include #include -#include +#include struct nmi_watchdog_ctlblk { unsigned int cccr_msr; Index: linux-2.6-tip/arch/x86/kernel/crash.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/crash.c +++ linux-2.6-tip/arch/x86/kernel/crash.c @@ -24,12 +24,10 @@ #include #include #include -#include +#include #include #include -#include - #if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC) Index: linux-2.6-tip/arch/x86/kernel/dumpstack.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/dumpstack.c +++ linux-2.6-tip/arch/x86/kernel/dumpstack.c @@ -10,10 +10,12 @@ #include #include #include +#include #include #include #include #include +#include #include @@ -99,7 +101,7 @@ print_context_stack(struct thread_info * frame = frame->next_frame; bp = (unsigned long) frame; } else { - ops->address(data, addr, bp == 0); + ops->address(data, addr, 0); } print_ftrace_graph_addr(addr, data, ops, tinfo, graph); } @@ -186,7 +188,7 @@ void dump_stack(void) } EXPORT_SYMBOL(dump_stack); -static raw_spinlock_t die_lock = __RAW_SPIN_LOCK_UNLOCKED; +static raw_spinlock_t die_lock = RAW_SPIN_LOCK_UNLOCKED(die_lock); static int die_owner = -1; static unsigned int die_nest_count; @@ -195,16 +197,21 @@ unsigned __kprobes long oops_begin(void) int cpu; unsigned long flags; + /* notify the hw-branch tracer so it may disable tracing and + add the last trace to the trace buffer - + the earlier this happens, the more useful the trace. */ + trace_hw_branch_oops(); + oops_enter(); /* racy, but better than risking deadlock. */ raw_local_irq_save(flags); cpu = smp_processor_id(); - if (!__raw_spin_trylock(&die_lock)) { + if (!spin_trylock(&die_lock)) { if (cpu == die_owner) /* nested oops. should stop eventually */; else - __raw_spin_lock(&die_lock); + spin_lock(&die_lock); } die_nest_count++; die_owner = cpu; @@ -224,7 +231,7 @@ void __kprobes oops_end(unsigned long fl die_nest_count--; if (!die_nest_count) /* Nest count reaches zero, release the lock. */ - __raw_spin_unlock(&die_lock); + spin_unlock(&die_lock); raw_local_irq_restore(flags); oops_exit(); Index: linux-2.6-tip/arch/x86/kernel/dumpstack_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/dumpstack_64.c +++ linux-2.6-tip/arch/x86/kernel/dumpstack_64.c @@ -23,10 +23,14 @@ static unsigned long *in_exception_stack unsigned *usedp, char **idp) { static char ids[][8] = { +#if DEBUG_STACK > 0 [DEBUG_STACK - 1] = "#DB", +#endif [NMI_STACK - 1] = "NMI", [DOUBLEFAULT_STACK - 1] = "#DF", +#if STACKFAULT_STACK > 0 [STACKFAULT_STACK - 1] = "#SS", +#endif [MCE_STACK - 1] = "#MC", #if DEBUG_STKSZ > EXCEPTION_STKSZ [N_EXCEPTION_STACKS ... @@ -106,7 +110,8 @@ void dump_trace(struct task_struct *task const struct stacktrace_ops *ops, void *data) { const unsigned cpu = get_cpu(); - unsigned long *irqstack_end = (unsigned long *)cpu_pda(cpu)->irqstackptr; + unsigned long *irq_stack_end = + (unsigned long *)per_cpu(irq_stack_ptr, cpu); unsigned used = 0; struct thread_info *tinfo; int graph = 0; @@ -160,23 +165,23 @@ void dump_trace(struct task_struct *task stack = (unsigned long *) estack_end[-2]; continue; } - if (irqstack_end) { - unsigned long *irqstack; - irqstack = irqstack_end - - (IRQSTACKSIZE - 64) / sizeof(*irqstack); + if (irq_stack_end) { + unsigned long *irq_stack; + irq_stack = irq_stack_end - + (IRQ_STACK_SIZE - 64) / sizeof(*irq_stack); - if (stack >= irqstack && stack < irqstack_end) { + if (stack >= irq_stack && stack < irq_stack_end) { if (ops->stack(data, "IRQ") < 0) break; bp = print_context_stack(tinfo, stack, bp, - ops, data, irqstack_end, &graph); + ops, data, irq_stack_end, &graph); /* * We link to the next stack (which would be * the process stack normally) the last * pointer (index -1 to end) in the IRQ stack: */ - stack = (unsigned long *) (irqstack_end[-1]); - irqstack_end = NULL; + stack = (unsigned long *) (irq_stack_end[-1]); + irq_stack_end = NULL; ops->stack(data, "EOI"); continue; } @@ -199,10 +204,10 @@ show_stack_log_lvl(struct task_struct *t unsigned long *stack; int i; const int cpu = smp_processor_id(); - unsigned long *irqstack_end = - (unsigned long *) (cpu_pda(cpu)->irqstackptr); - unsigned long *irqstack = - (unsigned long *) (cpu_pda(cpu)->irqstackptr - IRQSTACKSIZE); + unsigned long *irq_stack_end = + (unsigned long *)(per_cpu(irq_stack_ptr, cpu)); + unsigned long *irq_stack = + (unsigned long *)(per_cpu(irq_stack_ptr, cpu) - IRQ_STACK_SIZE); /* * debugging aid: "show_stack(NULL, NULL);" prints the @@ -218,9 +223,9 @@ show_stack_log_lvl(struct task_struct *t stack = sp; for (i = 0; i < kstack_depth_to_print; i++) { - if (stack >= irqstack && stack <= irqstack_end) { - if (stack == irqstack_end) { - stack = (unsigned long *) (irqstack_end[-1]); + if (stack >= irq_stack && stack <= irq_stack_end) { + if (stack == irq_stack_end) { + stack = (unsigned long *) (irq_stack_end[-1]); printk(" "); } } else { @@ -241,7 +246,7 @@ void show_registers(struct pt_regs *regs int i; unsigned long sp; const int cpu = smp_processor_id(); - struct task_struct *cur = cpu_pda(cpu)->pcurrent; + struct task_struct *cur = current; sp = regs->sp; printk("CPU %d ", cpu); Index: linux-2.6-tip/arch/x86/kernel/early-quirks.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/early-quirks.c +++ linux-2.6-tip/arch/x86/kernel/early-quirks.c @@ -97,6 +97,7 @@ static void __init nvidia_bugs(int num, } #if defined(CONFIG_ACPI) && defined(CONFIG_X86_IO_APIC) +#if defined(CONFIG_ACPI) && defined(CONFIG_X86_IO_APIC) static u32 __init ati_ixp4x0_rev(int num, int slot, int func) { u32 d; @@ -114,6 +115,7 @@ static u32 __init ati_ixp4x0_rev(int num d &= 0xff; return d; } +#endif static void __init ati_bugs(int num, int slot, int func) { Index: linux-2.6-tip/arch/x86/kernel/early_printk.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/early_printk.c +++ linux-2.6-tip/arch/x86/kernel/early_printk.c @@ -13,8 +13,8 @@ #include #include #include -#include #include +#include #include /* Simple VGA output */ @@ -59,7 +59,7 @@ static void early_vga_write(struct conso static struct console early_vga_console = { .name = "earlyvga", .write = early_vga_write, - .flags = CON_PRINTBUFFER, + .flags = CON_PRINTBUFFER | CON_ATOMIC, .index = -1, }; @@ -156,7 +156,7 @@ static __init void early_serial_init(cha static struct console early_serial_console = { .name = "earlyser", .write = early_serial_write, - .flags = CON_PRINTBUFFER, + .flags = CON_PRINTBUFFER | CON_ATOMIC, .index = -1, }; @@ -881,7 +881,7 @@ static int __initdata early_console_init asmlinkage void early_printk(const char *fmt, ...) { - char buf[512]; + static char buf[512]; int n; va_list ap; Index: linux-2.6-tip/arch/x86/kernel/efi.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/efi.c +++ linux-2.6-tip/arch/x86/kernel/efi.c @@ -366,10 +366,12 @@ void __init efi_init(void) SMBIOS_TABLE_GUID)) { efi.smbios = config_tables[i].table; printk(" SMBIOS=0x%lx ", config_tables[i].table); +#ifdef CONFIG_X86_UV } else if (!efi_guidcmp(config_tables[i].guid, UV_SYSTEM_TABLE_GUID)) { efi.uv_systab = config_tables[i].table; printk(" UVsystab=0x%lx ", config_tables[i].table); +#endif } else if (!efi_guidcmp(config_tables[i].guid, HCDP_TABLE_GUID)) { efi.hcdp = config_tables[i].table; Index: linux-2.6-tip/arch/x86/kernel/efi_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/efi_64.c +++ linux-2.6-tip/arch/x86/kernel/efi_64.c @@ -36,6 +36,7 @@ #include #include #include +#include static pgd_t save_pgd __initdata; static unsigned long efi_flags __initdata; Index: linux-2.6-tip/arch/x86/kernel/efi_stub_32.S =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/efi_stub_32.S +++ linux-2.6-tip/arch/x86/kernel/efi_stub_32.S @@ -6,7 +6,7 @@ */ #include -#include +#include /* * efi_call_phys(void *, ...) is a function with variable parameters. Index: linux-2.6-tip/arch/x86/kernel/entry_32.S =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/entry_32.S +++ linux-2.6-tip/arch/x86/kernel/entry_32.S @@ -30,12 +30,13 @@ * 1C(%esp) - %ds * 20(%esp) - %es * 24(%esp) - %fs - * 28(%esp) - orig_eax - * 2C(%esp) - %eip - * 30(%esp) - %cs - * 34(%esp) - %eflags - * 38(%esp) - %oldesp - * 3C(%esp) - %oldss + * 28(%esp) - %gs saved iff !CONFIG_X86_32_LAZY_GS + * 2C(%esp) - orig_eax + * 30(%esp) - %eip + * 34(%esp) - %cs + * 38(%esp) - %eflags + * 3C(%esp) - %oldesp + * 40(%esp) - %oldss * * "current" is in register %ebx during any slow entries. */ @@ -46,7 +47,7 @@ #include #include #include -#include +#include #include #include #include @@ -101,121 +102,221 @@ #define resume_userspace_sig resume_userspace #endif -#define SAVE_ALL \ - cld; \ - pushl %fs; \ - CFI_ADJUST_CFA_OFFSET 4;\ - /*CFI_REL_OFFSET fs, 0;*/\ - pushl %es; \ - CFI_ADJUST_CFA_OFFSET 4;\ - /*CFI_REL_OFFSET es, 0;*/\ - pushl %ds; \ - CFI_ADJUST_CFA_OFFSET 4;\ - /*CFI_REL_OFFSET ds, 0;*/\ - pushl %eax; \ - CFI_ADJUST_CFA_OFFSET 4;\ - CFI_REL_OFFSET eax, 0;\ - pushl %ebp; \ - CFI_ADJUST_CFA_OFFSET 4;\ - CFI_REL_OFFSET ebp, 0;\ - pushl %edi; \ - CFI_ADJUST_CFA_OFFSET 4;\ - CFI_REL_OFFSET edi, 0;\ - pushl %esi; \ - CFI_ADJUST_CFA_OFFSET 4;\ - CFI_REL_OFFSET esi, 0;\ - pushl %edx; \ - CFI_ADJUST_CFA_OFFSET 4;\ - CFI_REL_OFFSET edx, 0;\ - pushl %ecx; \ - CFI_ADJUST_CFA_OFFSET 4;\ - CFI_REL_OFFSET ecx, 0;\ - pushl %ebx; \ - CFI_ADJUST_CFA_OFFSET 4;\ - CFI_REL_OFFSET ebx, 0;\ - movl $(__USER_DS), %edx; \ - movl %edx, %ds; \ - movl %edx, %es; \ - movl $(__KERNEL_PERCPU), %edx; \ +/* + * User gs save/restore + * + * %gs is used for userland TLS and kernel only uses it for stack + * canary which is required to be at %gs:20 by gcc. Read the comment + * at the top of stackprotector.h for more info. + * + * Local labels 98 and 99 are used. + */ +#ifdef CONFIG_X86_32_LAZY_GS + + /* unfortunately push/pop can't be no-op */ +.macro PUSH_GS + pushl $0 + CFI_ADJUST_CFA_OFFSET 4 +.endm +.macro POP_GS pop=0 + addl $(4 + \pop), %esp + CFI_ADJUST_CFA_OFFSET -(4 + \pop) +.endm +.macro POP_GS_EX +.endm + + /* all the rest are no-op */ +.macro PTGS_TO_GS +.endm +.macro PTGS_TO_GS_EX +.endm +.macro GS_TO_REG reg +.endm +.macro REG_TO_PTGS reg +.endm +.macro SET_KERNEL_GS reg +.endm + +#else /* CONFIG_X86_32_LAZY_GS */ + +.macro PUSH_GS + pushl %gs + CFI_ADJUST_CFA_OFFSET 4 + /*CFI_REL_OFFSET gs, 0*/ +.endm + +.macro POP_GS pop=0 +98: popl %gs + CFI_ADJUST_CFA_OFFSET -4 + /*CFI_RESTORE gs*/ + .if \pop <> 0 + add $\pop, %esp + CFI_ADJUST_CFA_OFFSET -\pop + .endif +.endm +.macro POP_GS_EX +.pushsection .fixup, "ax" +99: movl $0, (%esp) + jmp 98b +.section __ex_table, "a" + .align 4 + .long 98b, 99b +.popsection +.endm + +.macro PTGS_TO_GS +98: mov PT_GS(%esp), %gs +.endm +.macro PTGS_TO_GS_EX +.pushsection .fixup, "ax" +99: movl $0, PT_GS(%esp) + jmp 98b +.section __ex_table, "a" + .align 4 + .long 98b, 99b +.popsection +.endm + +.macro GS_TO_REG reg + movl %gs, \reg + /*CFI_REGISTER gs, \reg*/ +.endm +.macro REG_TO_PTGS reg + movl \reg, PT_GS(%esp) + /*CFI_REL_OFFSET gs, PT_GS*/ +.endm +.macro SET_KERNEL_GS reg + movl $(__KERNEL_STACK_CANARY), \reg + movl \reg, %gs +.endm + +#endif /* CONFIG_X86_32_LAZY_GS */ + +.macro SAVE_ALL + cld + PUSH_GS + pushl %fs + CFI_ADJUST_CFA_OFFSET 4 + /*CFI_REL_OFFSET fs, 0;*/ + pushl %es + CFI_ADJUST_CFA_OFFSET 4 + /*CFI_REL_OFFSET es, 0;*/ + pushl %ds + CFI_ADJUST_CFA_OFFSET 4 + /*CFI_REL_OFFSET ds, 0;*/ + pushl %eax + CFI_ADJUST_CFA_OFFSET 4 + CFI_REL_OFFSET eax, 0 + pushl %ebp + CFI_ADJUST_CFA_OFFSET 4 + CFI_REL_OFFSET ebp, 0 + pushl %edi + CFI_ADJUST_CFA_OFFSET 4 + CFI_REL_OFFSET edi, 0 + pushl %esi + CFI_ADJUST_CFA_OFFSET 4 + CFI_REL_OFFSET esi, 0 + pushl %edx + CFI_ADJUST_CFA_OFFSET 4 + CFI_REL_OFFSET edx, 0 + pushl %ecx + CFI_ADJUST_CFA_OFFSET 4 + CFI_REL_OFFSET ecx, 0 + pushl %ebx + CFI_ADJUST_CFA_OFFSET 4 + CFI_REL_OFFSET ebx, 0 + movl $(__USER_DS), %edx + movl %edx, %ds + movl %edx, %es + movl $(__KERNEL_PERCPU), %edx movl %edx, %fs + SET_KERNEL_GS %edx +.endm -#define RESTORE_INT_REGS \ - popl %ebx; \ - CFI_ADJUST_CFA_OFFSET -4;\ - CFI_RESTORE ebx;\ - popl %ecx; \ - CFI_ADJUST_CFA_OFFSET -4;\ - CFI_RESTORE ecx;\ - popl %edx; \ - CFI_ADJUST_CFA_OFFSET -4;\ - CFI_RESTORE edx;\ - popl %esi; \ - CFI_ADJUST_CFA_OFFSET -4;\ - CFI_RESTORE esi;\ - popl %edi; \ - CFI_ADJUST_CFA_OFFSET -4;\ - CFI_RESTORE edi;\ - popl %ebp; \ - CFI_ADJUST_CFA_OFFSET -4;\ - CFI_RESTORE ebp;\ - popl %eax; \ - CFI_ADJUST_CFA_OFFSET -4;\ +.macro RESTORE_INT_REGS + popl %ebx + CFI_ADJUST_CFA_OFFSET -4 + CFI_RESTORE ebx + popl %ecx + CFI_ADJUST_CFA_OFFSET -4 + CFI_RESTORE ecx + popl %edx + CFI_ADJUST_CFA_OFFSET -4 + CFI_RESTORE edx + popl %esi + CFI_ADJUST_CFA_OFFSET -4 + CFI_RESTORE esi + popl %edi + CFI_ADJUST_CFA_OFFSET -4 + CFI_RESTORE edi + popl %ebp + CFI_ADJUST_CFA_OFFSET -4 + CFI_RESTORE ebp + popl %eax + CFI_ADJUST_CFA_OFFSET -4 CFI_RESTORE eax +.endm -#define RESTORE_REGS \ - RESTORE_INT_REGS; \ -1: popl %ds; \ - CFI_ADJUST_CFA_OFFSET -4;\ - /*CFI_RESTORE ds;*/\ -2: popl %es; \ - CFI_ADJUST_CFA_OFFSET -4;\ - /*CFI_RESTORE es;*/\ -3: popl %fs; \ - CFI_ADJUST_CFA_OFFSET -4;\ - /*CFI_RESTORE fs;*/\ -.pushsection .fixup,"ax"; \ -4: movl $0,(%esp); \ - jmp 1b; \ -5: movl $0,(%esp); \ - jmp 2b; \ -6: movl $0,(%esp); \ - jmp 3b; \ -.section __ex_table,"a";\ - .align 4; \ - .long 1b,4b; \ - .long 2b,5b; \ - .long 3b,6b; \ +.macro RESTORE_REGS pop=0 + RESTORE_INT_REGS +1: popl %ds + CFI_ADJUST_CFA_OFFSET -4 + /*CFI_RESTORE ds;*/ +2: popl %es + CFI_ADJUST_CFA_OFFSET -4 + /*CFI_RESTORE es;*/ +3: popl %fs + CFI_ADJUST_CFA_OFFSET -4 + /*CFI_RESTORE fs;*/ + POP_GS \pop +.pushsection .fixup, "ax" +4: movl $0, (%esp) + jmp 1b +5: movl $0, (%esp) + jmp 2b +6: movl $0, (%esp) + jmp 3b +.section __ex_table, "a" + .align 4 + .long 1b, 4b + .long 2b, 5b + .long 3b, 6b .popsection + POP_GS_EX +.endm -#define RING0_INT_FRAME \ - CFI_STARTPROC simple;\ - CFI_SIGNAL_FRAME;\ - CFI_DEF_CFA esp, 3*4;\ - /*CFI_OFFSET cs, -2*4;*/\ +.macro RING0_INT_FRAME + CFI_STARTPROC simple + CFI_SIGNAL_FRAME + CFI_DEF_CFA esp, 3*4 + /*CFI_OFFSET cs, -2*4;*/ CFI_OFFSET eip, -3*4 +.endm -#define RING0_EC_FRAME \ - CFI_STARTPROC simple;\ - CFI_SIGNAL_FRAME;\ - CFI_DEF_CFA esp, 4*4;\ - /*CFI_OFFSET cs, -2*4;*/\ +.macro RING0_EC_FRAME + CFI_STARTPROC simple + CFI_SIGNAL_FRAME + CFI_DEF_CFA esp, 4*4 + /*CFI_OFFSET cs, -2*4;*/ CFI_OFFSET eip, -3*4 +.endm -#define RING0_PTREGS_FRAME \ - CFI_STARTPROC simple;\ - CFI_SIGNAL_FRAME;\ - CFI_DEF_CFA esp, PT_OLDESP-PT_EBX;\ - /*CFI_OFFSET cs, PT_CS-PT_OLDESP;*/\ - CFI_OFFSET eip, PT_EIP-PT_OLDESP;\ - /*CFI_OFFSET es, PT_ES-PT_OLDESP;*/\ - /*CFI_OFFSET ds, PT_DS-PT_OLDESP;*/\ - CFI_OFFSET eax, PT_EAX-PT_OLDESP;\ - CFI_OFFSET ebp, PT_EBP-PT_OLDESP;\ - CFI_OFFSET edi, PT_EDI-PT_OLDESP;\ - CFI_OFFSET esi, PT_ESI-PT_OLDESP;\ - CFI_OFFSET edx, PT_EDX-PT_OLDESP;\ - CFI_OFFSET ecx, PT_ECX-PT_OLDESP;\ +.macro RING0_PTREGS_FRAME + CFI_STARTPROC simple + CFI_SIGNAL_FRAME + CFI_DEF_CFA esp, PT_OLDESP-PT_EBX + /*CFI_OFFSET cs, PT_CS-PT_OLDESP;*/ + CFI_OFFSET eip, PT_EIP-PT_OLDESP + /*CFI_OFFSET es, PT_ES-PT_OLDESP;*/ + /*CFI_OFFSET ds, PT_DS-PT_OLDESP;*/ + CFI_OFFSET eax, PT_EAX-PT_OLDESP + CFI_OFFSET ebp, PT_EBP-PT_OLDESP + CFI_OFFSET edi, PT_EDI-PT_OLDESP + CFI_OFFSET esi, PT_ESI-PT_OLDESP + CFI_OFFSET edx, PT_EDX-PT_OLDESP + CFI_OFFSET ecx, PT_ECX-PT_OLDESP CFI_OFFSET ebx, PT_EBX-PT_OLDESP +.endm ENTRY(ret_from_fork) CFI_STARTPROC @@ -270,14 +371,18 @@ END(ret_from_exception) #ifdef CONFIG_PREEMPT ENTRY(resume_kernel) DISABLE_INTERRUPTS(CLBR_ANY) + cmpl $0, kernel_preemption + jz restore_nocheck cmpl $0,TI_preempt_count(%ebp) # non-zero preempt_count ? jnz restore_nocheck need_resched: movl TI_flags(%ebp), %ecx # need_resched set ? testb $_TIF_NEED_RESCHED, %cl - jz restore_all + jz restore_nocheck testl $X86_EFLAGS_IF,PT_EFLAGS(%esp) # interrupts off (exception path) ? - jz restore_all + jz restore_nocheck + DISABLE_INTERRUPTS(CLBR_ANY) + call preempt_schedule_irq jmp need_resched END(resume_kernel) @@ -362,6 +467,7 @@ sysenter_exit: xorl %ebp,%ebp TRACE_IRQS_ON 1: mov PT_FS(%esp), %fs + PTGS_TO_GS ENABLE_INTERRUPTS_SYSEXIT #ifdef CONFIG_AUDITSYSCALL @@ -410,6 +516,7 @@ sysexit_audit: .align 4 .long 1b,2b .popsection + PTGS_TO_GS_EX ENDPROC(ia32_sysenter_target) # system call handler stub @@ -452,8 +559,7 @@ restore_all: restore_nocheck: TRACE_IRQS_IRET restore_nocheck_notrace: - RESTORE_REGS - addl $4, %esp # skip orig_eax/error_code + RESTORE_REGS 4 # skip orig_eax/error_code CFI_ADJUST_CFA_OFFSET -4 irq_return: INTERRUPT_RETURN @@ -513,20 +619,19 @@ ENDPROC(system_call) ALIGN RING0_PTREGS_FRAME # can't unwind into user space anyway work_pending: - testb $_TIF_NEED_RESCHED, %cl + testl $(_TIF_NEED_RESCHED), %ecx jz work_notifysig work_resched: - call schedule + call __schedule LOCKDEP_SYS_EXIT DISABLE_INTERRUPTS(CLBR_ANY) # make sure we don't miss an interrupt # setting need_resched or sigpending # between sampling and the iret - TRACE_IRQS_OFF movl TI_flags(%ebp), %ecx andl $_TIF_WORK_MASK, %ecx # is there any work to be done other # than syscall tracing? jz restore_all - testb $_TIF_NEED_RESCHED, %cl + testl $(_TIF_NEED_RESCHED), %ecx jnz work_resched work_notifysig: # deal with pending signals and @@ -595,28 +700,50 @@ syscall_badsys: END(syscall_badsys) CFI_ENDPROC -#define FIXUP_ESPFIX_STACK \ - /* since we are on a wrong stack, we cant make it a C code :( */ \ - PER_CPU(gdt_page, %ebx); \ - GET_DESC_BASE(GDT_ENTRY_ESPFIX_SS, %ebx, %eax, %ax, %al, %ah); \ - addl %esp, %eax; \ - pushl $__KERNEL_DS; \ - CFI_ADJUST_CFA_OFFSET 4; \ - pushl %eax; \ - CFI_ADJUST_CFA_OFFSET 4; \ - lss (%esp), %esp; \ - CFI_ADJUST_CFA_OFFSET -8; -#define UNWIND_ESPFIX_STACK \ - movl %ss, %eax; \ - /* see if on espfix stack */ \ - cmpw $__ESPFIX_SS, %ax; \ - jne 27f; \ - movl $__KERNEL_DS, %eax; \ - movl %eax, %ds; \ - movl %eax, %es; \ - /* switch to normal stack */ \ - FIXUP_ESPFIX_STACK; \ -27:; +/* + * System calls that need a pt_regs pointer. + */ +#define PTREGSCALL(name) \ + ALIGN; \ +ptregs_##name: \ + leal 4(%esp),%eax; \ + jmp sys_##name; + +PTREGSCALL(iopl) +PTREGSCALL(fork) +PTREGSCALL(clone) +PTREGSCALL(vfork) +PTREGSCALL(execve) +PTREGSCALL(sigaltstack) +PTREGSCALL(sigreturn) +PTREGSCALL(rt_sigreturn) +PTREGSCALL(vm86) +PTREGSCALL(vm86old) + +.macro FIXUP_ESPFIX_STACK + /* since we are on a wrong stack, we cant make it a C code :( */ + PER_CPU(gdt_page, %ebx) + GET_DESC_BASE(GDT_ENTRY_ESPFIX_SS, %ebx, %eax, %ax, %al, %ah) + addl %esp, %eax + pushl $__KERNEL_DS + CFI_ADJUST_CFA_OFFSET 4 + pushl %eax + CFI_ADJUST_CFA_OFFSET 4 + lss (%esp), %esp + CFI_ADJUST_CFA_OFFSET -8 +.endm +.macro UNWIND_ESPFIX_STACK + movl %ss, %eax + /* see if on espfix stack */ + cmpw $__ESPFIX_SS, %ax + jne 27f + movl $__KERNEL_DS, %eax + movl %eax, %ds + movl %eax, %es + /* switch to normal stack */ + FIXUP_ESPFIX_STACK +27: +.endm /* * Build the entry stubs and pointer table with some assembler magic. @@ -672,7 +799,7 @@ common_interrupt: ENDPROC(common_interrupt) CFI_ENDPROC -#define BUILD_INTERRUPT(name, nr) \ +#define BUILD_INTERRUPT3(name, nr, fn) \ ENTRY(name) \ RING0_INT_FRAME; \ pushl $~(nr); \ @@ -680,13 +807,15 @@ ENTRY(name) \ SAVE_ALL; \ TRACE_IRQS_OFF \ movl %esp,%eax; \ - call smp_##name; \ + call fn; \ jmp ret_from_intr; \ CFI_ENDPROC; \ ENDPROC(name) +#define BUILD_INTERRUPT(name, nr) BUILD_INTERRUPT3(name, nr, smp_##name) + /* The include is where all of the SMP etc. interrupts come from */ -#include "entry_arch.h" +#include ENTRY(coprocessor_error) RING0_INT_FRAME @@ -1068,7 +1197,10 @@ ENTRY(page_fault) CFI_ADJUST_CFA_OFFSET 4 ALIGN error_code: - /* the function address is in %fs's slot on the stack */ + /* the function address is in %gs's slot on the stack */ + pushl %fs + CFI_ADJUST_CFA_OFFSET 4 + /*CFI_REL_OFFSET fs, 0*/ pushl %es CFI_ADJUST_CFA_OFFSET 4 /*CFI_REL_OFFSET es, 0*/ @@ -1097,20 +1229,15 @@ error_code: CFI_ADJUST_CFA_OFFSET 4 CFI_REL_OFFSET ebx, 0 cld - pushl %fs - CFI_ADJUST_CFA_OFFSET 4 - /*CFI_REL_OFFSET fs, 0*/ movl $(__KERNEL_PERCPU), %ecx movl %ecx, %fs UNWIND_ESPFIX_STACK - popl %ecx - CFI_ADJUST_CFA_OFFSET -4 - /*CFI_REGISTER es, ecx*/ - movl PT_FS(%esp), %edi # get the function address + GS_TO_REG %ecx + movl PT_GS(%esp), %edi # get the function address movl PT_ORIG_EAX(%esp), %edx # get the error code movl $-1, PT_ORIG_EAX(%esp) # no syscall to restart - mov %ecx, PT_FS(%esp) - /*CFI_REL_OFFSET fs, ES*/ + REG_TO_PTGS %ecx + SET_KERNEL_GS %ecx movl $(__USER_DS), %ecx movl %ecx, %ds movl %ecx, %es @@ -1134,26 +1261,27 @@ END(page_fault) * by hand onto the new stack - while updating the return eip past * the instruction that would have done it for sysenter. */ -#define FIX_STACK(offset, ok, label) \ - cmpw $__KERNEL_CS,4(%esp); \ - jne ok; \ -label: \ - movl TSS_sysenter_sp0+offset(%esp),%esp; \ - CFI_DEF_CFA esp, 0; \ - CFI_UNDEFINED eip; \ - pushfl; \ - CFI_ADJUST_CFA_OFFSET 4; \ - pushl $__KERNEL_CS; \ - CFI_ADJUST_CFA_OFFSET 4; \ - pushl $sysenter_past_esp; \ - CFI_ADJUST_CFA_OFFSET 4; \ +.macro FIX_STACK offset ok label + cmpw $__KERNEL_CS, 4(%esp) + jne \ok +\label: + movl TSS_sysenter_sp0 + \offset(%esp), %esp + CFI_DEF_CFA esp, 0 + CFI_UNDEFINED eip + pushfl + CFI_ADJUST_CFA_OFFSET 4 + pushl $__KERNEL_CS + CFI_ADJUST_CFA_OFFSET 4 + pushl $sysenter_past_esp + CFI_ADJUST_CFA_OFFSET 4 CFI_REL_OFFSET eip, 0 +.endm ENTRY(debug) RING0_INT_FRAME cmpl $ia32_sysenter_target,(%esp) jne debug_stack_correct - FIX_STACK(12, debug_stack_correct, debug_esp_fix_insn) + FIX_STACK 12, debug_stack_correct, debug_esp_fix_insn debug_stack_correct: pushl $-1 # mark this as an int CFI_ADJUST_CFA_OFFSET 4 @@ -1211,7 +1339,7 @@ nmi_stack_correct: nmi_stack_fixup: RING0_INT_FRAME - FIX_STACK(12,nmi_stack_correct, 1) + FIX_STACK 12, nmi_stack_correct, 1 jmp nmi_stack_correct nmi_debug_stack_check: @@ -1222,7 +1350,7 @@ nmi_debug_stack_check: jb nmi_stack_correct cmpl $debug_esp_fix_insn,(%esp) ja nmi_stack_correct - FIX_STACK(24,nmi_stack_correct, 1) + FIX_STACK 24, nmi_stack_correct, 1 jmp nmi_stack_correct nmi_espfix_stack: Index: linux-2.6-tip/arch/x86/kernel/entry_64.S =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/entry_64.S +++ linux-2.6-tip/arch/x86/kernel/entry_64.S @@ -48,10 +48,11 @@ #include #include #include -#include +#include #include #include #include +#include /* Avoid __ASSEMBLER__'ifying just for this. */ #include @@ -209,7 +210,7 @@ ENTRY(native_usergs_sysret64) /* %rsp:at FRAMEEND */ .macro FIXUP_TOP_OF_STACK tmp offset=0 - movq %gs:pda_oldrsp,\tmp + movq PER_CPU_VAR(old_rsp),\tmp movq \tmp,RSP+\offset(%rsp) movq $__USER_DS,SS+\offset(%rsp) movq $__USER_CS,CS+\offset(%rsp) @@ -220,7 +221,7 @@ ENTRY(native_usergs_sysret64) .macro RESTORE_TOP_OF_STACK tmp offset=0 movq RSP+\offset(%rsp),\tmp - movq \tmp,%gs:pda_oldrsp + movq \tmp,PER_CPU_VAR(old_rsp) movq EFLAGS+\offset(%rsp),\tmp movq \tmp,R11+\offset(%rsp) .endm @@ -336,15 +337,15 @@ ENTRY(save_args) je 1f SWAPGS /* - * irqcount is used to check if a CPU is already on an interrupt stack + * irq_count is used to check if a CPU is already on an interrupt stack * or not. While this is essentially redundant with preempt_count it is * a little cheaper to use a separate counter in the PDA (short of * moving irq_enter into assembly, which would be too much work) */ -1: incl %gs:pda_irqcount +1: incl PER_CPU_VAR(irq_count) jne 2f popq_cfi %rax /* move return address... */ - mov %gs:pda_irqstackptr,%rsp + mov PER_CPU_VAR(irq_stack_ptr),%rsp EMPTY_FRAME 0 pushq_cfi %rbp /* backlink for unwinder */ pushq_cfi %rax /* ... to the new stack */ @@ -409,6 +410,8 @@ END(save_paranoid) ENTRY(ret_from_fork) DEFAULT_FRAME + LOCK ; btr $TIF_FORK,TI_flags(%r8) + push kernel_eflags(%rip) CFI_ADJUST_CFA_OFFSET 8 popf # reset kernel eflags @@ -468,7 +471,7 @@ END(ret_from_fork) ENTRY(system_call) CFI_STARTPROC simple CFI_SIGNAL_FRAME - CFI_DEF_CFA rsp,PDA_STACKOFFSET + CFI_DEF_CFA rsp,KERNEL_STACK_OFFSET CFI_REGISTER rip,rcx /*CFI_REGISTER rflags,r11*/ SWAPGS_UNSAFE_STACK @@ -479,8 +482,8 @@ ENTRY(system_call) */ ENTRY(system_call_after_swapgs) - movq %rsp,%gs:pda_oldrsp - movq %gs:pda_kernelstack,%rsp + movq %rsp,PER_CPU_VAR(old_rsp) + movq PER_CPU_VAR(kernel_stack),%rsp /* * No need to follow this irqs off/on section - it's straight * and short: @@ -523,7 +526,7 @@ sysret_check: CFI_REGISTER rip,rcx RESTORE_ARGS 0,-ARG_SKIP,1 /*CFI_REGISTER rflags,r11*/ - movq %gs:pda_oldrsp, %rsp + movq PER_CPU_VAR(old_rsp), %rsp USERGS_SYSRET64 CFI_RESTORE_STATE @@ -833,11 +836,11 @@ common_interrupt: XCPT_FRAME addq $-0x80,(%rsp) /* Adjust vector to [-256,-1] range */ interrupt do_IRQ - /* 0(%rsp): oldrsp-ARGOFFSET */ + /* 0(%rsp): old_rsp-ARGOFFSET */ ret_from_intr: DISABLE_INTERRUPTS(CLBR_NONE) TRACE_IRQS_OFF - decl %gs:pda_irqcount + decl PER_CPU_VAR(irq_count) leaveq CFI_DEF_CFA_REGISTER rsp CFI_ADJUST_CFA_OFFSET -8 @@ -982,8 +985,10 @@ apicinterrupt IRQ_MOVE_CLEANUP_VECTOR \ irq_move_cleanup_interrupt smp_irq_move_cleanup_interrupt #endif +#ifdef CONFIG_X86_UV apicinterrupt UV_BAU_MESSAGE \ uv_bau_message_intr1 uv_bau_message_interrupt +#endif apicinterrupt LOCAL_TIMER_VECTOR \ apic_timer_interrupt smp_apic_timer_interrupt @@ -1025,6 +1030,11 @@ apicinterrupt ERROR_APIC_VECTOR \ apicinterrupt SPURIOUS_APIC_VECTOR \ spurious_interrupt smp_spurious_interrupt +#ifdef CONFIG_PERF_COUNTERS +apicinterrupt LOCAL_PERF_VECTOR \ + perf_counter_interrupt smp_perf_counter_interrupt +#endif + /* * Exception entry points. */ @@ -1073,10 +1083,10 @@ ENTRY(\sym) TRACE_IRQS_OFF movq %rsp,%rdi /* pt_regs pointer */ xorl %esi,%esi /* no error code */ - movq %gs:pda_data_offset, %rbp - subq $EXCEPTION_STKSZ, per_cpu__init_tss + TSS_ist + (\ist - 1) * 8(%rbp) + PER_CPU(init_tss, %rbp) + subq $EXCEPTION_STKSZ, TSS_ist + (\ist - 1) * 8(%rbp) call \do_sym - addq $EXCEPTION_STKSZ, per_cpu__init_tss + TSS_ist + (\ist - 1) * 8(%rbp) + addq $EXCEPTION_STKSZ, TSS_ist + (\ist - 1) * 8(%rbp) jmp paranoid_exit /* %ebx: no swapgs flag */ CFI_ENDPROC END(\sym) @@ -1138,7 +1148,7 @@ ENTRY(native_load_gs_index) CFI_STARTPROC pushf CFI_ADJUST_CFA_OFFSET 8 - DISABLE_INTERRUPTS(CLBR_ANY | ~(CLBR_RDI)) + DISABLE_INTERRUPTS(CLBR_ANY & ~CLBR_RDI) SWAPGS gs_change: movl %edi,%gs @@ -1260,14 +1270,14 @@ ENTRY(call_softirq) CFI_REL_OFFSET rbp,0 mov %rsp,%rbp CFI_DEF_CFA_REGISTER rbp - incl %gs:pda_irqcount - cmove %gs:pda_irqstackptr,%rsp + incl PER_CPU_VAR(irq_count) + cmove PER_CPU_VAR(irq_stack_ptr),%rsp push %rbp # backlink for old unwinder call __do_softirq leaveq CFI_DEF_CFA_REGISTER rsp CFI_ADJUST_CFA_OFFSET -8 - decl %gs:pda_irqcount + decl PER_CPU_VAR(irq_count) ret CFI_ENDPROC END(call_softirq) @@ -1297,15 +1307,15 @@ ENTRY(xen_do_hypervisor_callback) # do movq %rdi, %rsp # we don't return, adjust the stack frame CFI_ENDPROC DEFAULT_FRAME -11: incl %gs:pda_irqcount +11: incl PER_CPU_VAR(irq_count) movq %rsp,%rbp CFI_DEF_CFA_REGISTER rbp - cmovzq %gs:pda_irqstackptr,%rsp + cmovzq PER_CPU_VAR(irq_stack_ptr),%rsp pushq %rbp # backlink for old unwinder call xen_evtchn_do_upcall popq %rsp CFI_DEF_CFA_REGISTER rsp - decl %gs:pda_irqcount + decl PER_CPU_VAR(irq_count) jmp error_exit CFI_ENDPROC END(do_hypervisor_callback) Index: linux-2.6-tip/arch/x86/kernel/es7000_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/es7000_32.c +++ /dev/null @@ -1,378 +0,0 @@ -/* - * Written by: Garry Forsgren, Unisys Corporation - * Natalie Protasevich, Unisys Corporation - * This file contains the code to configure and interface - * with Unisys ES7000 series hardware system manager. - * - * Copyright (c) 2003 Unisys Corporation. All Rights Reserved. - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it would be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. - * - * You should have received a copy of the GNU General Public License along - * with this program; if not, write the Free Software Foundation, Inc., 59 - * Temple Place - Suite 330, Boston MA 02111-1307, USA. - * - * Contact information: Unisys Corporation, Township Line & Union Meeting - * Roads-A, Unisys Way, Blue Bell, Pennsylvania, 19424, or: - * - * http://www.unisys.com - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -/* - * ES7000 chipsets - */ - -#define NON_UNISYS 0 -#define ES7000_CLASSIC 1 -#define ES7000_ZORRO 2 - - -#define MIP_REG 1 -#define MIP_PSAI_REG 4 - -#define MIP_BUSY 1 -#define MIP_SPIN 0xf0000 -#define MIP_VALID 0x0100000000000000ULL -#define MIP_PORT(VALUE) ((VALUE >> 32) & 0xffff) - -#define MIP_RD_LO(VALUE) (VALUE & 0xffffffff) - -struct mip_reg_info { - unsigned long long mip_info; - unsigned long long delivery_info; - unsigned long long host_reg; - unsigned long long mip_reg; -}; - -struct part_info { - unsigned char type; - unsigned char length; - unsigned char part_id; - unsigned char apic_mode; - unsigned long snum; - char ptype[16]; - char sname[64]; - char pname[64]; -}; - -struct psai { - unsigned long long entry_type; - unsigned long long addr; - unsigned long long bep_addr; -}; - -struct es7000_mem_info { - unsigned char type; - unsigned char length; - unsigned char resv[6]; - unsigned long long start; - unsigned long long size; -}; - -struct es7000_oem_table { - unsigned long long hdr; - struct mip_reg_info mip; - struct part_info pif; - struct es7000_mem_info shm; - struct psai psai; -}; - -#ifdef CONFIG_ACPI - -struct oem_table { - struct acpi_table_header Header; - u32 OEMTableAddr; - u32 OEMTableSize; -}; - -extern int find_unisys_acpi_oem_table(unsigned long *oem_addr); -extern void unmap_unisys_acpi_oem_table(unsigned long oem_addr); -#endif - -struct mip_reg { - unsigned long long off_0; - unsigned long long off_8; - unsigned long long off_10; - unsigned long long off_18; - unsigned long long off_20; - unsigned long long off_28; - unsigned long long off_30; - unsigned long long off_38; -}; - -#define MIP_SW_APIC 0x1020b -#define MIP_FUNC(VALUE) (VALUE & 0xff) - -/* - * ES7000 Globals - */ - -static volatile unsigned long *psai = NULL; -static struct mip_reg *mip_reg; -static struct mip_reg *host_reg; -static int mip_port; -static unsigned long mip_addr, host_addr; - -int es7000_plat; - -/* - * GSI override for ES7000 platforms. - */ - -static unsigned int base; - -static int -es7000_rename_gsi(int ioapic, int gsi) -{ - if (es7000_plat == ES7000_ZORRO) - return gsi; - - if (!base) { - int i; - for (i = 0; i < nr_ioapics; i++) - base += nr_ioapic_registers[i]; - } - - if (!ioapic && (gsi < 16)) - gsi += base; - return gsi; -} - -static int wakeup_secondary_cpu_via_mip(int cpu, unsigned long eip) -{ - unsigned long vect = 0, psaival = 0; - - if (psai == NULL) - return -1; - - vect = ((unsigned long)__pa(eip)/0x1000) << 16; - psaival = (0x1000000 | vect | cpu); - - while (*psai & 0x1000000) - ; - - *psai = psaival; - - return 0; -} - -static void noop_wait_for_deassert(atomic_t *deassert_not_used) -{ -} - -static int __init es7000_update_genapic(void) -{ - genapic->wakeup_cpu = wakeup_secondary_cpu_via_mip; - - /* MPENTIUMIII */ - if (boot_cpu_data.x86 == 6 && - (boot_cpu_data.x86_model >= 7 || boot_cpu_data.x86_model <= 11)) { - es7000_update_genapic_to_cluster(); - genapic->wait_for_init_deassert = noop_wait_for_deassert; - genapic->wakeup_cpu = wakeup_secondary_cpu_via_mip; - } - - return 0; -} - -void __init -setup_unisys(void) -{ - /* - * Determine the generation of the ES7000 currently running. - * - * es7000_plat = 1 if the machine is a 5xx ES7000 box - * es7000_plat = 2 if the machine is a x86_64 ES7000 box - * - */ - if (!(boot_cpu_data.x86 <= 15 && boot_cpu_data.x86_model <= 2)) - es7000_plat = ES7000_ZORRO; - else - es7000_plat = ES7000_CLASSIC; - ioapic_renumber_irq = es7000_rename_gsi; - - x86_quirks->update_genapic = es7000_update_genapic; -} - -/* - * Parse the OEM Table - */ - -int __init -parse_unisys_oem (char *oemptr) -{ - int i; - int success = 0; - unsigned char type, size; - unsigned long val; - char *tp = NULL; - struct psai *psaip = NULL; - struct mip_reg_info *mi; - struct mip_reg *host, *mip; - - tp = oemptr; - - tp += 8; - - for (i=0; i <= 6; i++) { - type = *tp++; - size = *tp++; - tp -= 2; - switch (type) { - case MIP_REG: - mi = (struct mip_reg_info *)tp; - val = MIP_RD_LO(mi->host_reg); - host_addr = val; - host = (struct mip_reg *)val; - host_reg = __va(host); - val = MIP_RD_LO(mi->mip_reg); - mip_port = MIP_PORT(mi->mip_info); - mip_addr = val; - mip = (struct mip_reg *)val; - mip_reg = __va(mip); - pr_debug("es7000_mipcfg: host_reg = 0x%lx \n", - (unsigned long)host_reg); - pr_debug("es7000_mipcfg: mip_reg = 0x%lx \n", - (unsigned long)mip_reg); - success++; - break; - case MIP_PSAI_REG: - psaip = (struct psai *)tp; - if (tp != NULL) { - if (psaip->addr) - psai = __va(psaip->addr); - else - psai = NULL; - success++; - } - break; - default: - break; - } - tp += size; - } - - if (success < 2) { - es7000_plat = NON_UNISYS; - } else - setup_unisys(); - return es7000_plat; -} - -#ifdef CONFIG_ACPI -static unsigned long oem_addrX; -static unsigned long oem_size; -int __init find_unisys_acpi_oem_table(unsigned long *oem_addr) -{ - struct acpi_table_header *header = NULL; - int i = 0; - - while (ACPI_SUCCESS(acpi_get_table("OEM1", i++, &header))) { - if (!memcmp((char *) &header->oem_id, "UNISYS", 6)) { - struct oem_table *t = (struct oem_table *)header; - - oem_addrX = t->OEMTableAddr; - oem_size = t->OEMTableSize; - - *oem_addr = (unsigned long)__acpi_map_table(oem_addrX, - oem_size); - return 0; - } - } - return -1; -} - -void __init unmap_unisys_acpi_oem_table(unsigned long oem_addr) -{ -} -#endif - -static void -es7000_spin(int n) -{ - int i = 0; - - while (i++ < n) - rep_nop(); -} - -static int __init -es7000_mip_write(struct mip_reg *mip_reg) -{ - int status = 0; - int spin; - - spin = MIP_SPIN; - while (((unsigned long long)host_reg->off_38 & - (unsigned long long)MIP_VALID) != 0) { - if (--spin <= 0) { - printk("es7000_mip_write: Timeout waiting for Host Valid Flag"); - return -1; - } - es7000_spin(MIP_SPIN); - } - - memcpy(host_reg, mip_reg, sizeof(struct mip_reg)); - outb(1, mip_port); - - spin = MIP_SPIN; - - while (((unsigned long long)mip_reg->off_38 & - (unsigned long long)MIP_VALID) == 0) { - if (--spin <= 0) { - printk("es7000_mip_write: Timeout waiting for MIP Valid Flag"); - return -1; - } - es7000_spin(MIP_SPIN); - } - - status = ((unsigned long long)mip_reg->off_0 & - (unsigned long long)0xffff0000000000ULL) >> 48; - mip_reg->off_38 = ((unsigned long long)mip_reg->off_38 & - (unsigned long long)~MIP_VALID); - return status; -} - -void __init -es7000_sw_apic(void) -{ - if (es7000_plat) { - int mip_status; - struct mip_reg es7000_mip_reg; - - printk("ES7000: Enabling APIC mode.\n"); - memset(&es7000_mip_reg, 0, sizeof(struct mip_reg)); - es7000_mip_reg.off_0 = MIP_SW_APIC; - es7000_mip_reg.off_38 = (MIP_VALID); - while ((mip_status = es7000_mip_write(&es7000_mip_reg)) != 0) - printk("es7000_sw_apic: command failed, status = %x\n", - mip_status); - return; - } -} Index: linux-2.6-tip/arch/x86/kernel/ftrace.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/ftrace.c +++ linux-2.6-tip/arch/x86/kernel/ftrace.c @@ -18,6 +18,7 @@ #include #include +#include #include #include #include @@ -26,6 +27,18 @@ #ifdef CONFIG_DYNAMIC_FTRACE +int ftrace_arch_code_modify_prepare(void) +{ + set_kernel_text_rw(); + return 0; +} + +int ftrace_arch_code_modify_post_process(void) +{ + set_kernel_text_ro(); + return 0; +} + union ftrace_code_union { char code[MCOUNT_INSN_SIZE]; struct { @@ -82,7 +95,7 @@ static unsigned char *ftrace_call_replac * are the same as what exists. */ -static atomic_t in_nmi = ATOMIC_INIT(0); +static atomic_t nmi_running = ATOMIC_INIT(0); static int mod_code_status; /* holds return value of text write */ static int mod_code_write; /* set when NMI should do the write */ static void *mod_code_ip; /* holds the IP to write to */ @@ -111,12 +124,16 @@ static void ftrace_mod_code(void) */ mod_code_status = probe_kernel_write(mod_code_ip, mod_code_newcode, MCOUNT_INSN_SIZE); + + /* if we fail, then kill any new writers */ + if (mod_code_status) + mod_code_write = 0; } void ftrace_nmi_enter(void) { - atomic_inc(&in_nmi); - /* Must have in_nmi seen before reading write flag */ + atomic_inc(&nmi_running); + /* Must have nmi_running seen before reading write flag */ smp_mb(); if (mod_code_write) { ftrace_mod_code(); @@ -126,22 +143,21 @@ void ftrace_nmi_enter(void) void ftrace_nmi_exit(void) { - /* Finish all executions before clearing in_nmi */ + /* Finish all executions before clearing nmi_running */ smp_wmb(); - atomic_dec(&in_nmi); + atomic_dec(&nmi_running); } static void wait_for_nmi(void) { - int waited = 0; + if (!atomic_read(&nmi_running)) + return; - while (atomic_read(&in_nmi)) { - waited = 1; + do { cpu_relax(); - } + } while (atomic_read(&nmi_running)); - if (waited) - nmi_wait_count++; + nmi_wait_count++; } static int @@ -368,100 +384,8 @@ int ftrace_disable_ftrace_graph_caller(v return ftrace_mod_jmp(ip, old_offset, new_offset); } -#else /* CONFIG_DYNAMIC_FTRACE */ - -/* - * These functions are picked from those used on - * this page for dynamic ftrace. They have been - * simplified to ignore all traces in NMI context. - */ -static atomic_t in_nmi; - -void ftrace_nmi_enter(void) -{ - atomic_inc(&in_nmi); -} - -void ftrace_nmi_exit(void) -{ - atomic_dec(&in_nmi); -} - #endif /* !CONFIG_DYNAMIC_FTRACE */ -/* Add a function return address to the trace stack on thread info.*/ -static int push_return_trace(unsigned long ret, unsigned long long time, - unsigned long func, int *depth) -{ - int index; - - if (!current->ret_stack) - return -EBUSY; - - /* The return trace stack is full */ - if (current->curr_ret_stack == FTRACE_RETFUNC_DEPTH - 1) { - atomic_inc(¤t->trace_overrun); - return -EBUSY; - } - - index = ++current->curr_ret_stack; - barrier(); - current->ret_stack[index].ret = ret; - current->ret_stack[index].func = func; - current->ret_stack[index].calltime = time; - *depth = index; - - return 0; -} - -/* Retrieve a function return address to the trace stack on thread info.*/ -static void pop_return_trace(struct ftrace_graph_ret *trace, unsigned long *ret) -{ - int index; - - index = current->curr_ret_stack; - - if (unlikely(index < 0)) { - ftrace_graph_stop(); - WARN_ON(1); - /* Might as well panic, otherwise we have no where to go */ - *ret = (unsigned long)panic; - return; - } - - *ret = current->ret_stack[index].ret; - trace->func = current->ret_stack[index].func; - trace->calltime = current->ret_stack[index].calltime; - trace->overrun = atomic_read(¤t->trace_overrun); - trace->depth = index; - barrier(); - current->curr_ret_stack--; - -} - -/* - * Send the trace to the ring-buffer. - * @return the original return address. - */ -unsigned long ftrace_return_to_handler(void) -{ - struct ftrace_graph_ret trace; - unsigned long ret; - - pop_return_trace(&trace, &ret); - trace.rettime = cpu_clock(raw_smp_processor_id()); - ftrace_graph_return(&trace); - - if (unlikely(!ret)) { - ftrace_graph_stop(); - WARN_ON(1); - /* Might as well panic. What else to do? */ - ret = (unsigned long)panic; - } - - return ret; -} - /* * Hook the return address and push it in the stack of return addrs * in current thread info. @@ -476,7 +400,7 @@ void prepare_ftrace_return(unsigned long &return_to_handler; /* Nmi's are currently unsupported */ - if (unlikely(atomic_read(&in_nmi))) + if (unlikely(in_nmi())) return; if (unlikely(atomic_read(¤t->tracing_graph_pause))) @@ -512,16 +436,9 @@ void prepare_ftrace_return(unsigned long return; } - if (unlikely(!__kernel_text_address(old))) { - ftrace_graph_stop(); - *parent = old; - WARN_ON(1); - return; - } - calltime = cpu_clock(raw_smp_processor_id()); - if (push_return_trace(old, calltime, + if (ftrace_push_return_trace(old, calltime, self_addr, &trace.depth) == -EBUSY) { *parent = old; return; Index: linux-2.6-tip/arch/x86/kernel/genapic_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/genapic_64.c +++ /dev/null @@ -1,82 +0,0 @@ -/* - * Copyright 2004 James Cleverdon, IBM. - * Subject to the GNU Public License, v.2 - * - * Generic APIC sub-arch probe layer. - * - * Hacked for x86-64 by James Cleverdon from i386 architecture code by - * Martin Bligh, Andi Kleen, James Bottomley, John Stultz, and - * James Cleverdon. - */ -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include -#include -#include - -extern struct genapic apic_flat; -extern struct genapic apic_physflat; -extern struct genapic apic_x2xpic_uv_x; -extern struct genapic apic_x2apic_phys; -extern struct genapic apic_x2apic_cluster; - -struct genapic __read_mostly *genapic = &apic_flat; - -static struct genapic *apic_probe[] __initdata = { - &apic_x2apic_uv_x, - &apic_x2apic_phys, - &apic_x2apic_cluster, - &apic_physflat, - NULL, -}; - -/* - * Check the APIC IDs in bios_cpu_apicid and choose the APIC mode. - */ -void __init setup_apic_routing(void) -{ - if (genapic == &apic_x2apic_phys || genapic == &apic_x2apic_cluster) { - if (!intr_remapping_enabled) - genapic = &apic_flat; - } - - if (genapic == &apic_flat) { - if (max_physical_apicid >= 8) - genapic = &apic_physflat; - printk(KERN_INFO "Setting APIC routing to %s\n", genapic->name); - } - - if (x86_quirks->update_genapic) - x86_quirks->update_genapic(); -} - -/* Same for both flat and physical. */ - -void apic_send_IPI_self(int vector) -{ - __send_IPI_shortcut(APIC_DEST_SELF, vector, APIC_DEST_PHYSICAL); -} - -int __init acpi_madt_oem_check(char *oem_id, char *oem_table_id) -{ - int i; - - for (i = 0; apic_probe[i]; ++i) { - if (apic_probe[i]->acpi_madt_oem_check(oem_id, oem_table_id)) { - genapic = apic_probe[i]; - printk(KERN_INFO "Setting APIC routing to %s.\n", - genapic->name); - return 1; - } - } - return 0; -} Index: linux-2.6-tip/arch/x86/kernel/genapic_flat_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/genapic_flat_64.c +++ /dev/null @@ -1,307 +0,0 @@ -/* - * Copyright 2004 James Cleverdon, IBM. - * Subject to the GNU Public License, v.2 - * - * Flat APIC subarch code. - * - * Hacked for x86-64 by James Cleverdon from i386 architecture code by - * Martin Bligh, Andi Kleen, James Bottomley, John Stultz, and - * James Cleverdon. - */ -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#ifdef CONFIG_ACPI -#include -#endif - -static int flat_acpi_madt_oem_check(char *oem_id, char *oem_table_id) -{ - return 1; -} - -static const struct cpumask *flat_target_cpus(void) -{ - return cpu_online_mask; -} - -static void flat_vector_allocation_domain(int cpu, struct cpumask *retmask) -{ - /* Careful. Some cpus do not strictly honor the set of cpus - * specified in the interrupt destination when using lowest - * priority interrupt delivery mode. - * - * In particular there was a hyperthreading cpu observed to - * deliver interrupts to the wrong hyperthread when only one - * hyperthread was specified in the interrupt desitination. - */ - cpumask_clear(retmask); - cpumask_bits(retmask)[0] = APIC_ALL_CPUS; -} - -/* - * Set up the logical destination ID. - * - * Intel recommends to set DFR, LDR and TPR before enabling - * an APIC. See e.g. "AP-388 82489DX User's Manual" (Intel - * document number 292116). So here it goes... - */ -static void flat_init_apic_ldr(void) -{ - unsigned long val; - unsigned long num, id; - - num = smp_processor_id(); - id = 1UL << num; - apic_write(APIC_DFR, APIC_DFR_FLAT); - val = apic_read(APIC_LDR) & ~APIC_LDR_MASK; - val |= SET_APIC_LOGICAL_ID(id); - apic_write(APIC_LDR, val); -} - -static inline void _flat_send_IPI_mask(unsigned long mask, int vector) -{ - unsigned long flags; - - local_irq_save(flags); - __send_IPI_dest_field(mask, vector, APIC_DEST_LOGICAL); - local_irq_restore(flags); -} - -static void flat_send_IPI_mask(const struct cpumask *cpumask, int vector) -{ - unsigned long mask = cpumask_bits(cpumask)[0]; - - _flat_send_IPI_mask(mask, vector); -} - -static void flat_send_IPI_mask_allbutself(const struct cpumask *cpumask, - int vector) -{ - unsigned long mask = cpumask_bits(cpumask)[0]; - int cpu = smp_processor_id(); - - if (cpu < BITS_PER_LONG) - clear_bit(cpu, &mask); - _flat_send_IPI_mask(mask, vector); -} - -static void flat_send_IPI_allbutself(int vector) -{ - int cpu = smp_processor_id(); -#ifdef CONFIG_HOTPLUG_CPU - int hotplug = 1; -#else - int hotplug = 0; -#endif - if (hotplug || vector == NMI_VECTOR) { - if (!cpumask_equal(cpu_online_mask, cpumask_of(cpu))) { - unsigned long mask = cpumask_bits(cpu_online_mask)[0]; - - if (cpu < BITS_PER_LONG) - clear_bit(cpu, &mask); - - _flat_send_IPI_mask(mask, vector); - } - } else if (num_online_cpus() > 1) { - __send_IPI_shortcut(APIC_DEST_ALLBUT, vector,APIC_DEST_LOGICAL); - } -} - -static void flat_send_IPI_all(int vector) -{ - if (vector == NMI_VECTOR) - flat_send_IPI_mask(cpu_online_mask, vector); - else - __send_IPI_shortcut(APIC_DEST_ALLINC, vector, APIC_DEST_LOGICAL); -} - -static unsigned int get_apic_id(unsigned long x) -{ - unsigned int id; - - id = (((x)>>24) & 0xFFu); - return id; -} - -static unsigned long set_apic_id(unsigned int id) -{ - unsigned long x; - - x = ((id & 0xFFu)<<24); - return x; -} - -static unsigned int read_xapic_id(void) -{ - unsigned int id; - - id = get_apic_id(apic_read(APIC_ID)); - return id; -} - -static int flat_apic_id_registered(void) -{ - return physid_isset(read_xapic_id(), phys_cpu_present_map); -} - -static unsigned int flat_cpu_mask_to_apicid(const struct cpumask *cpumask) -{ - return cpumask_bits(cpumask)[0] & APIC_ALL_CPUS; -} - -static unsigned int flat_cpu_mask_to_apicid_and(const struct cpumask *cpumask, - const struct cpumask *andmask) -{ - unsigned long mask1 = cpumask_bits(cpumask)[0] & APIC_ALL_CPUS; - unsigned long mask2 = cpumask_bits(andmask)[0] & APIC_ALL_CPUS; - - return mask1 & mask2; -} - -static unsigned int phys_pkg_id(int index_msb) -{ - return hard_smp_processor_id() >> index_msb; -} - -struct genapic apic_flat = { - .name = "flat", - .acpi_madt_oem_check = flat_acpi_madt_oem_check, - .int_delivery_mode = dest_LowestPrio, - .int_dest_mode = (APIC_DEST_LOGICAL != 0), - .target_cpus = flat_target_cpus, - .vector_allocation_domain = flat_vector_allocation_domain, - .apic_id_registered = flat_apic_id_registered, - .init_apic_ldr = flat_init_apic_ldr, - .send_IPI_all = flat_send_IPI_all, - .send_IPI_allbutself = flat_send_IPI_allbutself, - .send_IPI_mask = flat_send_IPI_mask, - .send_IPI_mask_allbutself = flat_send_IPI_mask_allbutself, - .send_IPI_self = apic_send_IPI_self, - .cpu_mask_to_apicid = flat_cpu_mask_to_apicid, - .cpu_mask_to_apicid_and = flat_cpu_mask_to_apicid_and, - .phys_pkg_id = phys_pkg_id, - .get_apic_id = get_apic_id, - .set_apic_id = set_apic_id, - .apic_id_mask = (0xFFu<<24), -}; - -/* - * Physflat mode is used when there are more than 8 CPUs on a AMD system. - * We cannot use logical delivery in this case because the mask - * overflows, so use physical mode. - */ -static int physflat_acpi_madt_oem_check(char *oem_id, char *oem_table_id) -{ -#ifdef CONFIG_ACPI - /* - * Quirk: some x86_64 machines can only use physical APIC mode - * regardless of how many processors are present (x86_64 ES7000 - * is an example). - */ - if (acpi_gbl_FADT.header.revision > FADT2_REVISION_ID && - (acpi_gbl_FADT.flags & ACPI_FADT_APIC_PHYSICAL)) { - printk(KERN_DEBUG "system APIC only can use physical flat"); - return 1; - } -#endif - - return 0; -} - -static const struct cpumask *physflat_target_cpus(void) -{ - return cpu_online_mask; -} - -static void physflat_vector_allocation_domain(int cpu, struct cpumask *retmask) -{ - cpumask_clear(retmask); - cpumask_set_cpu(cpu, retmask); -} - -static void physflat_send_IPI_mask(const struct cpumask *cpumask, int vector) -{ - send_IPI_mask_sequence(cpumask, vector); -} - -static void physflat_send_IPI_mask_allbutself(const struct cpumask *cpumask, - int vector) -{ - send_IPI_mask_allbutself(cpumask, vector); -} - -static void physflat_send_IPI_allbutself(int vector) -{ - send_IPI_mask_allbutself(cpu_online_mask, vector); -} - -static void physflat_send_IPI_all(int vector) -{ - physflat_send_IPI_mask(cpu_online_mask, vector); -} - -static unsigned int physflat_cpu_mask_to_apicid(const struct cpumask *cpumask) -{ - int cpu; - - /* - * We're using fixed IRQ delivery, can only return one phys APIC ID. - * May as well be the first. - */ - cpu = cpumask_first(cpumask); - if ((unsigned)cpu < nr_cpu_ids) - return per_cpu(x86_cpu_to_apicid, cpu); - else - return BAD_APICID; -} - -static unsigned int -physflat_cpu_mask_to_apicid_and(const struct cpumask *cpumask, - const struct cpumask *andmask) -{ - int cpu; - - /* - * We're using fixed IRQ delivery, can only return one phys APIC ID. - * May as well be the first. - */ - for_each_cpu_and(cpu, cpumask, andmask) - if (cpumask_test_cpu(cpu, cpu_online_mask)) - break; - if (cpu < nr_cpu_ids) - return per_cpu(x86_cpu_to_apicid, cpu); - return BAD_APICID; -} - -struct genapic apic_physflat = { - .name = "physical flat", - .acpi_madt_oem_check = physflat_acpi_madt_oem_check, - .int_delivery_mode = dest_Fixed, - .int_dest_mode = (APIC_DEST_PHYSICAL != 0), - .target_cpus = physflat_target_cpus, - .vector_allocation_domain = physflat_vector_allocation_domain, - .apic_id_registered = flat_apic_id_registered, - .init_apic_ldr = flat_init_apic_ldr,/*not needed, but shouldn't hurt*/ - .send_IPI_all = physflat_send_IPI_all, - .send_IPI_allbutself = physflat_send_IPI_allbutself, - .send_IPI_mask = physflat_send_IPI_mask, - .send_IPI_mask_allbutself = physflat_send_IPI_mask_allbutself, - .send_IPI_self = apic_send_IPI_self, - .cpu_mask_to_apicid = physflat_cpu_mask_to_apicid, - .cpu_mask_to_apicid_and = physflat_cpu_mask_to_apicid_and, - .phys_pkg_id = phys_pkg_id, - .get_apic_id = get_apic_id, - .set_apic_id = set_apic_id, - .apic_id_mask = (0xFFu<<24), -}; Index: linux-2.6-tip/arch/x86/kernel/genx2apic_cluster.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/genx2apic_cluster.c +++ /dev/null @@ -1,198 +0,0 @@ -#include -#include -#include -#include -#include -#include -#include - -#include -#include -#include - -DEFINE_PER_CPU(u32, x86_cpu_to_logical_apicid); - -static int x2apic_acpi_madt_oem_check(char *oem_id, char *oem_table_id) -{ - if (cpu_has_x2apic) - return 1; - - return 0; -} - -/* Start with all IRQs pointing to boot CPU. IRQ balancing will shift them. */ - -static const struct cpumask *x2apic_target_cpus(void) -{ - return cpumask_of(0); -} - -/* - * for now each logical cpu is in its own vector allocation domain. - */ -static void x2apic_vector_allocation_domain(int cpu, struct cpumask *retmask) -{ - cpumask_clear(retmask); - cpumask_set_cpu(cpu, retmask); -} - -static void __x2apic_send_IPI_dest(unsigned int apicid, int vector, - unsigned int dest) -{ - unsigned long cfg; - - cfg = __prepare_ICR(0, vector, dest); - - /* - * send the IPI. - */ - x2apic_icr_write(cfg, apicid); -} - -/* - * for now, we send the IPI's one by one in the cpumask. - * TBD: Based on the cpu mask, we can send the IPI's to the cluster group - * at once. We have 16 cpu's in a cluster. This will minimize IPI register - * writes. - */ -static void x2apic_send_IPI_mask(const struct cpumask *mask, int vector) -{ - unsigned long flags; - unsigned long query_cpu; - - local_irq_save(flags); - for_each_cpu(query_cpu, mask) - __x2apic_send_IPI_dest( - per_cpu(x86_cpu_to_logical_apicid, query_cpu), - vector, APIC_DEST_LOGICAL); - local_irq_restore(flags); -} - -static void x2apic_send_IPI_mask_allbutself(const struct cpumask *mask, - int vector) -{ - unsigned long flags; - unsigned long query_cpu; - unsigned long this_cpu = smp_processor_id(); - - local_irq_save(flags); - for_each_cpu(query_cpu, mask) - if (query_cpu != this_cpu) - __x2apic_send_IPI_dest( - per_cpu(x86_cpu_to_logical_apicid, query_cpu), - vector, APIC_DEST_LOGICAL); - local_irq_restore(flags); -} - -static void x2apic_send_IPI_allbutself(int vector) -{ - unsigned long flags; - unsigned long query_cpu; - unsigned long this_cpu = smp_processor_id(); - - local_irq_save(flags); - for_each_online_cpu(query_cpu) - if (query_cpu != this_cpu) - __x2apic_send_IPI_dest( - per_cpu(x86_cpu_to_logical_apicid, query_cpu), - vector, APIC_DEST_LOGICAL); - local_irq_restore(flags); -} - -static void x2apic_send_IPI_all(int vector) -{ - x2apic_send_IPI_mask(cpu_online_mask, vector); -} - -static int x2apic_apic_id_registered(void) -{ - return 1; -} - -static unsigned int x2apic_cpu_mask_to_apicid(const struct cpumask *cpumask) -{ - int cpu; - - /* - * We're using fixed IRQ delivery, can only return one logical APIC ID. - * May as well be the first. - */ - cpu = cpumask_first(cpumask); - if ((unsigned)cpu < nr_cpu_ids) - return per_cpu(x86_cpu_to_logical_apicid, cpu); - else - return BAD_APICID; -} - -static unsigned int x2apic_cpu_mask_to_apicid_and(const struct cpumask *cpumask, - const struct cpumask *andmask) -{ - int cpu; - - /* - * We're using fixed IRQ delivery, can only return one logical APIC ID. - * May as well be the first. - */ - for_each_cpu_and(cpu, cpumask, andmask) - if (cpumask_test_cpu(cpu, cpu_online_mask)) - break; - if (cpu < nr_cpu_ids) - return per_cpu(x86_cpu_to_logical_apicid, cpu); - return BAD_APICID; -} - -static unsigned int get_apic_id(unsigned long x) -{ - unsigned int id; - - id = x; - return id; -} - -static unsigned long set_apic_id(unsigned int id) -{ - unsigned long x; - - x = id; - return x; -} - -static unsigned int phys_pkg_id(int index_msb) -{ - return current_cpu_data.initial_apicid >> index_msb; -} - -static void x2apic_send_IPI_self(int vector) -{ - apic_write(APIC_SELF_IPI, vector); -} - -static void init_x2apic_ldr(void) -{ - int cpu = smp_processor_id(); - - per_cpu(x86_cpu_to_logical_apicid, cpu) = apic_read(APIC_LDR); - return; -} - -struct genapic apic_x2apic_cluster = { - .name = "cluster x2apic", - .acpi_madt_oem_check = x2apic_acpi_madt_oem_check, - .int_delivery_mode = dest_LowestPrio, - .int_dest_mode = (APIC_DEST_LOGICAL != 0), - .target_cpus = x2apic_target_cpus, - .vector_allocation_domain = x2apic_vector_allocation_domain, - .apic_id_registered = x2apic_apic_id_registered, - .init_apic_ldr = init_x2apic_ldr, - .send_IPI_all = x2apic_send_IPI_all, - .send_IPI_allbutself = x2apic_send_IPI_allbutself, - .send_IPI_mask = x2apic_send_IPI_mask, - .send_IPI_mask_allbutself = x2apic_send_IPI_mask_allbutself, - .send_IPI_self = x2apic_send_IPI_self, - .cpu_mask_to_apicid = x2apic_cpu_mask_to_apicid, - .cpu_mask_to_apicid_and = x2apic_cpu_mask_to_apicid_and, - .phys_pkg_id = phys_pkg_id, - .get_apic_id = get_apic_id, - .set_apic_id = set_apic_id, - .apic_id_mask = (0xFFFFFFFFu), -}; Index: linux-2.6-tip/arch/x86/kernel/genx2apic_phys.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/genx2apic_phys.c +++ /dev/null @@ -1,194 +0,0 @@ -#include -#include -#include -#include -#include -#include -#include - -#include -#include -#include - -static int x2apic_phys; - -static int set_x2apic_phys_mode(char *arg) -{ - x2apic_phys = 1; - return 0; -} -early_param("x2apic_phys", set_x2apic_phys_mode); - -static int x2apic_acpi_madt_oem_check(char *oem_id, char *oem_table_id) -{ - if (cpu_has_x2apic && x2apic_phys) - return 1; - - return 0; -} - -/* Start with all IRQs pointing to boot CPU. IRQ balancing will shift them. */ - -static const struct cpumask *x2apic_target_cpus(void) -{ - return cpumask_of(0); -} - -static void x2apic_vector_allocation_domain(int cpu, struct cpumask *retmask) -{ - cpumask_clear(retmask); - cpumask_set_cpu(cpu, retmask); -} - -static void __x2apic_send_IPI_dest(unsigned int apicid, int vector, - unsigned int dest) -{ - unsigned long cfg; - - cfg = __prepare_ICR(0, vector, dest); - - /* - * send the IPI. - */ - x2apic_icr_write(cfg, apicid); -} - -static void x2apic_send_IPI_mask(const struct cpumask *mask, int vector) -{ - unsigned long flags; - unsigned long query_cpu; - - local_irq_save(flags); - for_each_cpu(query_cpu, mask) { - __x2apic_send_IPI_dest(per_cpu(x86_cpu_to_apicid, query_cpu), - vector, APIC_DEST_PHYSICAL); - } - local_irq_restore(flags); -} - -static void x2apic_send_IPI_mask_allbutself(const struct cpumask *mask, - int vector) -{ - unsigned long flags; - unsigned long query_cpu; - unsigned long this_cpu = smp_processor_id(); - - local_irq_save(flags); - for_each_cpu(query_cpu, mask) { - if (query_cpu != this_cpu) - __x2apic_send_IPI_dest( - per_cpu(x86_cpu_to_apicid, query_cpu), - vector, APIC_DEST_PHYSICAL); - } - local_irq_restore(flags); -} - -static void x2apic_send_IPI_allbutself(int vector) -{ - unsigned long flags; - unsigned long query_cpu; - unsigned long this_cpu = smp_processor_id(); - - local_irq_save(flags); - for_each_online_cpu(query_cpu) - if (query_cpu != this_cpu) - __x2apic_send_IPI_dest( - per_cpu(x86_cpu_to_apicid, query_cpu), - vector, APIC_DEST_PHYSICAL); - local_irq_restore(flags); -} - -static void x2apic_send_IPI_all(int vector) -{ - x2apic_send_IPI_mask(cpu_online_mask, vector); -} - -static int x2apic_apic_id_registered(void) -{ - return 1; -} - -static unsigned int x2apic_cpu_mask_to_apicid(const struct cpumask *cpumask) -{ - int cpu; - - /* - * We're using fixed IRQ delivery, can only return one phys APIC ID. - * May as well be the first. - */ - cpu = cpumask_first(cpumask); - if ((unsigned)cpu < nr_cpu_ids) - return per_cpu(x86_cpu_to_apicid, cpu); - else - return BAD_APICID; -} - -static unsigned int x2apic_cpu_mask_to_apicid_and(const struct cpumask *cpumask, - const struct cpumask *andmask) -{ - int cpu; - - /* - * We're using fixed IRQ delivery, can only return one phys APIC ID. - * May as well be the first. - */ - for_each_cpu_and(cpu, cpumask, andmask) - if (cpumask_test_cpu(cpu, cpu_online_mask)) - break; - if (cpu < nr_cpu_ids) - return per_cpu(x86_cpu_to_apicid, cpu); - return BAD_APICID; -} - -static unsigned int get_apic_id(unsigned long x) -{ - unsigned int id; - - id = x; - return id; -} - -static unsigned long set_apic_id(unsigned int id) -{ - unsigned long x; - - x = id; - return x; -} - -static unsigned int phys_pkg_id(int index_msb) -{ - return current_cpu_data.initial_apicid >> index_msb; -} - -static void x2apic_send_IPI_self(int vector) -{ - apic_write(APIC_SELF_IPI, vector); -} - -static void init_x2apic_ldr(void) -{ - return; -} - -struct genapic apic_x2apic_phys = { - .name = "physical x2apic", - .acpi_madt_oem_check = x2apic_acpi_madt_oem_check, - .int_delivery_mode = dest_Fixed, - .int_dest_mode = (APIC_DEST_PHYSICAL != 0), - .target_cpus = x2apic_target_cpus, - .vector_allocation_domain = x2apic_vector_allocation_domain, - .apic_id_registered = x2apic_apic_id_registered, - .init_apic_ldr = init_x2apic_ldr, - .send_IPI_all = x2apic_send_IPI_all, - .send_IPI_allbutself = x2apic_send_IPI_allbutself, - .send_IPI_mask = x2apic_send_IPI_mask, - .send_IPI_mask_allbutself = x2apic_send_IPI_mask_allbutself, - .send_IPI_self = x2apic_send_IPI_self, - .cpu_mask_to_apicid = x2apic_cpu_mask_to_apicid, - .cpu_mask_to_apicid_and = x2apic_cpu_mask_to_apicid_and, - .phys_pkg_id = phys_pkg_id, - .get_apic_id = get_apic_id, - .set_apic_id = set_apic_id, - .apic_id_mask = (0xFFFFFFFFu), -}; Index: linux-2.6-tip/arch/x86/kernel/genx2apic_uv_x.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/genx2apic_uv_x.c +++ /dev/null @@ -1,600 +0,0 @@ -/* - * This file is subject to the terms and conditions of the GNU General Public - * License. See the file "COPYING" in the main directory of this archive - * for more details. - * - * SGI UV APIC functions (note: not an Intel compatible APIC) - * - * Copyright (C) 2007-2008 Silicon Graphics, Inc. All rights reserved. - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -DEFINE_PER_CPU(int, x2apic_extra_bits); - -static enum uv_system_type uv_system_type; - -static int uv_acpi_madt_oem_check(char *oem_id, char *oem_table_id) -{ - if (!strcmp(oem_id, "SGI")) { - if (!strcmp(oem_table_id, "UVL")) - uv_system_type = UV_LEGACY_APIC; - else if (!strcmp(oem_table_id, "UVX")) - uv_system_type = UV_X2APIC; - else if (!strcmp(oem_table_id, "UVH")) { - uv_system_type = UV_NON_UNIQUE_APIC; - return 1; - } - } - return 0; -} - -enum uv_system_type get_uv_system_type(void) -{ - return uv_system_type; -} - -int is_uv_system(void) -{ - return uv_system_type != UV_NONE; -} -EXPORT_SYMBOL_GPL(is_uv_system); - -DEFINE_PER_CPU(struct uv_hub_info_s, __uv_hub_info); -EXPORT_PER_CPU_SYMBOL_GPL(__uv_hub_info); - -struct uv_blade_info *uv_blade_info; -EXPORT_SYMBOL_GPL(uv_blade_info); - -short *uv_node_to_blade; -EXPORT_SYMBOL_GPL(uv_node_to_blade); - -short *uv_cpu_to_blade; -EXPORT_SYMBOL_GPL(uv_cpu_to_blade); - -short uv_possible_blades; -EXPORT_SYMBOL_GPL(uv_possible_blades); - -unsigned long sn_rtc_cycles_per_second; -EXPORT_SYMBOL(sn_rtc_cycles_per_second); - -/* Start with all IRQs pointing to boot CPU. IRQ balancing will shift them. */ - -static const struct cpumask *uv_target_cpus(void) -{ - return cpumask_of(0); -} - -static void uv_vector_allocation_domain(int cpu, struct cpumask *retmask) -{ - cpumask_clear(retmask); - cpumask_set_cpu(cpu, retmask); -} - -int uv_wakeup_secondary(int phys_apicid, unsigned int start_rip) -{ - unsigned long val; - int pnode; - - pnode = uv_apicid_to_pnode(phys_apicid); - val = (1UL << UVH_IPI_INT_SEND_SHFT) | - (phys_apicid << UVH_IPI_INT_APIC_ID_SHFT) | - (((long)start_rip << UVH_IPI_INT_VECTOR_SHFT) >> 12) | - APIC_DM_INIT; - uv_write_global_mmr64(pnode, UVH_IPI_INT, val); - mdelay(10); - - val = (1UL << UVH_IPI_INT_SEND_SHFT) | - (phys_apicid << UVH_IPI_INT_APIC_ID_SHFT) | - (((long)start_rip << UVH_IPI_INT_VECTOR_SHFT) >> 12) | - APIC_DM_STARTUP; - uv_write_global_mmr64(pnode, UVH_IPI_INT, val); - return 0; -} - -static void uv_send_IPI_one(int cpu, int vector) -{ - unsigned long val, apicid, lapicid; - int pnode; - - apicid = per_cpu(x86_cpu_to_apicid, cpu); - lapicid = apicid & 0x3f; /* ZZZ macro needed */ - pnode = uv_apicid_to_pnode(apicid); - val = - (1UL << UVH_IPI_INT_SEND_SHFT) | (lapicid << - UVH_IPI_INT_APIC_ID_SHFT) | - (vector << UVH_IPI_INT_VECTOR_SHFT); - uv_write_global_mmr64(pnode, UVH_IPI_INT, val); -} - -static void uv_send_IPI_mask(const struct cpumask *mask, int vector) -{ - unsigned int cpu; - - for_each_cpu(cpu, mask) - uv_send_IPI_one(cpu, vector); -} - -static void uv_send_IPI_mask_allbutself(const struct cpumask *mask, int vector) -{ - unsigned int cpu; - unsigned int this_cpu = smp_processor_id(); - - for_each_cpu(cpu, mask) - if (cpu != this_cpu) - uv_send_IPI_one(cpu, vector); -} - -static void uv_send_IPI_allbutself(int vector) -{ - unsigned int cpu; - unsigned int this_cpu = smp_processor_id(); - - for_each_online_cpu(cpu) - if (cpu != this_cpu) - uv_send_IPI_one(cpu, vector); -} - -static void uv_send_IPI_all(int vector) -{ - uv_send_IPI_mask(cpu_online_mask, vector); -} - -static int uv_apic_id_registered(void) -{ - return 1; -} - -static void uv_init_apic_ldr(void) -{ -} - -static unsigned int uv_cpu_mask_to_apicid(const struct cpumask *cpumask) -{ - int cpu; - - /* - * We're using fixed IRQ delivery, can only return one phys APIC ID. - * May as well be the first. - */ - cpu = cpumask_first(cpumask); - if ((unsigned)cpu < nr_cpu_ids) - return per_cpu(x86_cpu_to_apicid, cpu); - else - return BAD_APICID; -} - -static unsigned int uv_cpu_mask_to_apicid_and(const struct cpumask *cpumask, - const struct cpumask *andmask) -{ - int cpu; - - /* - * We're using fixed IRQ delivery, can only return one phys APIC ID. - * May as well be the first. - */ - for_each_cpu_and(cpu, cpumask, andmask) - if (cpumask_test_cpu(cpu, cpu_online_mask)) - break; - if (cpu < nr_cpu_ids) - return per_cpu(x86_cpu_to_apicid, cpu); - return BAD_APICID; -} - -static unsigned int get_apic_id(unsigned long x) -{ - unsigned int id; - - WARN_ON(preemptible() && num_online_cpus() > 1); - id = x | __get_cpu_var(x2apic_extra_bits); - - return id; -} - -static unsigned long set_apic_id(unsigned int id) -{ - unsigned long x; - - /* maskout x2apic_extra_bits ? */ - x = id; - return x; -} - -static unsigned int uv_read_apic_id(void) -{ - - return get_apic_id(apic_read(APIC_ID)); -} - -static unsigned int phys_pkg_id(int index_msb) -{ - return uv_read_apic_id() >> index_msb; -} - -static void uv_send_IPI_self(int vector) -{ - apic_write(APIC_SELF_IPI, vector); -} - -struct genapic apic_x2apic_uv_x = { - .name = "UV large system", - .acpi_madt_oem_check = uv_acpi_madt_oem_check, - .int_delivery_mode = dest_Fixed, - .int_dest_mode = (APIC_DEST_PHYSICAL != 0), - .target_cpus = uv_target_cpus, - .vector_allocation_domain = uv_vector_allocation_domain, - .apic_id_registered = uv_apic_id_registered, - .init_apic_ldr = uv_init_apic_ldr, - .send_IPI_all = uv_send_IPI_all, - .send_IPI_allbutself = uv_send_IPI_allbutself, - .send_IPI_mask = uv_send_IPI_mask, - .send_IPI_mask_allbutself = uv_send_IPI_mask_allbutself, - .send_IPI_self = uv_send_IPI_self, - .cpu_mask_to_apicid = uv_cpu_mask_to_apicid, - .cpu_mask_to_apicid_and = uv_cpu_mask_to_apicid_and, - .phys_pkg_id = phys_pkg_id, - .get_apic_id = get_apic_id, - .set_apic_id = set_apic_id, - .apic_id_mask = (0xFFFFFFFFu), -}; - -static __cpuinit void set_x2apic_extra_bits(int pnode) -{ - __get_cpu_var(x2apic_extra_bits) = (pnode << 6); -} - -/* - * Called on boot cpu. - */ -static __init int boot_pnode_to_blade(int pnode) -{ - int blade; - - for (blade = 0; blade < uv_num_possible_blades(); blade++) - if (pnode == uv_blade_info[blade].pnode) - return blade; - BUG(); -} - -struct redir_addr { - unsigned long redirect; - unsigned long alias; -}; - -#define DEST_SHIFT UVH_RH_GAM_ALIAS210_REDIRECT_CONFIG_0_MMR_DEST_BASE_SHFT - -static __initdata struct redir_addr redir_addrs[] = { - {UVH_RH_GAM_ALIAS210_REDIRECT_CONFIG_0_MMR, UVH_SI_ALIAS0_OVERLAY_CONFIG}, - {UVH_RH_GAM_ALIAS210_REDIRECT_CONFIG_1_MMR, UVH_SI_ALIAS1_OVERLAY_CONFIG}, - {UVH_RH_GAM_ALIAS210_REDIRECT_CONFIG_2_MMR, UVH_SI_ALIAS2_OVERLAY_CONFIG}, -}; - -static __init void get_lowmem_redirect(unsigned long *base, unsigned long *size) -{ - union uvh_si_alias0_overlay_config_u alias; - union uvh_rh_gam_alias210_redirect_config_2_mmr_u redirect; - int i; - - for (i = 0; i < ARRAY_SIZE(redir_addrs); i++) { - alias.v = uv_read_local_mmr(redir_addrs[i].alias); - if (alias.s.base == 0) { - *size = (1UL << alias.s.m_alias); - redirect.v = uv_read_local_mmr(redir_addrs[i].redirect); - *base = (unsigned long)redirect.s.dest_base << DEST_SHIFT; - return; - } - } - BUG(); -} - -static __init void map_low_mmrs(void) -{ - init_extra_mapping_uc(UV_GLOBAL_MMR32_BASE, UV_GLOBAL_MMR32_SIZE); - init_extra_mapping_uc(UV_LOCAL_MMR_BASE, UV_LOCAL_MMR_SIZE); -} - -enum map_type {map_wb, map_uc}; - -static __init void map_high(char *id, unsigned long base, int shift, - int max_pnode, enum map_type map_type) -{ - unsigned long bytes, paddr; - - paddr = base << shift; - bytes = (1UL << shift) * (max_pnode + 1); - printk(KERN_INFO "UV: Map %s_HI 0x%lx - 0x%lx\n", id, paddr, - paddr + bytes); - if (map_type == map_uc) - init_extra_mapping_uc(paddr, bytes); - else - init_extra_mapping_wb(paddr, bytes); - -} -static __init void map_gru_high(int max_pnode) -{ - union uvh_rh_gam_gru_overlay_config_mmr_u gru; - int shift = UVH_RH_GAM_GRU_OVERLAY_CONFIG_MMR_BASE_SHFT; - - gru.v = uv_read_local_mmr(UVH_RH_GAM_GRU_OVERLAY_CONFIG_MMR); - if (gru.s.enable) - map_high("GRU", gru.s.base, shift, max_pnode, map_wb); -} - -static __init void map_config_high(int max_pnode) -{ - union uvh_rh_gam_cfg_overlay_config_mmr_u cfg; - int shift = UVH_RH_GAM_CFG_OVERLAY_CONFIG_MMR_BASE_SHFT; - - cfg.v = uv_read_local_mmr(UVH_RH_GAM_CFG_OVERLAY_CONFIG_MMR); - if (cfg.s.enable) - map_high("CONFIG", cfg.s.base, shift, max_pnode, map_uc); -} - -static __init void map_mmr_high(int max_pnode) -{ - union uvh_rh_gam_mmr_overlay_config_mmr_u mmr; - int shift = UVH_RH_GAM_MMR_OVERLAY_CONFIG_MMR_BASE_SHFT; - - mmr.v = uv_read_local_mmr(UVH_RH_GAM_MMR_OVERLAY_CONFIG_MMR); - if (mmr.s.enable) - map_high("MMR", mmr.s.base, shift, max_pnode, map_uc); -} - -static __init void map_mmioh_high(int max_pnode) -{ - union uvh_rh_gam_mmioh_overlay_config_mmr_u mmioh; - int shift = UVH_RH_GAM_MMIOH_OVERLAY_CONFIG_MMR_BASE_SHFT; - - mmioh.v = uv_read_local_mmr(UVH_RH_GAM_MMIOH_OVERLAY_CONFIG_MMR); - if (mmioh.s.enable) - map_high("MMIOH", mmioh.s.base, shift, max_pnode, map_uc); -} - -static __init void uv_rtc_init(void) -{ - long status; - u64 ticks_per_sec; - - status = uv_bios_freq_base(BIOS_FREQ_BASE_REALTIME_CLOCK, - &ticks_per_sec); - if (status != BIOS_STATUS_SUCCESS || ticks_per_sec < 100000) { - printk(KERN_WARNING - "unable to determine platform RTC clock frequency, " - "guessing.\n"); - /* BIOS gives wrong value for clock freq. so guess */ - sn_rtc_cycles_per_second = 1000000000000UL / 30000UL; - } else - sn_rtc_cycles_per_second = ticks_per_sec; -} - -/* - * percpu heartbeat timer - */ -static void uv_heartbeat(unsigned long ignored) -{ - struct timer_list *timer = &uv_hub_info->scir.timer; - unsigned char bits = uv_hub_info->scir.state; - - /* flip heartbeat bit */ - bits ^= SCIR_CPU_HEARTBEAT; - - /* is this cpu idle? */ - if (idle_cpu(raw_smp_processor_id())) - bits &= ~SCIR_CPU_ACTIVITY; - else - bits |= SCIR_CPU_ACTIVITY; - - /* update system controller interface reg */ - uv_set_scir_bits(bits); - - /* enable next timer period */ - mod_timer(timer, jiffies + SCIR_CPU_HB_INTERVAL); -} - -static void __cpuinit uv_heartbeat_enable(int cpu) -{ - if (!uv_cpu_hub_info(cpu)->scir.enabled) { - struct timer_list *timer = &uv_cpu_hub_info(cpu)->scir.timer; - - uv_set_cpu_scir_bits(cpu, SCIR_CPU_HEARTBEAT|SCIR_CPU_ACTIVITY); - setup_timer(timer, uv_heartbeat, cpu); - timer->expires = jiffies + SCIR_CPU_HB_INTERVAL; - add_timer_on(timer, cpu); - uv_cpu_hub_info(cpu)->scir.enabled = 1; - } - - /* check boot cpu */ - if (!uv_cpu_hub_info(0)->scir.enabled) - uv_heartbeat_enable(0); -} - -#ifdef CONFIG_HOTPLUG_CPU -static void __cpuinit uv_heartbeat_disable(int cpu) -{ - if (uv_cpu_hub_info(cpu)->scir.enabled) { - uv_cpu_hub_info(cpu)->scir.enabled = 0; - del_timer(&uv_cpu_hub_info(cpu)->scir.timer); - } - uv_set_cpu_scir_bits(cpu, 0xff); -} - -/* - * cpu hotplug notifier - */ -static __cpuinit int uv_scir_cpu_notify(struct notifier_block *self, - unsigned long action, void *hcpu) -{ - long cpu = (long)hcpu; - - switch (action) { - case CPU_ONLINE: - uv_heartbeat_enable(cpu); - break; - case CPU_DOWN_PREPARE: - uv_heartbeat_disable(cpu); - break; - default: - break; - } - return NOTIFY_OK; -} - -static __init void uv_scir_register_cpu_notifier(void) -{ - hotcpu_notifier(uv_scir_cpu_notify, 0); -} - -#else /* !CONFIG_HOTPLUG_CPU */ - -static __init void uv_scir_register_cpu_notifier(void) -{ -} - -static __init int uv_init_heartbeat(void) -{ - int cpu; - - if (is_uv_system()) - for_each_online_cpu(cpu) - uv_heartbeat_enable(cpu); - return 0; -} - -late_initcall(uv_init_heartbeat); - -#endif /* !CONFIG_HOTPLUG_CPU */ - -/* - * Called on each cpu to initialize the per_cpu UV data area. - * ZZZ hotplug not supported yet - */ -void __cpuinit uv_cpu_init(void) -{ - /* CPU 0 initilization will be done via uv_system_init. */ - if (!uv_blade_info) - return; - - uv_blade_info[uv_numa_blade_id()].nr_online_cpus++; - - if (get_uv_system_type() == UV_NON_UNIQUE_APIC) - set_x2apic_extra_bits(uv_hub_info->pnode); -} - - -void __init uv_system_init(void) -{ - union uvh_si_addr_map_config_u m_n_config; - union uvh_node_id_u node_id; - unsigned long gnode_upper, lowmem_redir_base, lowmem_redir_size; - int bytes, nid, cpu, lcpu, pnode, blade, i, j, m_val, n_val; - int max_pnode = 0; - unsigned long mmr_base, present; - - map_low_mmrs(); - - m_n_config.v = uv_read_local_mmr(UVH_SI_ADDR_MAP_CONFIG); - m_val = m_n_config.s.m_skt; - n_val = m_n_config.s.n_skt; - mmr_base = - uv_read_local_mmr(UVH_RH_GAM_MMR_OVERLAY_CONFIG_MMR) & - ~UV_MMR_ENABLE; - printk(KERN_DEBUG "UV: global MMR base 0x%lx\n", mmr_base); - - for(i = 0; i < UVH_NODE_PRESENT_TABLE_DEPTH; i++) - uv_possible_blades += - hweight64(uv_read_local_mmr( UVH_NODE_PRESENT_TABLE + i * 8)); - printk(KERN_DEBUG "UV: Found %d blades\n", uv_num_possible_blades()); - - bytes = sizeof(struct uv_blade_info) * uv_num_possible_blades(); - uv_blade_info = kmalloc(bytes, GFP_KERNEL); - - get_lowmem_redirect(&lowmem_redir_base, &lowmem_redir_size); - - bytes = sizeof(uv_node_to_blade[0]) * num_possible_nodes(); - uv_node_to_blade = kmalloc(bytes, GFP_KERNEL); - memset(uv_node_to_blade, 255, bytes); - - bytes = sizeof(uv_cpu_to_blade[0]) * num_possible_cpus(); - uv_cpu_to_blade = kmalloc(bytes, GFP_KERNEL); - memset(uv_cpu_to_blade, 255, bytes); - - blade = 0; - for (i = 0; i < UVH_NODE_PRESENT_TABLE_DEPTH; i++) { - present = uv_read_local_mmr(UVH_NODE_PRESENT_TABLE + i * 8); - for (j = 0; j < 64; j++) { - if (!test_bit(j, &present)) - continue; - uv_blade_info[blade].pnode = (i * 64 + j); - uv_blade_info[blade].nr_possible_cpus = 0; - uv_blade_info[blade].nr_online_cpus = 0; - blade++; - } - } - - node_id.v = uv_read_local_mmr(UVH_NODE_ID); - gnode_upper = (((unsigned long)node_id.s.node_id) & - ~((1 << n_val) - 1)) << m_val; - - uv_bios_init(); - uv_bios_get_sn_info(0, &uv_type, &sn_partition_id, - &sn_coherency_id, &sn_region_size); - uv_rtc_init(); - - for_each_present_cpu(cpu) { - nid = cpu_to_node(cpu); - pnode = uv_apicid_to_pnode(per_cpu(x86_cpu_to_apicid, cpu)); - blade = boot_pnode_to_blade(pnode); - lcpu = uv_blade_info[blade].nr_possible_cpus; - uv_blade_info[blade].nr_possible_cpus++; - - uv_cpu_hub_info(cpu)->lowmem_remap_base = lowmem_redir_base; - uv_cpu_hub_info(cpu)->lowmem_remap_top = lowmem_redir_size; - uv_cpu_hub_info(cpu)->m_val = m_val; - uv_cpu_hub_info(cpu)->n_val = m_val; - uv_cpu_hub_info(cpu)->numa_blade_id = blade; - uv_cpu_hub_info(cpu)->blade_processor_id = lcpu; - uv_cpu_hub_info(cpu)->pnode = pnode; - uv_cpu_hub_info(cpu)->pnode_mask = (1 << n_val) - 1; - uv_cpu_hub_info(cpu)->gpa_mask = (1 << (m_val + n_val)) - 1; - uv_cpu_hub_info(cpu)->gnode_upper = gnode_upper; - uv_cpu_hub_info(cpu)->global_mmr_base = mmr_base; - uv_cpu_hub_info(cpu)->coherency_domain_number = sn_coherency_id; - uv_cpu_hub_info(cpu)->scir.offset = SCIR_LOCAL_MMR_BASE + lcpu; - uv_node_to_blade[nid] = blade; - uv_cpu_to_blade[cpu] = blade; - max_pnode = max(pnode, max_pnode); - - printk(KERN_DEBUG "UV: cpu %d, apicid 0x%x, pnode %d, nid %d, " - "lcpu %d, blade %d\n", - cpu, per_cpu(x86_cpu_to_apicid, cpu), pnode, nid, - lcpu, blade); - } - - map_gru_high(max_pnode); - map_mmr_high(max_pnode); - map_config_high(max_pnode); - map_mmioh_high(max_pnode); - - uv_cpu_init(); - uv_scir_register_cpu_notifier(); - proc_mkdir("sgi_uv", NULL); -} Index: linux-2.6-tip/arch/x86/kernel/head64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/head64.c +++ linux-2.6-tip/arch/x86/kernel/head64.c @@ -26,32 +26,15 @@ #include #include -/* boot cpu pda */ -static struct x8664_pda _boot_cpu_pda; - -#ifdef CONFIG_SMP -/* - * We install an empty cpu_pda pointer table to indicate to early users - * (numa_set_node) that the cpu_pda pointer table for cpus other than - * the boot cpu is not yet setup. - */ -static struct x8664_pda *__cpu_pda[NR_CPUS] __initdata; -#else -static struct x8664_pda *__cpu_pda[NR_CPUS] __read_mostly; -#endif - -void __init x86_64_init_pda(void) -{ - _cpu_pda = __cpu_pda; - cpu_pda(0) = &_boot_cpu_pda; - pda_init(0); -} - static void __init zap_identity_mappings(void) { pgd_t *pgd = pgd_offset_k(0UL); pgd_clear(pgd); - __flush_tlb_all(); + /* + * preempt_disable/enable does not work this early in the + * bootup yet: + */ + write_cr3(read_cr3()); } /* Don't add a printk in there. printk relies on the PDA which is not initialized @@ -112,8 +95,6 @@ void __init x86_64_start_kernel(char * r if (console_loglevel == 10) early_printk("Kernel alive\n"); - x86_64_init_pda(); - x86_64_start_reservations(real_mode_data); } Index: linux-2.6-tip/arch/x86/kernel/head_32.S =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/head_32.S +++ linux-2.6-tip/arch/x86/kernel/head_32.S @@ -11,14 +11,15 @@ #include #include #include -#include -#include +#include +#include #include #include #include #include #include #include +#include /* Physical address */ #define pa(X) ((X) - __PAGE_OFFSET) @@ -60,7 +61,7 @@ LOW_PAGES = 1<<(32-PAGE_SHIFT_asm) * pagetables from above the 16MB DMA limit, so we'll have to set * up pagetables 16MB more (worst-case): */ -#ifdef CONFIG_DEBUG_PAGEALLOC +#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK) LOW_PAGES = LOW_PAGES + 0x1000000 #endif @@ -429,14 +430,34 @@ is386: movl $2,%ecx # set MP ljmp $(__KERNEL_CS),$1f 1: movl $(__KERNEL_DS),%eax # reload all the segment registers movl %eax,%ss # after changing gdt. - movl %eax,%fs # gets reset once there's real percpu movl $(__USER_DS),%eax # DS/ES contains default USER segment movl %eax,%ds movl %eax,%es - xorl %eax,%eax # Clear GS and LDT + movl $(__KERNEL_PERCPU), %eax + movl %eax,%fs # set this cpu's percpu + +#ifdef CONFIG_CC_STACKPROTECTOR + /* + * The linker can't handle this by relocation. Manually set + * base address in stack canary segment descriptor. + */ + cmpb $0,ready + jne 1f + movl $per_cpu__gdt_page,%eax + movl $per_cpu__stack_canary,%ecx + subl $20, %ecx + movw %cx, 8 * GDT_ENTRY_STACK_CANARY + 2(%eax) + shrl $16, %ecx + movb %cl, 8 * GDT_ENTRY_STACK_CANARY + 4(%eax) + movb %ch, 8 * GDT_ENTRY_STACK_CANARY + 7(%eax) +1: +#endif + movl $(__KERNEL_STACK_CANARY),%eax movl %eax,%gs + + xorl %eax,%eax # Clear LDT lldt %ax cld # gcc2 wants the direction flag cleared at all times @@ -446,8 +467,6 @@ is386: movl $2,%ecx # set MP movb $1, ready cmpb $0,%cl # the first CPU calls start_kernel je 1f - movl $(__KERNEL_PERCPU), %eax - movl %eax,%fs # set this cpu's percpu movl (stack_start), %esp 1: #endif /* CONFIG_SMP */ @@ -548,12 +567,8 @@ early_fault: pushl %eax pushl %edx /* trapno */ pushl $fault_msg -#ifdef CONFIG_EARLY_PRINTK - call early_printk -#else call printk #endif -#endif call dump_stack hlt_loop: hlt @@ -580,12 +595,12 @@ ignore_int: pushl 32(%esp) pushl 40(%esp) pushl $int_msg -#ifdef CONFIG_EARLY_PRINTK - call early_printk -#else call printk -#endif + + call dump_stack + addl $(5*4),%esp + call dump_stack popl %ds popl %es popl %edx @@ -660,7 +675,7 @@ early_recursion_flag: .long 0 int_msg: - .asciz "Unknown interrupt or fault at EIP %p %p %p\n" + .asciz "Unknown interrupt or fault at: %p %p %p\n" fault_msg: /* fault info: */ Index: linux-2.6-tip/arch/x86/kernel/head_64.S =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/head_64.S +++ linux-2.6-tip/arch/x86/kernel/head_64.S @@ -19,6 +19,7 @@ #include #include #include +#include #ifdef CONFIG_PARAVIRT #include @@ -226,12 +227,15 @@ ENTRY(secondary_startup_64) movl %eax,%fs movl %eax,%gs - /* - * Setup up a dummy PDA. this is just for some early bootup code - * that does in_interrupt() - */ + /* Set up %gs. + * + * The base of %gs always points to the bottom of the irqstack + * union. If the stack protector canary is enabled, it is + * located at %gs:40. Note that, on SMP, the boot cpu uses + * init data section till per cpu areas are set up. + */ movl $MSR_GS_BASE,%ecx - movq $empty_zero_page,%rax + movq initial_gs(%rip),%rax movq %rax,%rdx shrq $32,%rdx wrmsr @@ -257,6 +261,8 @@ ENTRY(secondary_startup_64) .align 8 ENTRY(initial_code) .quad x86_64_start_kernel + ENTRY(initial_gs) + .quad INIT_PER_CPU_VAR(irq_stack_union) __FINITDATA ENTRY(stack_start) @@ -401,7 +407,8 @@ NEXT_PAGE(level2_spare_pgt) .globl early_gdt_descr early_gdt_descr: .word GDT_ENTRIES*8-1 - .quad per_cpu__gdt_page +early_gdt_descr_base: + .quad INIT_PER_CPU_VAR(gdt_page) ENTRY(phys_base) /* This must match the first entry in level2_kernel_pgt */ Index: linux-2.6-tip/arch/x86/kernel/hpet.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/hpet.c +++ linux-2.6-tip/arch/x86/kernel/hpet.c @@ -80,6 +80,7 @@ static inline void hpet_clear_mapping(vo */ static int boot_hpet_disable; int hpet_force_user; +static int hpet_verbose; static int __init hpet_setup(char *str) { @@ -88,6 +89,8 @@ static int __init hpet_setup(char *str) boot_hpet_disable = 1; if (!strncmp("force", str, 5)) hpet_force_user = 1; + if (!strncmp("verbose", str, 7)) + hpet_verbose = 1; } return 1; } @@ -119,6 +122,43 @@ int is_hpet_enabled(void) } EXPORT_SYMBOL_GPL(is_hpet_enabled); +static void _hpet_print_config(const char *function, int line) +{ + u32 i, timers, l, h; + printk(KERN_INFO "hpet: %s(%d):\n", function, line); + l = hpet_readl(HPET_ID); + h = hpet_readl(HPET_PERIOD); + timers = ((l & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT) + 1; + printk(KERN_INFO "hpet: ID: 0x%x, PERIOD: 0x%x\n", l, h); + l = hpet_readl(HPET_CFG); + h = hpet_readl(HPET_STATUS); + printk(KERN_INFO "hpet: CFG: 0x%x, STATUS: 0x%x\n", l, h); + l = hpet_readl(HPET_COUNTER); + h = hpet_readl(HPET_COUNTER+4); + printk(KERN_INFO "hpet: COUNTER_l: 0x%x, COUNTER_h: 0x%x\n", l, h); + + for (i = 0; i < timers; i++) { + l = hpet_readl(HPET_Tn_CFG(i)); + h = hpet_readl(HPET_Tn_CFG(i)+4); + printk(KERN_INFO "hpet: T%d: CFG_l: 0x%x, CFG_h: 0x%x\n", + i, l, h); + l = hpet_readl(HPET_Tn_CMP(i)); + h = hpet_readl(HPET_Tn_CMP(i)+4); + printk(KERN_INFO "hpet: T%d: CMP_l: 0x%x, CMP_h: 0x%x\n", + i, l, h); + l = hpet_readl(HPET_Tn_ROUTE(i)); + h = hpet_readl(HPET_Tn_ROUTE(i)+4); + printk(KERN_INFO "hpet: T%d ROUTE_l: 0x%x, ROUTE_h: 0x%x\n", + i, l, h); + } +} + +#define hpet_print_config() \ +do { \ + if (hpet_verbose) \ + _hpet_print_config(__FUNCTION__, __LINE__); \ +} while (0) + /* * When the hpet driver (/dev/hpet) is enabled, we need to reserve * timer 0 and timer 1 in case of RTC emulation. @@ -191,27 +231,37 @@ static struct clock_event_device hpet_cl .rating = 50, }; -static void hpet_start_counter(void) +static void hpet_stop_counter(void) { unsigned long cfg = hpet_readl(HPET_CFG); - cfg &= ~HPET_CFG_ENABLE; hpet_writel(cfg, HPET_CFG); hpet_writel(0, HPET_COUNTER); hpet_writel(0, HPET_COUNTER + 4); +} + +static void hpet_start_counter(void) +{ + unsigned long cfg = hpet_readl(HPET_CFG); cfg |= HPET_CFG_ENABLE; hpet_writel(cfg, HPET_CFG); } +static void hpet_restart_counter(void) +{ + hpet_stop_counter(); + hpet_start_counter(); +} + static void hpet_resume_device(void) { force_hpet_resume(); } -static void hpet_restart_counter(void) +static void hpet_resume_counter(void) { hpet_resume_device(); - hpet_start_counter(); + hpet_restart_counter(); } static void hpet_enable_legacy_int(void) @@ -259,29 +309,23 @@ static int hpet_setup_msi_irq(unsigned i static void hpet_set_mode(enum clock_event_mode mode, struct clock_event_device *evt, int timer) { - unsigned long cfg, cmp, now; + unsigned long cfg; uint64_t delta; switch (mode) { case CLOCK_EVT_MODE_PERIODIC: + hpet_stop_counter(); delta = ((uint64_t)(NSEC_PER_SEC/HZ)) * evt->mult; delta >>= evt->shift; - now = hpet_readl(HPET_COUNTER); - cmp = now + (unsigned long) delta; cfg = hpet_readl(HPET_Tn_CFG(timer)); /* Make sure we use edge triggered interrupts */ cfg &= ~HPET_TN_LEVEL; cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL | HPET_TN_32BIT; hpet_writel(cfg, HPET_Tn_CFG(timer)); - /* - * The first write after writing TN_SETVAL to the - * config register sets the counter value, the second - * write sets the period. - */ - hpet_writel(cmp, HPET_Tn_CMP(timer)); - udelay(1); hpet_writel((unsigned long) delta, HPET_Tn_CMP(timer)); + hpet_start_counter(); + hpet_print_config(); break; case CLOCK_EVT_MODE_ONESHOT: @@ -308,6 +352,7 @@ static void hpet_set_mode(enum clock_eve irq_set_affinity(hdev->irq, cpumask_of(hdev->cpu)); enable_irq(hdev->irq); } + hpet_print_config(); break; } } @@ -526,6 +571,7 @@ static void hpet_msi_capability_lookup(u num_timers = ((id & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT); num_timers++; /* Value read out starts from 0 */ + hpet_print_config(); hpet_devs = kzalloc(sizeof(struct hpet_dev) * num_timers, GFP_KERNEL); if (!hpet_devs) @@ -695,7 +741,7 @@ static struct clocksource clocksource_hp .mask = HPET_MASK, .shift = HPET_SHIFT, .flags = CLOCK_SOURCE_IS_CONTINUOUS, - .resume = hpet_restart_counter, + .resume = hpet_resume_counter, #ifdef CONFIG_X86_64 .vread = vread_hpet, #endif @@ -707,7 +753,7 @@ static int hpet_clocksource_register(voi cycle_t t1; /* Start the counter */ - hpet_start_counter(); + hpet_restart_counter(); /* Verify whether hpet counter works */ t1 = read_hpet(); @@ -793,6 +839,7 @@ int __init hpet_enable(void) * information and the number of channels */ id = hpet_readl(HPET_ID); + hpet_print_config(); #ifdef CONFIG_HPET_EMULATE_RTC /* @@ -845,6 +892,7 @@ static __init int hpet_late_init(void) return -ENODEV; hpet_reserve_platform_timers(hpet_readl(HPET_ID)); + hpet_print_config(); for_each_online_cpu(cpu) { hpet_cpuhp_notify(NULL, CPU_ONLINE, (void *)(long)cpu); Index: linux-2.6-tip/arch/x86/kernel/i8259.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/i8259.c +++ linux-2.6-tip/arch/x86/kernel/i8259.c @@ -22,7 +22,6 @@ #include #include #include -#include #include /* @@ -33,8 +32,8 @@ */ static int i8259A_auto_eoi; -DEFINE_SPINLOCK(i8259A_lock); static void mask_and_ack_8259A(unsigned int); +DEFINE_RAW_SPINLOCK(i8259A_lock); struct irq_chip i8259A_chip = { .name = "XT-PIC", @@ -169,6 +168,8 @@ static void mask_and_ack_8259A(unsigned */ if (cached_irq_mask & irqmask) goto spurious_8259A_irq; + if (irq & 8) + outb(0x60+(irq&7), PIC_SLAVE_CMD); /* 'Specific EOI' to slave */ cached_irq_mask |= irqmask; handle_real_irq: @@ -329,10 +330,10 @@ void init_8259A(int auto_eoi) /* 8259A-1 (the master) has a slave on IR2 */ outb_pic(1U << PIC_CASCADE_IR, PIC_MASTER_IMR); - if (auto_eoi) /* master does Auto EOI */ - outb_pic(MASTER_ICW4_DEFAULT | PIC_ICW4_AEOI, PIC_MASTER_IMR); - else /* master expects normal EOI */ - outb_pic(MASTER_ICW4_DEFAULT, PIC_MASTER_IMR); + if (!auto_eoi) /* master expects normal EOI */ + outb_p(MASTER_ICW4_DEFAULT, PIC_MASTER_IMR); + else /* master does Auto EOI */ + outb_p(MASTER_ICW4_DEFAULT | PIC_ICW4_AEOI, PIC_MASTER_IMR); outb_pic(0x11, PIC_SLAVE_CMD); /* ICW1: select 8259A-2 init */ Index: linux-2.6-tip/arch/x86/kernel/io_apic.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/io_apic.c +++ /dev/null @@ -1,4169 +0,0 @@ -/* - * Intel IO-APIC support for multi-Pentium hosts. - * - * Copyright (C) 1997, 1998, 1999, 2000 Ingo Molnar, Hajnalka Szabo - * - * Many thanks to Stig Venaas for trying out countless experimental - * patches and reporting/debugging problems patiently! - * - * (c) 1999, Multiple IO-APIC support, developed by - * Ken-ichi Yaku and - * Hidemi Kishimoto , - * further tested and cleaned up by Zach Brown - * and Ingo Molnar - * - * Fixes - * Maciej W. Rozycki : Bits for genuine 82489DX APICs; - * thanks to Eric Gilmore - * and Rolf G. Tews - * for testing these extensively - * Paul Diefenbaugh : Added full ACPI support - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include /* time_after() */ -#ifdef CONFIG_ACPI -#include -#endif -#include -#include -#include - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include -#include - -#define __apicdebuginit(type) static type __init - -/* - * Is the SiS APIC rmw bug present ? - * -1 = don't know, 0 = no, 1 = yes - */ -int sis_apic_bug = -1; - -static DEFINE_SPINLOCK(ioapic_lock); -static DEFINE_SPINLOCK(vector_lock); - -/* - * # of IRQ routing registers - */ -int nr_ioapic_registers[MAX_IO_APICS]; - -/* I/O APIC entries */ -struct mp_config_ioapic mp_ioapics[MAX_IO_APICS]; -int nr_ioapics; - -/* MP IRQ source entries */ -struct mp_config_intsrc mp_irqs[MAX_IRQ_SOURCES]; - -/* # of MP IRQ source entries */ -int mp_irq_entries; - -#if defined (CONFIG_MCA) || defined (CONFIG_EISA) -int mp_bus_id_to_type[MAX_MP_BUSSES]; -#endif - -DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES); - -int skip_ioapic_setup; - -static int __init parse_noapic(char *str) -{ - /* disable IO-APIC */ - disable_ioapic_setup(); - return 0; -} -early_param("noapic", parse_noapic); - -struct irq_pin_list; - -/* - * This is performance-critical, we want to do it O(1) - * - * the indexing order of this array favors 1:1 mappings - * between pins and IRQs. - */ - -struct irq_pin_list { - int apic, pin; - struct irq_pin_list *next; -}; - -static struct irq_pin_list *get_one_free_irq_2_pin(int cpu) -{ - struct irq_pin_list *pin; - int node; - - node = cpu_to_node(cpu); - - pin = kzalloc_node(sizeof(*pin), GFP_ATOMIC, node); - - return pin; -} - -struct irq_cfg { - struct irq_pin_list *irq_2_pin; - cpumask_var_t domain; - cpumask_var_t old_domain; - unsigned move_cleanup_count; - u8 vector; - u8 move_in_progress : 1; -#ifdef CONFIG_NUMA_MIGRATE_IRQ_DESC - u8 move_desc_pending : 1; -#endif -}; - -/* irq_cfg is indexed by the sum of all RTEs in all I/O APICs. */ -#ifdef CONFIG_SPARSE_IRQ -static struct irq_cfg irq_cfgx[] = { -#else -static struct irq_cfg irq_cfgx[NR_IRQS] = { -#endif - [0] = { .vector = IRQ0_VECTOR, }, - [1] = { .vector = IRQ1_VECTOR, }, - [2] = { .vector = IRQ2_VECTOR, }, - [3] = { .vector = IRQ3_VECTOR, }, - [4] = { .vector = IRQ4_VECTOR, }, - [5] = { .vector = IRQ5_VECTOR, }, - [6] = { .vector = IRQ6_VECTOR, }, - [7] = { .vector = IRQ7_VECTOR, }, - [8] = { .vector = IRQ8_VECTOR, }, - [9] = { .vector = IRQ9_VECTOR, }, - [10] = { .vector = IRQ10_VECTOR, }, - [11] = { .vector = IRQ11_VECTOR, }, - [12] = { .vector = IRQ12_VECTOR, }, - [13] = { .vector = IRQ13_VECTOR, }, - [14] = { .vector = IRQ14_VECTOR, }, - [15] = { .vector = IRQ15_VECTOR, }, -}; - -int __init arch_early_irq_init(void) -{ - struct irq_cfg *cfg; - struct irq_desc *desc; - int count; - int i; - - cfg = irq_cfgx; - count = ARRAY_SIZE(irq_cfgx); - - for (i = 0; i < count; i++) { - desc = irq_to_desc(i); - desc->chip_data = &cfg[i]; - alloc_bootmem_cpumask_var(&cfg[i].domain); - alloc_bootmem_cpumask_var(&cfg[i].old_domain); - if (i < NR_IRQS_LEGACY) - cpumask_setall(cfg[i].domain); - } - - return 0; -} - -#ifdef CONFIG_SPARSE_IRQ -static struct irq_cfg *irq_cfg(unsigned int irq) -{ - struct irq_cfg *cfg = NULL; - struct irq_desc *desc; - - desc = irq_to_desc(irq); - if (desc) - cfg = desc->chip_data; - - return cfg; -} - -static struct irq_cfg *get_one_free_irq_cfg(int cpu) -{ - struct irq_cfg *cfg; - int node; - - node = cpu_to_node(cpu); - - cfg = kzalloc_node(sizeof(*cfg), GFP_ATOMIC, node); - if (cfg) { - if (!alloc_cpumask_var_node(&cfg->domain, GFP_ATOMIC, node)) { - kfree(cfg); - cfg = NULL; - } else if (!alloc_cpumask_var_node(&cfg->old_domain, - GFP_ATOMIC, node)) { - free_cpumask_var(cfg->domain); - kfree(cfg); - cfg = NULL; - } else { - cpumask_clear(cfg->domain); - cpumask_clear(cfg->old_domain); - } - } - - return cfg; -} - -int arch_init_chip_data(struct irq_desc *desc, int cpu) -{ - struct irq_cfg *cfg; - - cfg = desc->chip_data; - if (!cfg) { - desc->chip_data = get_one_free_irq_cfg(cpu); - if (!desc->chip_data) { - printk(KERN_ERR "can not alloc irq_cfg\n"); - BUG_ON(1); - } - } - - return 0; -} - -#ifdef CONFIG_NUMA_MIGRATE_IRQ_DESC - -static void -init_copy_irq_2_pin(struct irq_cfg *old_cfg, struct irq_cfg *cfg, int cpu) -{ - struct irq_pin_list *old_entry, *head, *tail, *entry; - - cfg->irq_2_pin = NULL; - old_entry = old_cfg->irq_2_pin; - if (!old_entry) - return; - - entry = get_one_free_irq_2_pin(cpu); - if (!entry) - return; - - entry->apic = old_entry->apic; - entry->pin = old_entry->pin; - head = entry; - tail = entry; - old_entry = old_entry->next; - while (old_entry) { - entry = get_one_free_irq_2_pin(cpu); - if (!entry) { - entry = head; - while (entry) { - head = entry->next; - kfree(entry); - entry = head; - } - /* still use the old one */ - return; - } - entry->apic = old_entry->apic; - entry->pin = old_entry->pin; - tail->next = entry; - tail = entry; - old_entry = old_entry->next; - } - - tail->next = NULL; - cfg->irq_2_pin = head; -} - -static void free_irq_2_pin(struct irq_cfg *old_cfg, struct irq_cfg *cfg) -{ - struct irq_pin_list *entry, *next; - - if (old_cfg->irq_2_pin == cfg->irq_2_pin) - return; - - entry = old_cfg->irq_2_pin; - - while (entry) { - next = entry->next; - kfree(entry); - entry = next; - } - old_cfg->irq_2_pin = NULL; -} - -void arch_init_copy_chip_data(struct irq_desc *old_desc, - struct irq_desc *desc, int cpu) -{ - struct irq_cfg *cfg; - struct irq_cfg *old_cfg; - - cfg = get_one_free_irq_cfg(cpu); - - if (!cfg) - return; - - desc->chip_data = cfg; - - old_cfg = old_desc->chip_data; - - memcpy(cfg, old_cfg, sizeof(struct irq_cfg)); - - init_copy_irq_2_pin(old_cfg, cfg, cpu); -} - -static void free_irq_cfg(struct irq_cfg *old_cfg) -{ - kfree(old_cfg); -} - -void arch_free_chip_data(struct irq_desc *old_desc, struct irq_desc *desc) -{ - struct irq_cfg *old_cfg, *cfg; - - old_cfg = old_desc->chip_data; - cfg = desc->chip_data; - - if (old_cfg == cfg) - return; - - if (old_cfg) { - free_irq_2_pin(old_cfg, cfg); - free_irq_cfg(old_cfg); - old_desc->chip_data = NULL; - } -} - -static void -set_extra_move_desc(struct irq_desc *desc, const struct cpumask *mask) -{ - struct irq_cfg *cfg = desc->chip_data; - - if (!cfg->move_in_progress) { - /* it means that domain is not changed */ - if (!cpumask_intersects(&desc->affinity, mask)) - cfg->move_desc_pending = 1; - } -} -#endif - -#else -static struct irq_cfg *irq_cfg(unsigned int irq) -{ - return irq < nr_irqs ? irq_cfgx + irq : NULL; -} - -#endif - -#ifndef CONFIG_NUMA_MIGRATE_IRQ_DESC -static inline void -set_extra_move_desc(struct irq_desc *desc, const struct cpumask *mask) -{ -} -#endif - -struct io_apic { - unsigned int index; - unsigned int unused[3]; - unsigned int data; -}; - -static __attribute_const__ struct io_apic __iomem *io_apic_base(int idx) -{ - return (void __iomem *) __fix_to_virt(FIX_IO_APIC_BASE_0 + idx) - + (mp_ioapics[idx].mp_apicaddr & ~PAGE_MASK); -} - -static inline unsigned int io_apic_read(unsigned int apic, unsigned int reg) -{ - struct io_apic __iomem *io_apic = io_apic_base(apic); - writel(reg, &io_apic->index); - return readl(&io_apic->data); -} - -static inline void io_apic_write(unsigned int apic, unsigned int reg, unsigned int value) -{ - struct io_apic __iomem *io_apic = io_apic_base(apic); - writel(reg, &io_apic->index); - writel(value, &io_apic->data); -} - -/* - * Re-write a value: to be used for read-modify-write - * cycles where the read already set up the index register. - * - * Older SiS APIC requires we rewrite the index register - */ -static inline void io_apic_modify(unsigned int apic, unsigned int reg, unsigned int value) -{ - struct io_apic __iomem *io_apic = io_apic_base(apic); - - if (sis_apic_bug) - writel(reg, &io_apic->index); - writel(value, &io_apic->data); -} - -static bool io_apic_level_ack_pending(struct irq_cfg *cfg) -{ - struct irq_pin_list *entry; - unsigned long flags; - - spin_lock_irqsave(&ioapic_lock, flags); - entry = cfg->irq_2_pin; - for (;;) { - unsigned int reg; - int pin; - - if (!entry) - break; - pin = entry->pin; - reg = io_apic_read(entry->apic, 0x10 + pin*2); - /* Is the remote IRR bit set? */ - if (reg & IO_APIC_REDIR_REMOTE_IRR) { - spin_unlock_irqrestore(&ioapic_lock, flags); - return true; - } - if (!entry->next) - break; - entry = entry->next; - } - spin_unlock_irqrestore(&ioapic_lock, flags); - - return false; -} - -union entry_union { - struct { u32 w1, w2; }; - struct IO_APIC_route_entry entry; -}; - -static struct IO_APIC_route_entry ioapic_read_entry(int apic, int pin) -{ - union entry_union eu; - unsigned long flags; - spin_lock_irqsave(&ioapic_lock, flags); - eu.w1 = io_apic_read(apic, 0x10 + 2 * pin); - eu.w2 = io_apic_read(apic, 0x11 + 2 * pin); - spin_unlock_irqrestore(&ioapic_lock, flags); - return eu.entry; -} - -/* - * When we write a new IO APIC routing entry, we need to write the high - * word first! If the mask bit in the low word is clear, we will enable - * the interrupt, and we need to make sure the entry is fully populated - * before that happens. - */ -static void -__ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e) -{ - union entry_union eu; - eu.entry = e; - io_apic_write(apic, 0x11 + 2*pin, eu.w2); - io_apic_write(apic, 0x10 + 2*pin, eu.w1); -} - -static void ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e) -{ - unsigned long flags; - spin_lock_irqsave(&ioapic_lock, flags); - __ioapic_write_entry(apic, pin, e); - spin_unlock_irqrestore(&ioapic_lock, flags); -} - -/* - * When we mask an IO APIC routing entry, we need to write the low - * word first, in order to set the mask bit before we change the - * high bits! - */ -static void ioapic_mask_entry(int apic, int pin) -{ - unsigned long flags; - union entry_union eu = { .entry.mask = 1 }; - - spin_lock_irqsave(&ioapic_lock, flags); - io_apic_write(apic, 0x10 + 2*pin, eu.w1); - io_apic_write(apic, 0x11 + 2*pin, eu.w2); - spin_unlock_irqrestore(&ioapic_lock, flags); -} - -#ifdef CONFIG_SMP -static void send_cleanup_vector(struct irq_cfg *cfg) -{ - cpumask_var_t cleanup_mask; - - if (unlikely(!alloc_cpumask_var(&cleanup_mask, GFP_ATOMIC))) { - unsigned int i; - cfg->move_cleanup_count = 0; - for_each_cpu_and(i, cfg->old_domain, cpu_online_mask) - cfg->move_cleanup_count++; - for_each_cpu_and(i, cfg->old_domain, cpu_online_mask) - send_IPI_mask(cpumask_of(i), IRQ_MOVE_CLEANUP_VECTOR); - } else { - cpumask_and(cleanup_mask, cfg->old_domain, cpu_online_mask); - cfg->move_cleanup_count = cpumask_weight(cleanup_mask); - send_IPI_mask(cleanup_mask, IRQ_MOVE_CLEANUP_VECTOR); - free_cpumask_var(cleanup_mask); - } - cfg->move_in_progress = 0; -} - -static void __target_IO_APIC_irq(unsigned int irq, unsigned int dest, struct irq_cfg *cfg) -{ - int apic, pin; - struct irq_pin_list *entry; - u8 vector = cfg->vector; - - entry = cfg->irq_2_pin; - for (;;) { - unsigned int reg; - - if (!entry) - break; - - apic = entry->apic; - pin = entry->pin; -#ifdef CONFIG_INTR_REMAP - /* - * With interrupt-remapping, destination information comes - * from interrupt-remapping table entry. - */ - if (!irq_remapped(irq)) - io_apic_write(apic, 0x11 + pin*2, dest); -#else - io_apic_write(apic, 0x11 + pin*2, dest); -#endif - reg = io_apic_read(apic, 0x10 + pin*2); - reg &= ~IO_APIC_REDIR_VECTOR_MASK; - reg |= vector; - io_apic_modify(apic, 0x10 + pin*2, reg); - if (!entry->next) - break; - entry = entry->next; - } -} - -static int -assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask); - -/* - * Either sets desc->affinity to a valid value, and returns cpu_mask_to_apicid - * of that, or returns BAD_APICID and leaves desc->affinity untouched. - */ -static unsigned int -set_desc_affinity(struct irq_desc *desc, const struct cpumask *mask) -{ - struct irq_cfg *cfg; - unsigned int irq; - - if (!cpumask_intersects(mask, cpu_online_mask)) - return BAD_APICID; - - irq = desc->irq; - cfg = desc->chip_data; - if (assign_irq_vector(irq, cfg, mask)) - return BAD_APICID; - - cpumask_and(&desc->affinity, cfg->domain, mask); - set_extra_move_desc(desc, mask); - return cpu_mask_to_apicid_and(&desc->affinity, cpu_online_mask); -} - -static void -set_ioapic_affinity_irq_desc(struct irq_desc *desc, const struct cpumask *mask) -{ - struct irq_cfg *cfg; - unsigned long flags; - unsigned int dest; - unsigned int irq; - - irq = desc->irq; - cfg = desc->chip_data; - - spin_lock_irqsave(&ioapic_lock, flags); - dest = set_desc_affinity(desc, mask); - if (dest != BAD_APICID) { - /* Only the high 8 bits are valid. */ - dest = SET_APIC_LOGICAL_ID(dest); - __target_IO_APIC_irq(irq, dest, cfg); - } - spin_unlock_irqrestore(&ioapic_lock, flags); -} - -static void -set_ioapic_affinity_irq(unsigned int irq, const struct cpumask *mask) -{ - struct irq_desc *desc; - - desc = irq_to_desc(irq); - - set_ioapic_affinity_irq_desc(desc, mask); -} -#endif /* CONFIG_SMP */ - -/* - * The common case is 1:1 IRQ<->pin mappings. Sometimes there are - * shared ISA-space IRQs, so we have to support them. We are super - * fast in the common case, and fast for shared ISA-space IRQs. - */ -static void add_pin_to_irq_cpu(struct irq_cfg *cfg, int cpu, int apic, int pin) -{ - struct irq_pin_list *entry; - - entry = cfg->irq_2_pin; - if (!entry) { - entry = get_one_free_irq_2_pin(cpu); - if (!entry) { - printk(KERN_ERR "can not alloc irq_2_pin to add %d - %d\n", - apic, pin); - return; - } - cfg->irq_2_pin = entry; - entry->apic = apic; - entry->pin = pin; - return; - } - - while (entry->next) { - /* not again, please */ - if (entry->apic == apic && entry->pin == pin) - return; - - entry = entry->next; - } - - entry->next = get_one_free_irq_2_pin(cpu); - entry = entry->next; - entry->apic = apic; - entry->pin = pin; -} - -/* - * Reroute an IRQ to a different pin. - */ -static void __init replace_pin_at_irq_cpu(struct irq_cfg *cfg, int cpu, - int oldapic, int oldpin, - int newapic, int newpin) -{ - struct irq_pin_list *entry = cfg->irq_2_pin; - int replaced = 0; - - while (entry) { - if (entry->apic == oldapic && entry->pin == oldpin) { - entry->apic = newapic; - entry->pin = newpin; - replaced = 1; - /* every one is different, right? */ - break; - } - entry = entry->next; - } - - /* why? call replace before add? */ - if (!replaced) - add_pin_to_irq_cpu(cfg, cpu, newapic, newpin); -} - -static inline void io_apic_modify_irq(struct irq_cfg *cfg, - int mask_and, int mask_or, - void (*final)(struct irq_pin_list *entry)) -{ - int pin; - struct irq_pin_list *entry; - - for (entry = cfg->irq_2_pin; entry != NULL; entry = entry->next) { - unsigned int reg; - pin = entry->pin; - reg = io_apic_read(entry->apic, 0x10 + pin * 2); - reg &= mask_and; - reg |= mask_or; - io_apic_modify(entry->apic, 0x10 + pin * 2, reg); - if (final) - final(entry); - } -} - -static void __unmask_IO_APIC_irq(struct irq_cfg *cfg) -{ - io_apic_modify_irq(cfg, ~IO_APIC_REDIR_MASKED, 0, NULL); -} - -#ifdef CONFIG_X86_64 -static void io_apic_sync(struct irq_pin_list *entry) -{ - /* - * Synchronize the IO-APIC and the CPU by doing - * a dummy read from the IO-APIC - */ - struct io_apic __iomem *io_apic; - io_apic = io_apic_base(entry->apic); - readl(&io_apic->data); -} - -static void __mask_IO_APIC_irq(struct irq_cfg *cfg) -{ - io_apic_modify_irq(cfg, ~0, IO_APIC_REDIR_MASKED, &io_apic_sync); -} -#else /* CONFIG_X86_32 */ -static void __mask_IO_APIC_irq(struct irq_cfg *cfg) -{ - io_apic_modify_irq(cfg, ~0, IO_APIC_REDIR_MASKED, NULL); -} - -static void __mask_and_edge_IO_APIC_irq(struct irq_cfg *cfg) -{ - io_apic_modify_irq(cfg, ~IO_APIC_REDIR_LEVEL_TRIGGER, - IO_APIC_REDIR_MASKED, NULL); -} - -static void __unmask_and_level_IO_APIC_irq(struct irq_cfg *cfg) -{ - io_apic_modify_irq(cfg, ~IO_APIC_REDIR_MASKED, - IO_APIC_REDIR_LEVEL_TRIGGER, NULL); -} -#endif /* CONFIG_X86_32 */ - -static void mask_IO_APIC_irq_desc(struct irq_desc *desc) -{ - struct irq_cfg *cfg = desc->chip_data; - unsigned long flags; - - BUG_ON(!cfg); - - spin_lock_irqsave(&ioapic_lock, flags); - __mask_IO_APIC_irq(cfg); - spin_unlock_irqrestore(&ioapic_lock, flags); -} - -static void unmask_IO_APIC_irq_desc(struct irq_desc *desc) -{ - struct irq_cfg *cfg = desc->chip_data; - unsigned long flags; - - spin_lock_irqsave(&ioapic_lock, flags); - __unmask_IO_APIC_irq(cfg); - spin_unlock_irqrestore(&ioapic_lock, flags); -} - -static void mask_IO_APIC_irq(unsigned int irq) -{ - struct irq_desc *desc = irq_to_desc(irq); - - mask_IO_APIC_irq_desc(desc); -} -static void unmask_IO_APIC_irq(unsigned int irq) -{ - struct irq_desc *desc = irq_to_desc(irq); - - unmask_IO_APIC_irq_desc(desc); -} - -static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin) -{ - struct IO_APIC_route_entry entry; - - /* Check delivery_mode to be sure we're not clearing an SMI pin */ - entry = ioapic_read_entry(apic, pin); - if (entry.delivery_mode == dest_SMI) - return; - /* - * Disable it in the IO-APIC irq-routing table: - */ - ioapic_mask_entry(apic, pin); -} - -static void clear_IO_APIC (void) -{ - int apic, pin; - - for (apic = 0; apic < nr_ioapics; apic++) - for (pin = 0; pin < nr_ioapic_registers[apic]; pin++) - clear_IO_APIC_pin(apic, pin); -} - -#if !defined(CONFIG_SMP) && defined(CONFIG_X86_32) -void send_IPI_self(int vector) -{ - unsigned int cfg; - - /* - * Wait for idle. - */ - apic_wait_icr_idle(); - cfg = APIC_DM_FIXED | APIC_DEST_SELF | vector | APIC_DEST_LOGICAL; - /* - * Send the IPI. The write to APIC_ICR fires this off. - */ - apic_write(APIC_ICR, cfg); -} -#endif /* !CONFIG_SMP && CONFIG_X86_32*/ - -#ifdef CONFIG_X86_32 -/* - * support for broken MP BIOSs, enables hand-redirection of PIRQ0-7 to - * specific CPU-side IRQs. - */ - -#define MAX_PIRQS 8 -static int pirq_entries [MAX_PIRQS]; -static int pirqs_enabled; - -static int __init ioapic_pirq_setup(char *str) -{ - int i, max; - int ints[MAX_PIRQS+1]; - - get_options(str, ARRAY_SIZE(ints), ints); - - for (i = 0; i < MAX_PIRQS; i++) - pirq_entries[i] = -1; - - pirqs_enabled = 1; - apic_printk(APIC_VERBOSE, KERN_INFO - "PIRQ redirection, working around broken MP-BIOS.\n"); - max = MAX_PIRQS; - if (ints[0] < MAX_PIRQS) - max = ints[0]; - - for (i = 0; i < max; i++) { - apic_printk(APIC_VERBOSE, KERN_DEBUG - "... PIRQ%d -> IRQ %d\n", i, ints[i+1]); - /* - * PIRQs are mapped upside down, usually. - */ - pirq_entries[MAX_PIRQS-i-1] = ints[i+1]; - } - return 1; -} - -__setup("pirq=", ioapic_pirq_setup); -#endif /* CONFIG_X86_32 */ - -#ifdef CONFIG_INTR_REMAP -/* I/O APIC RTE contents at the OS boot up */ -static struct IO_APIC_route_entry *early_ioapic_entries[MAX_IO_APICS]; - -/* - * Saves and masks all the unmasked IO-APIC RTE's - */ -int save_mask_IO_APIC_setup(void) -{ - union IO_APIC_reg_01 reg_01; - unsigned long flags; - int apic, pin; - - /* - * The number of IO-APIC IRQ registers (== #pins): - */ - for (apic = 0; apic < nr_ioapics; apic++) { - spin_lock_irqsave(&ioapic_lock, flags); - reg_01.raw = io_apic_read(apic, 1); - spin_unlock_irqrestore(&ioapic_lock, flags); - nr_ioapic_registers[apic] = reg_01.bits.entries+1; - } - - for (apic = 0; apic < nr_ioapics; apic++) { - early_ioapic_entries[apic] = - kzalloc(sizeof(struct IO_APIC_route_entry) * - nr_ioapic_registers[apic], GFP_KERNEL); - if (!early_ioapic_entries[apic]) - goto nomem; - } - - for (apic = 0; apic < nr_ioapics; apic++) - for (pin = 0; pin < nr_ioapic_registers[apic]; pin++) { - struct IO_APIC_route_entry entry; - - entry = early_ioapic_entries[apic][pin] = - ioapic_read_entry(apic, pin); - if (!entry.mask) { - entry.mask = 1; - ioapic_write_entry(apic, pin, entry); - } - } - - return 0; - -nomem: - while (apic >= 0) - kfree(early_ioapic_entries[apic--]); - memset(early_ioapic_entries, 0, - ARRAY_SIZE(early_ioapic_entries)); - - return -ENOMEM; -} - -void restore_IO_APIC_setup(void) -{ - int apic, pin; - - for (apic = 0; apic < nr_ioapics; apic++) { - if (!early_ioapic_entries[apic]) - break; - for (pin = 0; pin < nr_ioapic_registers[apic]; pin++) - ioapic_write_entry(apic, pin, - early_ioapic_entries[apic][pin]); - kfree(early_ioapic_entries[apic]); - early_ioapic_entries[apic] = NULL; - } -} - -void reinit_intr_remapped_IO_APIC(int intr_remapping) -{ - /* - * for now plain restore of previous settings. - * TBD: In the case of OS enabling interrupt-remapping, - * IO-APIC RTE's need to be setup to point to interrupt-remapping - * table entries. for now, do a plain restore, and wait for - * the setup_IO_APIC_irqs() to do proper initialization. - */ - restore_IO_APIC_setup(); -} -#endif - -/* - * Find the IRQ entry number of a certain pin. - */ -static int find_irq_entry(int apic, int pin, int type) -{ - int i; - - for (i = 0; i < mp_irq_entries; i++) - if (mp_irqs[i].mp_irqtype == type && - (mp_irqs[i].mp_dstapic == mp_ioapics[apic].mp_apicid || - mp_irqs[i].mp_dstapic == MP_APIC_ALL) && - mp_irqs[i].mp_dstirq == pin) - return i; - - return -1; -} - -/* - * Find the pin to which IRQ[irq] (ISA) is connected - */ -static int __init find_isa_irq_pin(int irq, int type) -{ - int i; - - for (i = 0; i < mp_irq_entries; i++) { - int lbus = mp_irqs[i].mp_srcbus; - - if (test_bit(lbus, mp_bus_not_pci) && - (mp_irqs[i].mp_irqtype == type) && - (mp_irqs[i].mp_srcbusirq == irq)) - - return mp_irqs[i].mp_dstirq; - } - return -1; -} - -static int __init find_isa_irq_apic(int irq, int type) -{ - int i; - - for (i = 0; i < mp_irq_entries; i++) { - int lbus = mp_irqs[i].mp_srcbus; - - if (test_bit(lbus, mp_bus_not_pci) && - (mp_irqs[i].mp_irqtype == type) && - (mp_irqs[i].mp_srcbusirq == irq)) - break; - } - if (i < mp_irq_entries) { - int apic; - for(apic = 0; apic < nr_ioapics; apic++) { - if (mp_ioapics[apic].mp_apicid == mp_irqs[i].mp_dstapic) - return apic; - } - } - - return -1; -} - -/* - * Find a specific PCI IRQ entry. - * Not an __init, possibly needed by modules - */ -static int pin_2_irq(int idx, int apic, int pin); - -int IO_APIC_get_PCI_irq_vector(int bus, int slot, int pin) -{ - int apic, i, best_guess = -1; - - apic_printk(APIC_DEBUG, "querying PCI -> IRQ mapping bus:%d, slot:%d, pin:%d.\n", - bus, slot, pin); - if (test_bit(bus, mp_bus_not_pci)) { - apic_printk(APIC_VERBOSE, "PCI BIOS passed nonexistent PCI bus %d!\n", bus); - return -1; - } - for (i = 0; i < mp_irq_entries; i++) { - int lbus = mp_irqs[i].mp_srcbus; - - for (apic = 0; apic < nr_ioapics; apic++) - if (mp_ioapics[apic].mp_apicid == mp_irqs[i].mp_dstapic || - mp_irqs[i].mp_dstapic == MP_APIC_ALL) - break; - - if (!test_bit(lbus, mp_bus_not_pci) && - !mp_irqs[i].mp_irqtype && - (bus == lbus) && - (slot == ((mp_irqs[i].mp_srcbusirq >> 2) & 0x1f))) { - int irq = pin_2_irq(i,apic,mp_irqs[i].mp_dstirq); - - if (!(apic || IO_APIC_IRQ(irq))) - continue; - - if (pin == (mp_irqs[i].mp_srcbusirq & 3)) - return irq; - /* - * Use the first all-but-pin matching entry as a - * best-guess fuzzy result for broken mptables. - */ - if (best_guess < 0) - best_guess = irq; - } - } - return best_guess; -} - -EXPORT_SYMBOL(IO_APIC_get_PCI_irq_vector); - -#if defined(CONFIG_EISA) || defined(CONFIG_MCA) -/* - * EISA Edge/Level control register, ELCR - */ -static int EISA_ELCR(unsigned int irq) -{ - if (irq < NR_IRQS_LEGACY) { - unsigned int port = 0x4d0 + (irq >> 3); - return (inb(port) >> (irq & 7)) & 1; - } - apic_printk(APIC_VERBOSE, KERN_INFO - "Broken MPtable reports ISA irq %d\n", irq); - return 0; -} - -#endif - -/* ISA interrupts are always polarity zero edge triggered, - * when listed as conforming in the MP table. */ - -#define default_ISA_trigger(idx) (0) -#define default_ISA_polarity(idx) (0) - -/* EISA interrupts are always polarity zero and can be edge or level - * trigger depending on the ELCR value. If an interrupt is listed as - * EISA conforming in the MP table, that means its trigger type must - * be read in from the ELCR */ - -#define default_EISA_trigger(idx) (EISA_ELCR(mp_irqs[idx].mp_srcbusirq)) -#define default_EISA_polarity(idx) default_ISA_polarity(idx) - -/* PCI interrupts are always polarity one level triggered, - * when listed as conforming in the MP table. */ - -#define default_PCI_trigger(idx) (1) -#define default_PCI_polarity(idx) (1) - -/* MCA interrupts are always polarity zero level triggered, - * when listed as conforming in the MP table. */ - -#define default_MCA_trigger(idx) (1) -#define default_MCA_polarity(idx) default_ISA_polarity(idx) - -static int MPBIOS_polarity(int idx) -{ - int bus = mp_irqs[idx].mp_srcbus; - int polarity; - - /* - * Determine IRQ line polarity (high active or low active): - */ - switch (mp_irqs[idx].mp_irqflag & 3) - { - case 0: /* conforms, ie. bus-type dependent polarity */ - if (test_bit(bus, mp_bus_not_pci)) - polarity = default_ISA_polarity(idx); - else - polarity = default_PCI_polarity(idx); - break; - case 1: /* high active */ - { - polarity = 0; - break; - } - case 2: /* reserved */ - { - printk(KERN_WARNING "broken BIOS!!\n"); - polarity = 1; - break; - } - case 3: /* low active */ - { - polarity = 1; - break; - } - default: /* invalid */ - { - printk(KERN_WARNING "broken BIOS!!\n"); - polarity = 1; - break; - } - } - return polarity; -} - -static int MPBIOS_trigger(int idx) -{ - int bus = mp_irqs[idx].mp_srcbus; - int trigger; - - /* - * Determine IRQ trigger mode (edge or level sensitive): - */ - switch ((mp_irqs[idx].mp_irqflag>>2) & 3) - { - case 0: /* conforms, ie. bus-type dependent */ - if (test_bit(bus, mp_bus_not_pci)) - trigger = default_ISA_trigger(idx); - else - trigger = default_PCI_trigger(idx); -#if defined(CONFIG_EISA) || defined(CONFIG_MCA) - switch (mp_bus_id_to_type[bus]) { - case MP_BUS_ISA: /* ISA pin */ - { - /* set before the switch */ - break; - } - case MP_BUS_EISA: /* EISA pin */ - { - trigger = default_EISA_trigger(idx); - break; - } - case MP_BUS_PCI: /* PCI pin */ - { - /* set before the switch */ - break; - } - case MP_BUS_MCA: /* MCA pin */ - { - trigger = default_MCA_trigger(idx); - break; - } - default: - { - printk(KERN_WARNING "broken BIOS!!\n"); - trigger = 1; - break; - } - } -#endif - break; - case 1: /* edge */ - { - trigger = 0; - break; - } - case 2: /* reserved */ - { - printk(KERN_WARNING "broken BIOS!!\n"); - trigger = 1; - break; - } - case 3: /* level */ - { - trigger = 1; - break; - } - default: /* invalid */ - { - printk(KERN_WARNING "broken BIOS!!\n"); - trigger = 0; - break; - } - } - return trigger; -} - -static inline int irq_polarity(int idx) -{ - return MPBIOS_polarity(idx); -} - -static inline int irq_trigger(int idx) -{ - return MPBIOS_trigger(idx); -} - -int (*ioapic_renumber_irq)(int ioapic, int irq); -static int pin_2_irq(int idx, int apic, int pin) -{ - int irq, i; - int bus = mp_irqs[idx].mp_srcbus; - - /* - * Debugging check, we are in big trouble if this message pops up! - */ - if (mp_irqs[idx].mp_dstirq != pin) - printk(KERN_ERR "broken BIOS or MPTABLE parser, ayiee!!\n"); - - if (test_bit(bus, mp_bus_not_pci)) { - irq = mp_irqs[idx].mp_srcbusirq; - } else { - /* - * PCI IRQs are mapped in order - */ - i = irq = 0; - while (i < apic) - irq += nr_ioapic_registers[i++]; - irq += pin; - /* - * For MPS mode, so far only needed by ES7000 platform - */ - if (ioapic_renumber_irq) - irq = ioapic_renumber_irq(apic, irq); - } - -#ifdef CONFIG_X86_32 - /* - * PCI IRQ command line redirection. Yes, limits are hardcoded. - */ - if ((pin >= 16) && (pin <= 23)) { - if (pirq_entries[pin-16] != -1) { - if (!pirq_entries[pin-16]) { - apic_printk(APIC_VERBOSE, KERN_DEBUG - "disabling PIRQ%d\n", pin-16); - } else { - irq = pirq_entries[pin-16]; - apic_printk(APIC_VERBOSE, KERN_DEBUG - "using PIRQ%d -> IRQ %d\n", - pin-16, irq); - } - } - } -#endif - - return irq; -} - -void lock_vector_lock(void) -{ - /* Used to the online set of cpus does not change - * during assign_irq_vector. - */ - spin_lock(&vector_lock); -} - -void unlock_vector_lock(void) -{ - spin_unlock(&vector_lock); -} - -static int -__assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask) -{ - /* - * NOTE! The local APIC isn't very good at handling - * multiple interrupts at the same interrupt level. - * As the interrupt level is determined by taking the - * vector number and shifting that right by 4, we - * want to spread these out a bit so that they don't - * all fall in the same interrupt level. - * - * Also, we've got to be careful not to trash gate - * 0x80, because int 0x80 is hm, kind of importantish. ;) - */ - static int current_vector = FIRST_DEVICE_VECTOR, current_offset = 0; - unsigned int old_vector; - int cpu, err; - cpumask_var_t tmp_mask; - - if ((cfg->move_in_progress) || cfg->move_cleanup_count) - return -EBUSY; - - if (!alloc_cpumask_var(&tmp_mask, GFP_ATOMIC)) - return -ENOMEM; - - old_vector = cfg->vector; - if (old_vector) { - cpumask_and(tmp_mask, mask, cpu_online_mask); - cpumask_and(tmp_mask, cfg->domain, tmp_mask); - if (!cpumask_empty(tmp_mask)) { - free_cpumask_var(tmp_mask); - return 0; - } - } - - /* Only try and allocate irqs on cpus that are present */ - err = -ENOSPC; - for_each_cpu_and(cpu, mask, cpu_online_mask) { - int new_cpu; - int vector, offset; - - vector_allocation_domain(cpu, tmp_mask); - - vector = current_vector; - offset = current_offset; -next: - vector += 8; - if (vector >= first_system_vector) { - /* If out of vectors on large boxen, must share them. */ - offset = (offset + 1) % 8; - vector = FIRST_DEVICE_VECTOR + offset; - } - if (unlikely(current_vector == vector)) - continue; - - if (test_bit(vector, used_vectors)) - goto next; - - for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask) - if (per_cpu(vector_irq, new_cpu)[vector] != -1) - goto next; - /* Found one! */ - current_vector = vector; - current_offset = offset; - if (old_vector) { - cfg->move_in_progress = 1; - cpumask_copy(cfg->old_domain, cfg->domain); - } - for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask) - per_cpu(vector_irq, new_cpu)[vector] = irq; - cfg->vector = vector; - cpumask_copy(cfg->domain, tmp_mask); - err = 0; - break; - } - free_cpumask_var(tmp_mask); - return err; -} - -static int -assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask) -{ - int err; - unsigned long flags; - - spin_lock_irqsave(&vector_lock, flags); - err = __assign_irq_vector(irq, cfg, mask); - spin_unlock_irqrestore(&vector_lock, flags); - return err; -} - -static void __clear_irq_vector(int irq, struct irq_cfg *cfg) -{ - int cpu, vector; - - BUG_ON(!cfg->vector); - - vector = cfg->vector; - for_each_cpu_and(cpu, cfg->domain, cpu_online_mask) - per_cpu(vector_irq, cpu)[vector] = -1; - - cfg->vector = 0; - cpumask_clear(cfg->domain); - - if (likely(!cfg->move_in_progress)) - return; - for_each_cpu_and(cpu, cfg->old_domain, cpu_online_mask) { - for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; - vector++) { - if (per_cpu(vector_irq, cpu)[vector] != irq) - continue; - per_cpu(vector_irq, cpu)[vector] = -1; - break; - } - } - cfg->move_in_progress = 0; -} - -void __setup_vector_irq(int cpu) -{ - /* Initialize vector_irq on a new cpu */ - /* This function must be called with vector_lock held */ - int irq, vector; - struct irq_cfg *cfg; - struct irq_desc *desc; - - /* Mark the inuse vectors */ - for_each_irq_desc(irq, desc) { - cfg = desc->chip_data; - if (!cpumask_test_cpu(cpu, cfg->domain)) - continue; - vector = cfg->vector; - per_cpu(vector_irq, cpu)[vector] = irq; - } - /* Mark the free vectors */ - for (vector = 0; vector < NR_VECTORS; ++vector) { - irq = per_cpu(vector_irq, cpu)[vector]; - if (irq < 0) - continue; - - cfg = irq_cfg(irq); - if (!cpumask_test_cpu(cpu, cfg->domain)) - per_cpu(vector_irq, cpu)[vector] = -1; - } -} - -static struct irq_chip ioapic_chip; -#ifdef CONFIG_INTR_REMAP -static struct irq_chip ir_ioapic_chip; -#endif - -#define IOAPIC_AUTO -1 -#define IOAPIC_EDGE 0 -#define IOAPIC_LEVEL 1 - -#ifdef CONFIG_X86_32 -static inline int IO_APIC_irq_trigger(int irq) -{ - int apic, idx, pin; - - for (apic = 0; apic < nr_ioapics; apic++) { - for (pin = 0; pin < nr_ioapic_registers[apic]; pin++) { - idx = find_irq_entry(apic, pin, mp_INT); - if ((idx != -1) && (irq == pin_2_irq(idx, apic, pin))) - return irq_trigger(idx); - } - } - /* - * nonexistent IRQs are edge default - */ - return 0; -} -#else -static inline int IO_APIC_irq_trigger(int irq) -{ - return 1; -} -#endif - -static void ioapic_register_intr(int irq, struct irq_desc *desc, unsigned long trigger) -{ - - if ((trigger == IOAPIC_AUTO && IO_APIC_irq_trigger(irq)) || - trigger == IOAPIC_LEVEL) - desc->status |= IRQ_LEVEL; - else - desc->status &= ~IRQ_LEVEL; - -#ifdef CONFIG_INTR_REMAP - if (irq_remapped(irq)) { - desc->status |= IRQ_MOVE_PCNTXT; - if (trigger) - set_irq_chip_and_handler_name(irq, &ir_ioapic_chip, - handle_fasteoi_irq, - "fasteoi"); - else - set_irq_chip_and_handler_name(irq, &ir_ioapic_chip, - handle_edge_irq, "edge"); - return; - } -#endif - if ((trigger == IOAPIC_AUTO && IO_APIC_irq_trigger(irq)) || - trigger == IOAPIC_LEVEL) - set_irq_chip_and_handler_name(irq, &ioapic_chip, - handle_fasteoi_irq, - "fasteoi"); - else - set_irq_chip_and_handler_name(irq, &ioapic_chip, - handle_edge_irq, "edge"); -} - -static int setup_ioapic_entry(int apic, int irq, - struct IO_APIC_route_entry *entry, - unsigned int destination, int trigger, - int polarity, int vector) -{ - /* - * add it to the IO-APIC irq-routing table: - */ - memset(entry,0,sizeof(*entry)); - -#ifdef CONFIG_INTR_REMAP - if (intr_remapping_enabled) { - struct intel_iommu *iommu = map_ioapic_to_ir(apic); - struct irte irte; - struct IR_IO_APIC_route_entry *ir_entry = - (struct IR_IO_APIC_route_entry *) entry; - int index; - - if (!iommu) - panic("No mapping iommu for ioapic %d\n", apic); - - index = alloc_irte(iommu, irq, 1); - if (index < 0) - panic("Failed to allocate IRTE for ioapic %d\n", apic); - - memset(&irte, 0, sizeof(irte)); - - irte.present = 1; - irte.dst_mode = INT_DEST_MODE; - irte.trigger_mode = trigger; - irte.dlvry_mode = INT_DELIVERY_MODE; - irte.vector = vector; - irte.dest_id = IRTE_DEST(destination); - - modify_irte(irq, &irte); - - ir_entry->index2 = (index >> 15) & 0x1; - ir_entry->zero = 0; - ir_entry->format = 1; - ir_entry->index = (index & 0x7fff); - } else -#endif - { - entry->delivery_mode = INT_DELIVERY_MODE; - entry->dest_mode = INT_DEST_MODE; - entry->dest = destination; - } - - entry->mask = 0; /* enable IRQ */ - entry->trigger = trigger; - entry->polarity = polarity; - entry->vector = vector; - - /* Mask level triggered irqs. - * Use IRQ_DELAYED_DISABLE for edge triggered irqs. - */ - if (trigger) - entry->mask = 1; - return 0; -} - -static void setup_IO_APIC_irq(int apic, int pin, unsigned int irq, struct irq_desc *desc, - int trigger, int polarity) -{ - struct irq_cfg *cfg; - struct IO_APIC_route_entry entry; - unsigned int dest; - - if (!IO_APIC_IRQ(irq)) - return; - - cfg = desc->chip_data; - - if (assign_irq_vector(irq, cfg, TARGET_CPUS)) - return; - - dest = cpu_mask_to_apicid_and(cfg->domain, TARGET_CPUS); - - apic_printk(APIC_VERBOSE,KERN_DEBUG - "IOAPIC[%d]: Set routing entry (%d-%d -> 0x%x -> " - "IRQ %d Mode:%i Active:%i)\n", - apic, mp_ioapics[apic].mp_apicid, pin, cfg->vector, - irq, trigger, polarity); - - - if (setup_ioapic_entry(mp_ioapics[apic].mp_apicid, irq, &entry, - dest, trigger, polarity, cfg->vector)) { - printk("Failed to setup ioapic entry for ioapic %d, pin %d\n", - mp_ioapics[apic].mp_apicid, pin); - __clear_irq_vector(irq, cfg); - return; - } - - ioapic_register_intr(irq, desc, trigger); - if (irq < NR_IRQS_LEGACY) - disable_8259A_irq(irq); - - ioapic_write_entry(apic, pin, entry); -} - -static void __init setup_IO_APIC_irqs(void) -{ - int apic, pin, idx, irq; - int notcon = 0; - struct irq_desc *desc; - struct irq_cfg *cfg; - int cpu = boot_cpu_id; - - apic_printk(APIC_VERBOSE, KERN_DEBUG "init IO_APIC IRQs\n"); - - for (apic = 0; apic < nr_ioapics; apic++) { - for (pin = 0; pin < nr_ioapic_registers[apic]; pin++) { - - idx = find_irq_entry(apic, pin, mp_INT); - if (idx == -1) { - if (!notcon) { - notcon = 1; - apic_printk(APIC_VERBOSE, - KERN_DEBUG " %d-%d", - mp_ioapics[apic].mp_apicid, - pin); - } else - apic_printk(APIC_VERBOSE, " %d-%d", - mp_ioapics[apic].mp_apicid, - pin); - continue; - } - if (notcon) { - apic_printk(APIC_VERBOSE, - " (apicid-pin) not connected\n"); - notcon = 0; - } - - irq = pin_2_irq(idx, apic, pin); -#ifdef CONFIG_X86_32 - if (multi_timer_check(apic, irq)) - continue; -#endif - desc = irq_to_desc_alloc_cpu(irq, cpu); - if (!desc) { - printk(KERN_INFO "can not get irq_desc for %d\n", irq); - continue; - } - cfg = desc->chip_data; - add_pin_to_irq_cpu(cfg, cpu, apic, pin); - - setup_IO_APIC_irq(apic, pin, irq, desc, - irq_trigger(idx), irq_polarity(idx)); - } - } - - if (notcon) - apic_printk(APIC_VERBOSE, - " (apicid-pin) not connected\n"); -} - -/* - * Set up the timer pin, possibly with the 8259A-master behind. - */ -static void __init setup_timer_IRQ0_pin(unsigned int apic, unsigned int pin, - int vector) -{ - struct IO_APIC_route_entry entry; - -#ifdef CONFIG_INTR_REMAP - if (intr_remapping_enabled) - return; -#endif - - memset(&entry, 0, sizeof(entry)); - - /* - * We use logical delivery to get the timer IRQ - * to the first CPU. - */ - entry.dest_mode = INT_DEST_MODE; - entry.mask = 1; /* mask IRQ now */ - entry.dest = cpu_mask_to_apicid(TARGET_CPUS); - entry.delivery_mode = INT_DELIVERY_MODE; - entry.polarity = 0; - entry.trigger = 0; - entry.vector = vector; - - /* - * The timer IRQ doesn't have to know that behind the - * scene we may have a 8259A-master in AEOI mode ... - */ - set_irq_chip_and_handler_name(0, &ioapic_chip, handle_edge_irq, "edge"); - - /* - * Add it to the IO-APIC irq-routing table: - */ - ioapic_write_entry(apic, pin, entry); -} - - -__apicdebuginit(void) print_IO_APIC(void) -{ - int apic, i; - union IO_APIC_reg_00 reg_00; - union IO_APIC_reg_01 reg_01; - union IO_APIC_reg_02 reg_02; - union IO_APIC_reg_03 reg_03; - unsigned long flags; - struct irq_cfg *cfg; - struct irq_desc *desc; - unsigned int irq; - - if (apic_verbosity == APIC_QUIET) - return; - - printk(KERN_DEBUG "number of MP IRQ sources: %d.\n", mp_irq_entries); - for (i = 0; i < nr_ioapics; i++) - printk(KERN_DEBUG "number of IO-APIC #%d registers: %d.\n", - mp_ioapics[i].mp_apicid, nr_ioapic_registers[i]); - - /* - * We are a bit conservative about what we expect. We have to - * know about every hardware change ASAP. - */ - printk(KERN_INFO "testing the IO APIC.......................\n"); - - for (apic = 0; apic < nr_ioapics; apic++) { - - spin_lock_irqsave(&ioapic_lock, flags); - reg_00.raw = io_apic_read(apic, 0); - reg_01.raw = io_apic_read(apic, 1); - if (reg_01.bits.version >= 0x10) - reg_02.raw = io_apic_read(apic, 2); - if (reg_01.bits.version >= 0x20) - reg_03.raw = io_apic_read(apic, 3); - spin_unlock_irqrestore(&ioapic_lock, flags); - - printk("\n"); - printk(KERN_DEBUG "IO APIC #%d......\n", mp_ioapics[apic].mp_apicid); - printk(KERN_DEBUG ".... register #00: %08X\n", reg_00.raw); - printk(KERN_DEBUG "....... : physical APIC id: %02X\n", reg_00.bits.ID); - printk(KERN_DEBUG "....... : Delivery Type: %X\n", reg_00.bits.delivery_type); - printk(KERN_DEBUG "....... : LTS : %X\n", reg_00.bits.LTS); - - printk(KERN_DEBUG ".... register #01: %08X\n", *(int *)®_01); - printk(KERN_DEBUG "....... : max redirection entries: %04X\n", reg_01.bits.entries); - - printk(KERN_DEBUG "....... : PRQ implemented: %X\n", reg_01.bits.PRQ); - printk(KERN_DEBUG "....... : IO APIC version: %04X\n", reg_01.bits.version); - - /* - * Some Intel chipsets with IO APIC VERSION of 0x1? don't have reg_02, - * but the value of reg_02 is read as the previous read register - * value, so ignore it if reg_02 == reg_01. - */ - if (reg_01.bits.version >= 0x10 && reg_02.raw != reg_01.raw) { - printk(KERN_DEBUG ".... register #02: %08X\n", reg_02.raw); - printk(KERN_DEBUG "....... : arbitration: %02X\n", reg_02.bits.arbitration); - } - - /* - * Some Intel chipsets with IO APIC VERSION of 0x2? don't have reg_02 - * or reg_03, but the value of reg_0[23] is read as the previous read - * register value, so ignore it if reg_03 == reg_0[12]. - */ - if (reg_01.bits.version >= 0x20 && reg_03.raw != reg_02.raw && - reg_03.raw != reg_01.raw) { - printk(KERN_DEBUG ".... register #03: %08X\n", reg_03.raw); - printk(KERN_DEBUG "....... : Boot DT : %X\n", reg_03.bits.boot_DT); - } - - printk(KERN_DEBUG ".... IRQ redirection table:\n"); - - printk(KERN_DEBUG " NR Dst Mask Trig IRR Pol" - " Stat Dmod Deli Vect: \n"); - - for (i = 0; i <= reg_01.bits.entries; i++) { - struct IO_APIC_route_entry entry; - - entry = ioapic_read_entry(apic, i); - - printk(KERN_DEBUG " %02x %03X ", - i, - entry.dest - ); - - printk("%1d %1d %1d %1d %1d %1d %1d %02X\n", - entry.mask, - entry.trigger, - entry.irr, - entry.polarity, - entry.delivery_status, - entry.dest_mode, - entry.delivery_mode, - entry.vector - ); - } - } - printk(KERN_DEBUG "IRQ to pin mappings:\n"); - for_each_irq_desc(irq, desc) { - struct irq_pin_list *entry; - - cfg = desc->chip_data; - entry = cfg->irq_2_pin; - if (!entry) - continue; - printk(KERN_DEBUG "IRQ%d ", irq); - for (;;) { - printk("-> %d:%d", entry->apic, entry->pin); - if (!entry->next) - break; - entry = entry->next; - } - printk("\n"); - } - - printk(KERN_INFO ".................................... done.\n"); - - return; -} - -__apicdebuginit(void) print_APIC_bitfield(int base) -{ - unsigned int v; - int i, j; - - if (apic_verbosity == APIC_QUIET) - return; - - printk(KERN_DEBUG "0123456789abcdef0123456789abcdef\n" KERN_DEBUG); - for (i = 0; i < 8; i++) { - v = apic_read(base + i*0x10); - for (j = 0; j < 32; j++) { - if (v & (1< 3) /* Due to the Pentium erratum 3AP. */ - apic_write(APIC_ESR, 0); - - v = apic_read(APIC_ESR); - printk(KERN_DEBUG "... APIC ESR: %08x\n", v); - } - - icr = apic_icr_read(); - printk(KERN_DEBUG "... APIC ICR: %08x\n", (u32)icr); - printk(KERN_DEBUG "... APIC ICR2: %08x\n", (u32)(icr >> 32)); - - v = apic_read(APIC_LVTT); - printk(KERN_DEBUG "... APIC LVTT: %08x\n", v); - - if (maxlvt > 3) { /* PC is LVT#4. */ - v = apic_read(APIC_LVTPC); - printk(KERN_DEBUG "... APIC LVTPC: %08x\n", v); - } - v = apic_read(APIC_LVT0); - printk(KERN_DEBUG "... APIC LVT0: %08x\n", v); - v = apic_read(APIC_LVT1); - printk(KERN_DEBUG "... APIC LVT1: %08x\n", v); - - if (maxlvt > 2) { /* ERR is LVT#3. */ - v = apic_read(APIC_LVTERR); - printk(KERN_DEBUG "... APIC LVTERR: %08x\n", v); - } - - v = apic_read(APIC_TMICT); - printk(KERN_DEBUG "... APIC TMICT: %08x\n", v); - v = apic_read(APIC_TMCCT); - printk(KERN_DEBUG "... APIC TMCCT: %08x\n", v); - v = apic_read(APIC_TDCR); - printk(KERN_DEBUG "... APIC TDCR: %08x\n", v); - printk("\n"); -} - -__apicdebuginit(void) print_all_local_APICs(void) -{ - int cpu; - - preempt_disable(); - for_each_online_cpu(cpu) - smp_call_function_single(cpu, print_local_APIC, NULL, 1); - preempt_enable(); -} - -__apicdebuginit(void) print_PIC(void) -{ - unsigned int v; - unsigned long flags; - - if (apic_verbosity == APIC_QUIET) - return; - - printk(KERN_DEBUG "\nprinting PIC contents\n"); - - spin_lock_irqsave(&i8259A_lock, flags); - - v = inb(0xa1) << 8 | inb(0x21); - printk(KERN_DEBUG "... PIC IMR: %04x\n", v); - - v = inb(0xa0) << 8 | inb(0x20); - printk(KERN_DEBUG "... PIC IRR: %04x\n", v); - - outb(0x0b,0xa0); - outb(0x0b,0x20); - v = inb(0xa0) << 8 | inb(0x20); - outb(0x0a,0xa0); - outb(0x0a,0x20); - - spin_unlock_irqrestore(&i8259A_lock, flags); - - printk(KERN_DEBUG "... PIC ISR: %04x\n", v); - - v = inb(0x4d1) << 8 | inb(0x4d0); - printk(KERN_DEBUG "... PIC ELCR: %04x\n", v); -} - -__apicdebuginit(int) print_all_ICs(void) -{ - print_PIC(); - print_all_local_APICs(); - print_IO_APIC(); - - return 0; -} - -fs_initcall(print_all_ICs); - - -/* Where if anywhere is the i8259 connect in external int mode */ -static struct { int pin, apic; } ioapic_i8259 = { -1, -1 }; - -void __init enable_IO_APIC(void) -{ - union IO_APIC_reg_01 reg_01; - int i8259_apic, i8259_pin; - int apic; - unsigned long flags; - -#ifdef CONFIG_X86_32 - int i; - if (!pirqs_enabled) - for (i = 0; i < MAX_PIRQS; i++) - pirq_entries[i] = -1; -#endif - - /* - * The number of IO-APIC IRQ registers (== #pins): - */ - for (apic = 0; apic < nr_ioapics; apic++) { - spin_lock_irqsave(&ioapic_lock, flags); - reg_01.raw = io_apic_read(apic, 1); - spin_unlock_irqrestore(&ioapic_lock, flags); - nr_ioapic_registers[apic] = reg_01.bits.entries+1; - } - for(apic = 0; apic < nr_ioapics; apic++) { - int pin; - /* See if any of the pins is in ExtINT mode */ - for (pin = 0; pin < nr_ioapic_registers[apic]; pin++) { - struct IO_APIC_route_entry entry; - entry = ioapic_read_entry(apic, pin); - - /* If the interrupt line is enabled and in ExtInt mode - * I have found the pin where the i8259 is connected. - */ - if ((entry.mask == 0) && (entry.delivery_mode == dest_ExtINT)) { - ioapic_i8259.apic = apic; - ioapic_i8259.pin = pin; - goto found_i8259; - } - } - } - found_i8259: - /* Look to see what if the MP table has reported the ExtINT */ - /* If we could not find the appropriate pin by looking at the ioapic - * the i8259 probably is not connected the ioapic but give the - * mptable a chance anyway. - */ - i8259_pin = find_isa_irq_pin(0, mp_ExtINT); - i8259_apic = find_isa_irq_apic(0, mp_ExtINT); - /* Trust the MP table if nothing is setup in the hardware */ - if ((ioapic_i8259.pin == -1) && (i8259_pin >= 0)) { - printk(KERN_WARNING "ExtINT not setup in hardware but reported by MP table\n"); - ioapic_i8259.pin = i8259_pin; - ioapic_i8259.apic = i8259_apic; - } - /* Complain if the MP table and the hardware disagree */ - if (((ioapic_i8259.apic != i8259_apic) || (ioapic_i8259.pin != i8259_pin)) && - (i8259_pin >= 0) && (ioapic_i8259.pin >= 0)) - { - printk(KERN_WARNING "ExtINT in hardware and MP table differ\n"); - } - - /* - * Do not trust the IO-APIC being empty at bootup - */ - clear_IO_APIC(); -} - -/* - * Not an __init, needed by the reboot code - */ -void disable_IO_APIC(void) -{ - /* - * Clear the IO-APIC before rebooting: - */ - clear_IO_APIC(); - - /* - * If the i8259 is routed through an IOAPIC - * Put that IOAPIC in virtual wire mode - * so legacy interrupts can be delivered. - */ - if (ioapic_i8259.pin != -1) { - struct IO_APIC_route_entry entry; - - memset(&entry, 0, sizeof(entry)); - entry.mask = 0; /* Enabled */ - entry.trigger = 0; /* Edge */ - entry.irr = 0; - entry.polarity = 0; /* High */ - entry.delivery_status = 0; - entry.dest_mode = 0; /* Physical */ - entry.delivery_mode = dest_ExtINT; /* ExtInt */ - entry.vector = 0; - entry.dest = read_apic_id(); - - /* - * Add it to the IO-APIC irq-routing table: - */ - ioapic_write_entry(ioapic_i8259.apic, ioapic_i8259.pin, entry); - } - - disconnect_bsp_APIC(ioapic_i8259.pin != -1); -} - -#ifdef CONFIG_X86_32 -/* - * function to set the IO-APIC physical IDs based on the - * values stored in the MPC table. - * - * by Matt Domsch Tue Dec 21 12:25:05 CST 1999 - */ - -static void __init setup_ioapic_ids_from_mpc(void) -{ - union IO_APIC_reg_00 reg_00; - physid_mask_t phys_id_present_map; - int apic; - int i; - unsigned char old_id; - unsigned long flags; - - if (x86_quirks->setup_ioapic_ids && x86_quirks->setup_ioapic_ids()) - return; - - /* - * Don't check I/O APIC IDs for xAPIC systems. They have - * no meaning without the serial APIC bus. - */ - if (!(boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) - || APIC_XAPIC(apic_version[boot_cpu_physical_apicid])) - return; - /* - * This is broken; anything with a real cpu count has to - * circumvent this idiocy regardless. - */ - phys_id_present_map = ioapic_phys_id_map(phys_cpu_present_map); - - /* - * Set the IOAPIC ID to the value stored in the MPC table. - */ - for (apic = 0; apic < nr_ioapics; apic++) { - - /* Read the register 0 value */ - spin_lock_irqsave(&ioapic_lock, flags); - reg_00.raw = io_apic_read(apic, 0); - spin_unlock_irqrestore(&ioapic_lock, flags); - - old_id = mp_ioapics[apic].mp_apicid; - - if (mp_ioapics[apic].mp_apicid >= get_physical_broadcast()) { - printk(KERN_ERR "BIOS bug, IO-APIC#%d ID is %d in the MPC table!...\n", - apic, mp_ioapics[apic].mp_apicid); - printk(KERN_ERR "... fixing up to %d. (tell your hw vendor)\n", - reg_00.bits.ID); - mp_ioapics[apic].mp_apicid = reg_00.bits.ID; - } - - /* - * Sanity check, is the ID really free? Every APIC in a - * system must have a unique ID or we get lots of nice - * 'stuck on smp_invalidate_needed IPI wait' messages. - */ - if (check_apicid_used(phys_id_present_map, - mp_ioapics[apic].mp_apicid)) { - printk(KERN_ERR "BIOS bug, IO-APIC#%d ID %d is already used!...\n", - apic, mp_ioapics[apic].mp_apicid); - for (i = 0; i < get_physical_broadcast(); i++) - if (!physid_isset(i, phys_id_present_map)) - break; - if (i >= get_physical_broadcast()) - panic("Max APIC ID exceeded!\n"); - printk(KERN_ERR "... fixing up to %d. (tell your hw vendor)\n", - i); - physid_set(i, phys_id_present_map); - mp_ioapics[apic].mp_apicid = i; - } else { - physid_mask_t tmp; - tmp = apicid_to_cpu_present(mp_ioapics[apic].mp_apicid); - apic_printk(APIC_VERBOSE, "Setting %d in the " - "phys_id_present_map\n", - mp_ioapics[apic].mp_apicid); - physids_or(phys_id_present_map, phys_id_present_map, tmp); - } - - - /* - * We need to adjust the IRQ routing table - * if the ID changed. - */ - if (old_id != mp_ioapics[apic].mp_apicid) - for (i = 0; i < mp_irq_entries; i++) - if (mp_irqs[i].mp_dstapic == old_id) - mp_irqs[i].mp_dstapic - = mp_ioapics[apic].mp_apicid; - - /* - * Read the right value from the MPC table and - * write it into the ID register. - */ - apic_printk(APIC_VERBOSE, KERN_INFO - "...changing IO-APIC physical APIC ID to %d ...", - mp_ioapics[apic].mp_apicid); - - reg_00.bits.ID = mp_ioapics[apic].mp_apicid; - spin_lock_irqsave(&ioapic_lock, flags); - io_apic_write(apic, 0, reg_00.raw); - spin_unlock_irqrestore(&ioapic_lock, flags); - - /* - * Sanity check - */ - spin_lock_irqsave(&ioapic_lock, flags); - reg_00.raw = io_apic_read(apic, 0); - spin_unlock_irqrestore(&ioapic_lock, flags); - if (reg_00.bits.ID != mp_ioapics[apic].mp_apicid) - printk("could not set ID!\n"); - else - apic_printk(APIC_VERBOSE, " ok.\n"); - } -} -#endif - -int no_timer_check __initdata; - -static int __init notimercheck(char *s) -{ - no_timer_check = 1; - return 1; -} -__setup("no_timer_check", notimercheck); - -/* - * There is a nasty bug in some older SMP boards, their mptable lies - * about the timer IRQ. We do the following to work around the situation: - * - * - timer IRQ defaults to IO-APIC IRQ - * - if this function detects that timer IRQs are defunct, then we fall - * back to ISA timer IRQs - */ -static int __init timer_irq_works(void) -{ - unsigned long t1 = jiffies; - unsigned long flags; - - if (no_timer_check) - return 1; - - local_save_flags(flags); - local_irq_enable(); - /* Let ten ticks pass... */ - mdelay((10 * 1000) / HZ); - local_irq_restore(flags); - - /* - * Expect a few ticks at least, to be sure some possible - * glue logic does not lock up after one or two first - * ticks in a non-ExtINT mode. Also the local APIC - * might have cached one ExtINT interrupt. Finally, at - * least one tick may be lost due to delays. - */ - - /* jiffies wrap? */ - if (time_after(jiffies, t1 + 4)) - return 1; - return 0; -} - -/* - * In the SMP+IOAPIC case it might happen that there are an unspecified - * number of pending IRQ events unhandled. These cases are very rare, - * so we 'resend' these IRQs via IPIs, to the same CPU. It's much - * better to do it this way as thus we do not have to be aware of - * 'pending' interrupts in the IRQ path, except at this point. - */ -/* - * Edge triggered needs to resend any interrupt - * that was delayed but this is now handled in the device - * independent code. - */ - -/* - * Starting up a edge-triggered IO-APIC interrupt is - * nasty - we need to make sure that we get the edge. - * If it is already asserted for some reason, we need - * return 1 to indicate that is was pending. - * - * This is not complete - we should be able to fake - * an edge even if it isn't on the 8259A... - */ - -static unsigned int startup_ioapic_irq(unsigned int irq) -{ - int was_pending = 0; - unsigned long flags; - struct irq_cfg *cfg; - - spin_lock_irqsave(&ioapic_lock, flags); - if (irq < NR_IRQS_LEGACY) { - disable_8259A_irq(irq); - if (i8259A_irq_pending(irq)) - was_pending = 1; - } - cfg = irq_cfg(irq); - __unmask_IO_APIC_irq(cfg); - spin_unlock_irqrestore(&ioapic_lock, flags); - - return was_pending; -} - -#ifdef CONFIG_X86_64 -static int ioapic_retrigger_irq(unsigned int irq) -{ - - struct irq_cfg *cfg = irq_cfg(irq); - unsigned long flags; - - spin_lock_irqsave(&vector_lock, flags); - send_IPI_mask(cpumask_of(cpumask_first(cfg->domain)), cfg->vector); - spin_unlock_irqrestore(&vector_lock, flags); - - return 1; -} -#else -static int ioapic_retrigger_irq(unsigned int irq) -{ - send_IPI_self(irq_cfg(irq)->vector); - - return 1; -} -#endif - -/* - * Level and edge triggered IO-APIC interrupts need different handling, - * so we use two separate IRQ descriptors. Edge triggered IRQs can be - * handled with the level-triggered descriptor, but that one has slightly - * more overhead. Level-triggered interrupts cannot be handled with the - * edge-triggered handler, without risking IRQ storms and other ugly - * races. - */ - -#ifdef CONFIG_SMP - -#ifdef CONFIG_INTR_REMAP -static void ir_irq_migration(struct work_struct *work); - -static DECLARE_DELAYED_WORK(ir_migration_work, ir_irq_migration); - -/* - * Migrate the IO-APIC irq in the presence of intr-remapping. - * - * For edge triggered, irq migration is a simple atomic update(of vector - * and cpu destination) of IRTE and flush the hardware cache. - * - * For level triggered, we need to modify the io-apic RTE aswell with the update - * vector information, along with modifying IRTE with vector and destination. - * So irq migration for level triggered is little bit more complex compared to - * edge triggered migration. But the good news is, we use the same algorithm - * for level triggered migration as we have today, only difference being, - * we now initiate the irq migration from process context instead of the - * interrupt context. - * - * In future, when we do a directed EOI (combined with cpu EOI broadcast - * suppression) to the IO-APIC, level triggered irq migration will also be - * as simple as edge triggered migration and we can do the irq migration - * with a simple atomic update to IO-APIC RTE. - */ -static void -migrate_ioapic_irq_desc(struct irq_desc *desc, const struct cpumask *mask) -{ - struct irq_cfg *cfg; - struct irte irte; - int modify_ioapic_rte; - unsigned int dest; - unsigned long flags; - unsigned int irq; - - if (!cpumask_intersects(mask, cpu_online_mask)) - return; - - irq = desc->irq; - if (get_irte(irq, &irte)) - return; - - cfg = desc->chip_data; - if (assign_irq_vector(irq, cfg, mask)) - return; - - set_extra_move_desc(desc, mask); - - dest = cpu_mask_to_apicid_and(cfg->domain, mask); - - modify_ioapic_rte = desc->status & IRQ_LEVEL; - if (modify_ioapic_rte) { - spin_lock_irqsave(&ioapic_lock, flags); - __target_IO_APIC_irq(irq, dest, cfg); - spin_unlock_irqrestore(&ioapic_lock, flags); - } - - irte.vector = cfg->vector; - irte.dest_id = IRTE_DEST(dest); - - /* - * Modified the IRTE and flushes the Interrupt entry cache. - */ - modify_irte(irq, &irte); - - if (cfg->move_in_progress) - send_cleanup_vector(cfg); - - cpumask_copy(&desc->affinity, mask); -} - -static int migrate_irq_remapped_level_desc(struct irq_desc *desc) -{ - int ret = -1; - struct irq_cfg *cfg = desc->chip_data; - - mask_IO_APIC_irq_desc(desc); - - if (io_apic_level_ack_pending(cfg)) { - /* - * Interrupt in progress. Migrating irq now will change the - * vector information in the IO-APIC RTE and that will confuse - * the EOI broadcast performed by cpu. - * So, delay the irq migration to the next instance. - */ - schedule_delayed_work(&ir_migration_work, 1); - goto unmask; - } - - /* everthing is clear. we have right of way */ - migrate_ioapic_irq_desc(desc, &desc->pending_mask); - - ret = 0; - desc->status &= ~IRQ_MOVE_PENDING; - cpumask_clear(&desc->pending_mask); - -unmask: - unmask_IO_APIC_irq_desc(desc); - - return ret; -} - -static void ir_irq_migration(struct work_struct *work) -{ - unsigned int irq; - struct irq_desc *desc; - - for_each_irq_desc(irq, desc) { - if (desc->status & IRQ_MOVE_PENDING) { - unsigned long flags; - - spin_lock_irqsave(&desc->lock, flags); - if (!desc->chip->set_affinity || - !(desc->status & IRQ_MOVE_PENDING)) { - desc->status &= ~IRQ_MOVE_PENDING; - spin_unlock_irqrestore(&desc->lock, flags); - continue; - } - - desc->chip->set_affinity(irq, &desc->pending_mask); - spin_unlock_irqrestore(&desc->lock, flags); - } - } -} - -/* - * Migrates the IRQ destination in the process context. - */ -static void set_ir_ioapic_affinity_irq_desc(struct irq_desc *desc, - const struct cpumask *mask) -{ - if (desc->status & IRQ_LEVEL) { - desc->status |= IRQ_MOVE_PENDING; - cpumask_copy(&desc->pending_mask, mask); - migrate_irq_remapped_level_desc(desc); - return; - } - - migrate_ioapic_irq_desc(desc, mask); -} -static void set_ir_ioapic_affinity_irq(unsigned int irq, - const struct cpumask *mask) -{ - struct irq_desc *desc = irq_to_desc(irq); - - set_ir_ioapic_affinity_irq_desc(desc, mask); -} -#endif - -asmlinkage void smp_irq_move_cleanup_interrupt(void) -{ - unsigned vector, me; - - ack_APIC_irq(); - exit_idle(); - irq_enter(); - - me = smp_processor_id(); - for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) { - unsigned int irq; - struct irq_desc *desc; - struct irq_cfg *cfg; - irq = __get_cpu_var(vector_irq)[vector]; - - if (irq == -1) - continue; - - desc = irq_to_desc(irq); - if (!desc) - continue; - - cfg = irq_cfg(irq); - spin_lock(&desc->lock); - if (!cfg->move_cleanup_count) - goto unlock; - - if (vector == cfg->vector && cpumask_test_cpu(me, cfg->domain)) - goto unlock; - - __get_cpu_var(vector_irq)[vector] = -1; - cfg->move_cleanup_count--; -unlock: - spin_unlock(&desc->lock); - } - - irq_exit(); -} - -static void irq_complete_move(struct irq_desc **descp) -{ - struct irq_desc *desc = *descp; - struct irq_cfg *cfg = desc->chip_data; - unsigned vector, me; - - if (likely(!cfg->move_in_progress)) { -#ifdef CONFIG_NUMA_MIGRATE_IRQ_DESC - if (likely(!cfg->move_desc_pending)) - return; - - /* domain has not changed, but affinity did */ - me = smp_processor_id(); - if (cpu_isset(me, desc->affinity)) { - *descp = desc = move_irq_desc(desc, me); - /* get the new one */ - cfg = desc->chip_data; - cfg->move_desc_pending = 0; - } -#endif - return; - } - - vector = ~get_irq_regs()->orig_ax; - me = smp_processor_id(); - - if (vector == cfg->vector && cpumask_test_cpu(me, cfg->domain)) { -#ifdef CONFIG_NUMA_MIGRATE_IRQ_DESC - *descp = desc = move_irq_desc(desc, me); - /* get the new one */ - cfg = desc->chip_data; -#endif - send_cleanup_vector(cfg); - } -} -#else -static inline void irq_complete_move(struct irq_desc **descp) {} -#endif - -#ifdef CONFIG_INTR_REMAP -static void ack_x2apic_level(unsigned int irq) -{ - ack_x2APIC_irq(); -} - -static void ack_x2apic_edge(unsigned int irq) -{ - ack_x2APIC_irq(); -} - -#endif - -static void ack_apic_edge(unsigned int irq) -{ - struct irq_desc *desc = irq_to_desc(irq); - - irq_complete_move(&desc); - move_native_irq(irq); - ack_APIC_irq(); -} - -atomic_t irq_mis_count; - -static void ack_apic_level(unsigned int irq) -{ - struct irq_desc *desc = irq_to_desc(irq); - -#ifdef CONFIG_X86_32 - unsigned long v; - int i; -#endif - struct irq_cfg *cfg; - int do_unmask_irq = 0; - - irq_complete_move(&desc); -#ifdef CONFIG_GENERIC_PENDING_IRQ - /* If we are moving the irq we need to mask it */ - if (unlikely(desc->status & IRQ_MOVE_PENDING)) { - do_unmask_irq = 1; - mask_IO_APIC_irq_desc(desc); - } -#endif - -#ifdef CONFIG_X86_32 - /* - * It appears there is an erratum which affects at least version 0x11 - * of I/O APIC (that's the 82093AA and cores integrated into various - * chipsets). Under certain conditions a level-triggered interrupt is - * erroneously delivered as edge-triggered one but the respective IRR - * bit gets set nevertheless. As a result the I/O unit expects an EOI - * message but it will never arrive and further interrupts are blocked - * from the source. The exact reason is so far unknown, but the - * phenomenon was observed when two consecutive interrupt requests - * from a given source get delivered to the same CPU and the source is - * temporarily disabled in between. - * - * A workaround is to simulate an EOI message manually. We achieve it - * by setting the trigger mode to edge and then to level when the edge - * trigger mode gets detected in the TMR of a local APIC for a - * level-triggered interrupt. We mask the source for the time of the - * operation to prevent an edge-triggered interrupt escaping meanwhile. - * The idea is from Manfred Spraul. --macro - */ - cfg = desc->chip_data; - i = cfg->vector; - - v = apic_read(APIC_TMR + ((i & ~0x1f) >> 1)); -#endif - - /* - * We must acknowledge the irq before we move it or the acknowledge will - * not propagate properly. - */ - ack_APIC_irq(); - - /* Now we can move and renable the irq */ - if (unlikely(do_unmask_irq)) { - /* Only migrate the irq if the ack has been received. - * - * On rare occasions the broadcast level triggered ack gets - * delayed going to ioapics, and if we reprogram the - * vector while Remote IRR is still set the irq will never - * fire again. - * - * To prevent this scenario we read the Remote IRR bit - * of the ioapic. This has two effects. - * - On any sane system the read of the ioapic will - * flush writes (and acks) going to the ioapic from - * this cpu. - * - We get to see if the ACK has actually been delivered. - * - * Based on failed experiments of reprogramming the - * ioapic entry from outside of irq context starting - * with masking the ioapic entry and then polling until - * Remote IRR was clear before reprogramming the - * ioapic I don't trust the Remote IRR bit to be - * completey accurate. - * - * However there appears to be no other way to plug - * this race, so if the Remote IRR bit is not - * accurate and is causing problems then it is a hardware bug - * and you can go talk to the chipset vendor about it. - */ - cfg = desc->chip_data; - if (!io_apic_level_ack_pending(cfg)) - move_masked_irq(irq); - unmask_IO_APIC_irq_desc(desc); - } - -#ifdef CONFIG_X86_32 - if (!(v & (1 << (i & 0x1f)))) { - atomic_inc(&irq_mis_count); - spin_lock(&ioapic_lock); - __mask_and_edge_IO_APIC_irq(cfg); - __unmask_and_level_IO_APIC_irq(cfg); - spin_unlock(&ioapic_lock); - } -#endif -} - -static struct irq_chip ioapic_chip __read_mostly = { - .name = "IO-APIC", - .startup = startup_ioapic_irq, - .mask = mask_IO_APIC_irq, - .unmask = unmask_IO_APIC_irq, - .ack = ack_apic_edge, - .eoi = ack_apic_level, -#ifdef CONFIG_SMP - .set_affinity = set_ioapic_affinity_irq, -#endif - .retrigger = ioapic_retrigger_irq, -}; - -#ifdef CONFIG_INTR_REMAP -static struct irq_chip ir_ioapic_chip __read_mostly = { - .name = "IR-IO-APIC", - .startup = startup_ioapic_irq, - .mask = mask_IO_APIC_irq, - .unmask = unmask_IO_APIC_irq, - .ack = ack_x2apic_edge, - .eoi = ack_x2apic_level, -#ifdef CONFIG_SMP - .set_affinity = set_ir_ioapic_affinity_irq, -#endif - .retrigger = ioapic_retrigger_irq, -}; -#endif - -static inline void init_IO_APIC_traps(void) -{ - int irq; - struct irq_desc *desc; - struct irq_cfg *cfg; - - /* - * NOTE! The local APIC isn't very good at handling - * multiple interrupts at the same interrupt level. - * As the interrupt level is determined by taking the - * vector number and shifting that right by 4, we - * want to spread these out a bit so that they don't - * all fall in the same interrupt level. - * - * Also, we've got to be careful not to trash gate - * 0x80, because int 0x80 is hm, kind of importantish. ;) - */ - for_each_irq_desc(irq, desc) { - cfg = desc->chip_data; - if (IO_APIC_IRQ(irq) && cfg && !cfg->vector) { - /* - * Hmm.. We don't have an entry for this, - * so default to an old-fashioned 8259 - * interrupt if we can.. - */ - if (irq < NR_IRQS_LEGACY) - make_8259A_irq(irq); - else - /* Strange. Oh, well.. */ - desc->chip = &no_irq_chip; - } - } -} - -/* - * The local APIC irq-chip implementation: - */ - -static void mask_lapic_irq(unsigned int irq) -{ - unsigned long v; - - v = apic_read(APIC_LVT0); - apic_write(APIC_LVT0, v | APIC_LVT_MASKED); -} - -static void unmask_lapic_irq(unsigned int irq) -{ - unsigned long v; - - v = apic_read(APIC_LVT0); - apic_write(APIC_LVT0, v & ~APIC_LVT_MASKED); -} - -static void ack_lapic_irq(unsigned int irq) -{ - ack_APIC_irq(); -} - -static struct irq_chip lapic_chip __read_mostly = { - .name = "local-APIC", - .mask = mask_lapic_irq, - .unmask = unmask_lapic_irq, - .ack = ack_lapic_irq, -}; - -static void lapic_register_intr(int irq, struct irq_desc *desc) -{ - desc->status &= ~IRQ_LEVEL; - set_irq_chip_and_handler_name(irq, &lapic_chip, handle_edge_irq, - "edge"); -} - -static void __init setup_nmi(void) -{ - /* - * Dirty trick to enable the NMI watchdog ... - * We put the 8259A master into AEOI mode and - * unmask on all local APICs LVT0 as NMI. - * - * The idea to use the 8259A in AEOI mode ('8259A Virtual Wire') - * is from Maciej W. Rozycki - so we do not have to EOI from - * the NMI handler or the timer interrupt. - */ - apic_printk(APIC_VERBOSE, KERN_INFO "activating NMI Watchdog ..."); - - enable_NMI_through_LVT0(); - - apic_printk(APIC_VERBOSE, " done.\n"); -} - -/* - * This looks a bit hackish but it's about the only one way of sending - * a few INTA cycles to 8259As and any associated glue logic. ICR does - * not support the ExtINT mode, unfortunately. We need to send these - * cycles as some i82489DX-based boards have glue logic that keeps the - * 8259A interrupt line asserted until INTA. --macro - */ -static inline void __init unlock_ExtINT_logic(void) -{ - int apic, pin, i; - struct IO_APIC_route_entry entry0, entry1; - unsigned char save_control, save_freq_select; - - pin = find_isa_irq_pin(8, mp_INT); - if (pin == -1) { - WARN_ON_ONCE(1); - return; - } - apic = find_isa_irq_apic(8, mp_INT); - if (apic == -1) { - WARN_ON_ONCE(1); - return; - } - - entry0 = ioapic_read_entry(apic, pin); - clear_IO_APIC_pin(apic, pin); - - memset(&entry1, 0, sizeof(entry1)); - - entry1.dest_mode = 0; /* physical delivery */ - entry1.mask = 0; /* unmask IRQ now */ - entry1.dest = hard_smp_processor_id(); - entry1.delivery_mode = dest_ExtINT; - entry1.polarity = entry0.polarity; - entry1.trigger = 0; - entry1.vector = 0; - - ioapic_write_entry(apic, pin, entry1); - - save_control = CMOS_READ(RTC_CONTROL); - save_freq_select = CMOS_READ(RTC_FREQ_SELECT); - CMOS_WRITE((save_freq_select & ~RTC_RATE_SELECT) | 0x6, - RTC_FREQ_SELECT); - CMOS_WRITE(save_control | RTC_PIE, RTC_CONTROL); - - i = 100; - while (i-- > 0) { - mdelay(10); - if ((CMOS_READ(RTC_INTR_FLAGS) & RTC_PF) == RTC_PF) - i -= 10; - } - - CMOS_WRITE(save_control, RTC_CONTROL); - CMOS_WRITE(save_freq_select, RTC_FREQ_SELECT); - clear_IO_APIC_pin(apic, pin); - - ioapic_write_entry(apic, pin, entry0); -} - -static int disable_timer_pin_1 __initdata; -/* Actually the next is obsolete, but keep it for paranoid reasons -AK */ -static int __init disable_timer_pin_setup(char *arg) -{ - disable_timer_pin_1 = 1; - return 0; -} -early_param("disable_timer_pin_1", disable_timer_pin_setup); - -int timer_through_8259 __initdata; - -/* - * This code may look a bit paranoid, but it's supposed to cooperate with - * a wide range of boards and BIOS bugs. Fortunately only the timer IRQ - * is so screwy. Thanks to Brian Perkins for testing/hacking this beast - * fanatically on his truly buggy board. - * - * FIXME: really need to revamp this for all platforms. - */ -static inline void __init check_timer(void) -{ - struct irq_desc *desc = irq_to_desc(0); - struct irq_cfg *cfg = desc->chip_data; - int cpu = boot_cpu_id; - int apic1, pin1, apic2, pin2; - unsigned long flags; - unsigned int ver; - int no_pin1 = 0; - - local_irq_save(flags); - - ver = apic_read(APIC_LVR); - ver = GET_APIC_VERSION(ver); - - /* - * get/set the timer IRQ vector: - */ - disable_8259A_irq(0); - assign_irq_vector(0, cfg, TARGET_CPUS); - - /* - * As IRQ0 is to be enabled in the 8259A, the virtual - * wire has to be disabled in the local APIC. Also - * timer interrupts need to be acknowledged manually in - * the 8259A for the i82489DX when using the NMI - * watchdog as that APIC treats NMIs as level-triggered. - * The AEOI mode will finish them in the 8259A - * automatically. - */ - apic_write(APIC_LVT0, APIC_LVT_MASKED | APIC_DM_EXTINT); - init_8259A(1); -#ifdef CONFIG_X86_32 - timer_ack = (nmi_watchdog == NMI_IO_APIC && !APIC_INTEGRATED(ver)); -#endif - - pin1 = find_isa_irq_pin(0, mp_INT); - apic1 = find_isa_irq_apic(0, mp_INT); - pin2 = ioapic_i8259.pin; - apic2 = ioapic_i8259.apic; - - apic_printk(APIC_QUIET, KERN_INFO "..TIMER: vector=0x%02X " - "apic1=%d pin1=%d apic2=%d pin2=%d\n", - cfg->vector, apic1, pin1, apic2, pin2); - - /* - * Some BIOS writers are clueless and report the ExtINTA - * I/O APIC input from the cascaded 8259A as the timer - * interrupt input. So just in case, if only one pin - * was found above, try it both directly and through the - * 8259A. - */ - if (pin1 == -1) { -#ifdef CONFIG_INTR_REMAP - if (intr_remapping_enabled) - panic("BIOS bug: timer not connected to IO-APIC"); -#endif - pin1 = pin2; - apic1 = apic2; - no_pin1 = 1; - } else if (pin2 == -1) { - pin2 = pin1; - apic2 = apic1; - } - - if (pin1 != -1) { - /* - * Ok, does IRQ0 through the IOAPIC work? - */ - if (no_pin1) { - add_pin_to_irq_cpu(cfg, cpu, apic1, pin1); - setup_timer_IRQ0_pin(apic1, pin1, cfg->vector); - } - unmask_IO_APIC_irq_desc(desc); - if (timer_irq_works()) { - if (nmi_watchdog == NMI_IO_APIC) { - setup_nmi(); - enable_8259A_irq(0); - } - if (disable_timer_pin_1 > 0) - clear_IO_APIC_pin(0, pin1); - goto out; - } -#ifdef CONFIG_INTR_REMAP - if (intr_remapping_enabled) - panic("timer doesn't work through Interrupt-remapped IO-APIC"); -#endif - clear_IO_APIC_pin(apic1, pin1); - if (!no_pin1) - apic_printk(APIC_QUIET, KERN_ERR "..MP-BIOS bug: " - "8254 timer not connected to IO-APIC\n"); - - apic_printk(APIC_QUIET, KERN_INFO "...trying to set up timer " - "(IRQ0) through the 8259A ...\n"); - apic_printk(APIC_QUIET, KERN_INFO - "..... (found apic %d pin %d) ...\n", apic2, pin2); - /* - * legacy devices should be connected to IO APIC #0 - */ - replace_pin_at_irq_cpu(cfg, cpu, apic1, pin1, apic2, pin2); - setup_timer_IRQ0_pin(apic2, pin2, cfg->vector); - unmask_IO_APIC_irq_desc(desc); - enable_8259A_irq(0); - if (timer_irq_works()) { - apic_printk(APIC_QUIET, KERN_INFO "....... works.\n"); - timer_through_8259 = 1; - if (nmi_watchdog == NMI_IO_APIC) { - disable_8259A_irq(0); - setup_nmi(); - enable_8259A_irq(0); - } - goto out; - } - /* - * Cleanup, just in case ... - */ - disable_8259A_irq(0); - clear_IO_APIC_pin(apic2, pin2); - apic_printk(APIC_QUIET, KERN_INFO "....... failed.\n"); - } - - if (nmi_watchdog == NMI_IO_APIC) { - apic_printk(APIC_QUIET, KERN_WARNING "timer doesn't work " - "through the IO-APIC - disabling NMI Watchdog!\n"); - nmi_watchdog = NMI_NONE; - } -#ifdef CONFIG_X86_32 - timer_ack = 0; -#endif - - apic_printk(APIC_QUIET, KERN_INFO - "...trying to set up timer as Virtual Wire IRQ...\n"); - - lapic_register_intr(0, desc); - apic_write(APIC_LVT0, APIC_DM_FIXED | cfg->vector); /* Fixed mode */ - enable_8259A_irq(0); - - if (timer_irq_works()) { - apic_printk(APIC_QUIET, KERN_INFO "..... works.\n"); - goto out; - } - disable_8259A_irq(0); - apic_write(APIC_LVT0, APIC_LVT_MASKED | APIC_DM_FIXED | cfg->vector); - apic_printk(APIC_QUIET, KERN_INFO "..... failed.\n"); - - apic_printk(APIC_QUIET, KERN_INFO - "...trying to set up timer as ExtINT IRQ...\n"); - - init_8259A(0); - make_8259A_irq(0); - apic_write(APIC_LVT0, APIC_DM_EXTINT); - - unlock_ExtINT_logic(); - - if (timer_irq_works()) { - apic_printk(APIC_QUIET, KERN_INFO "..... works.\n"); - goto out; - } - apic_printk(APIC_QUIET, KERN_INFO "..... failed :(.\n"); - panic("IO-APIC + timer doesn't work! Boot with apic=debug and send a " - "report. Then try booting with the 'noapic' option.\n"); -out: - local_irq_restore(flags); -} - -/* - * Traditionally ISA IRQ2 is the cascade IRQ, and is not available - * to devices. However there may be an I/O APIC pin available for - * this interrupt regardless. The pin may be left unconnected, but - * typically it will be reused as an ExtINT cascade interrupt for - * the master 8259A. In the MPS case such a pin will normally be - * reported as an ExtINT interrupt in the MP table. With ACPI - * there is no provision for ExtINT interrupts, and in the absence - * of an override it would be treated as an ordinary ISA I/O APIC - * interrupt, that is edge-triggered and unmasked by default. We - * used to do this, but it caused problems on some systems because - * of the NMI watchdog and sometimes IRQ0 of the 8254 timer using - * the same ExtINT cascade interrupt to drive the local APIC of the - * bootstrap processor. Therefore we refrain from routing IRQ2 to - * the I/O APIC in all cases now. No actual device should request - * it anyway. --macro - */ -#define PIC_IRQS (1 << PIC_CASCADE_IR) - -void __init setup_IO_APIC(void) -{ - -#ifdef CONFIG_X86_32 - enable_IO_APIC(); -#else - /* - * calling enable_IO_APIC() is moved to setup_local_APIC for BP - */ -#endif - - io_apic_irqs = ~PIC_IRQS; - - apic_printk(APIC_VERBOSE, "ENABLING IO-APIC IRQs\n"); - /* - * Set up IO-APIC IRQ routing. - */ -#ifdef CONFIG_X86_32 - if (!acpi_ioapic) - setup_ioapic_ids_from_mpc(); -#endif - sync_Arb_IDs(); - setup_IO_APIC_irqs(); - init_IO_APIC_traps(); - check_timer(); -} - -/* - * Called after all the initialization is done. If we didnt find any - * APIC bugs then we can allow the modify fast path - */ - -static int __init io_apic_bug_finalize(void) -{ - if (sis_apic_bug == -1) - sis_apic_bug = 0; - return 0; -} - -late_initcall(io_apic_bug_finalize); - -struct sysfs_ioapic_data { - struct sys_device dev; - struct IO_APIC_route_entry entry[0]; -}; -static struct sysfs_ioapic_data * mp_ioapic_data[MAX_IO_APICS]; - -static int ioapic_suspend(struct sys_device *dev, pm_message_t state) -{ - struct IO_APIC_route_entry *entry; - struct sysfs_ioapic_data *data; - int i; - - data = container_of(dev, struct sysfs_ioapic_data, dev); - entry = data->entry; - for (i = 0; i < nr_ioapic_registers[dev->id]; i ++, entry ++ ) - *entry = ioapic_read_entry(dev->id, i); - - return 0; -} - -static int ioapic_resume(struct sys_device *dev) -{ - struct IO_APIC_route_entry *entry; - struct sysfs_ioapic_data *data; - unsigned long flags; - union IO_APIC_reg_00 reg_00; - int i; - - data = container_of(dev, struct sysfs_ioapic_data, dev); - entry = data->entry; - - spin_lock_irqsave(&ioapic_lock, flags); - reg_00.raw = io_apic_read(dev->id, 0); - if (reg_00.bits.ID != mp_ioapics[dev->id].mp_apicid) { - reg_00.bits.ID = mp_ioapics[dev->id].mp_apicid; - io_apic_write(dev->id, 0, reg_00.raw); - } - spin_unlock_irqrestore(&ioapic_lock, flags); - for (i = 0; i < nr_ioapic_registers[dev->id]; i++) - ioapic_write_entry(dev->id, i, entry[i]); - - return 0; -} - -static struct sysdev_class ioapic_sysdev_class = { - .name = "ioapic", - .suspend = ioapic_suspend, - .resume = ioapic_resume, -}; - -static int __init ioapic_init_sysfs(void) -{ - struct sys_device * dev; - int i, size, error; - - error = sysdev_class_register(&ioapic_sysdev_class); - if (error) - return error; - - for (i = 0; i < nr_ioapics; i++ ) { - size = sizeof(struct sys_device) + nr_ioapic_registers[i] - * sizeof(struct IO_APIC_route_entry); - mp_ioapic_data[i] = kzalloc(size, GFP_KERNEL); - if (!mp_ioapic_data[i]) { - printk(KERN_ERR "Can't suspend/resume IOAPIC %d\n", i); - continue; - } - dev = &mp_ioapic_data[i]->dev; - dev->id = i; - dev->cls = &ioapic_sysdev_class; - error = sysdev_register(dev); - if (error) { - kfree(mp_ioapic_data[i]); - mp_ioapic_data[i] = NULL; - printk(KERN_ERR "Can't suspend/resume IOAPIC %d\n", i); - continue; - } - } - - return 0; -} - -device_initcall(ioapic_init_sysfs); - -/* - * Dynamic irq allocate and deallocation - */ -unsigned int create_irq_nr(unsigned int irq_want) -{ - /* Allocate an unused irq */ - unsigned int irq; - unsigned int new; - unsigned long flags; - struct irq_cfg *cfg_new = NULL; - int cpu = boot_cpu_id; - struct irq_desc *desc_new = NULL; - - irq = 0; - spin_lock_irqsave(&vector_lock, flags); - for (new = irq_want; new < NR_IRQS; new++) { - if (platform_legacy_irq(new)) - continue; - - desc_new = irq_to_desc_alloc_cpu(new, cpu); - if (!desc_new) { - printk(KERN_INFO "can not get irq_desc for %d\n", new); - continue; - } - cfg_new = desc_new->chip_data; - - if (cfg_new->vector != 0) - continue; - if (__assign_irq_vector(new, cfg_new, TARGET_CPUS) == 0) - irq = new; - break; - } - spin_unlock_irqrestore(&vector_lock, flags); - - if (irq > 0) { - dynamic_irq_init(irq); - /* restore it, in case dynamic_irq_init clear it */ - if (desc_new) - desc_new->chip_data = cfg_new; - } - return irq; -} - -static int nr_irqs_gsi = NR_IRQS_LEGACY; -int create_irq(void) -{ - unsigned int irq_want; - int irq; - - irq_want = nr_irqs_gsi; - irq = create_irq_nr(irq_want); - - if (irq == 0) - irq = -1; - - return irq; -} - -void destroy_irq(unsigned int irq) -{ - unsigned long flags; - struct irq_cfg *cfg; - struct irq_desc *desc; - - /* store it, in case dynamic_irq_cleanup clear it */ - desc = irq_to_desc(irq); - cfg = desc->chip_data; - dynamic_irq_cleanup(irq); - /* connect back irq_cfg */ - if (desc) - desc->chip_data = cfg; - -#ifdef CONFIG_INTR_REMAP - free_irte(irq); -#endif - spin_lock_irqsave(&vector_lock, flags); - __clear_irq_vector(irq, cfg); - spin_unlock_irqrestore(&vector_lock, flags); -} - -/* - * MSI message composition - */ -#ifdef CONFIG_PCI_MSI -static int msi_compose_msg(struct pci_dev *pdev, unsigned int irq, struct msi_msg *msg) -{ - struct irq_cfg *cfg; - int err; - unsigned dest; - - cfg = irq_cfg(irq); - err = assign_irq_vector(irq, cfg, TARGET_CPUS); - if (err) - return err; - - dest = cpu_mask_to_apicid_and(cfg->domain, TARGET_CPUS); - -#ifdef CONFIG_INTR_REMAP - if (irq_remapped(irq)) { - struct irte irte; - int ir_index; - u16 sub_handle; - - ir_index = map_irq_to_irte_handle(irq, &sub_handle); - BUG_ON(ir_index == -1); - - memset (&irte, 0, sizeof(irte)); - - irte.present = 1; - irte.dst_mode = INT_DEST_MODE; - irte.trigger_mode = 0; /* edge */ - irte.dlvry_mode = INT_DELIVERY_MODE; - irte.vector = cfg->vector; - irte.dest_id = IRTE_DEST(dest); - - modify_irte(irq, &irte); - - msg->address_hi = MSI_ADDR_BASE_HI; - msg->data = sub_handle; - msg->address_lo = MSI_ADDR_BASE_LO | MSI_ADDR_IR_EXT_INT | - MSI_ADDR_IR_SHV | - MSI_ADDR_IR_INDEX1(ir_index) | - MSI_ADDR_IR_INDEX2(ir_index); - } else -#endif - { - msg->address_hi = MSI_ADDR_BASE_HI; - msg->address_lo = - MSI_ADDR_BASE_LO | - ((INT_DEST_MODE == 0) ? - MSI_ADDR_DEST_MODE_PHYSICAL: - MSI_ADDR_DEST_MODE_LOGICAL) | - ((INT_DELIVERY_MODE != dest_LowestPrio) ? - MSI_ADDR_REDIRECTION_CPU: - MSI_ADDR_REDIRECTION_LOWPRI) | - MSI_ADDR_DEST_ID(dest); - - msg->data = - MSI_DATA_TRIGGER_EDGE | - MSI_DATA_LEVEL_ASSERT | - ((INT_DELIVERY_MODE != dest_LowestPrio) ? - MSI_DATA_DELIVERY_FIXED: - MSI_DATA_DELIVERY_LOWPRI) | - MSI_DATA_VECTOR(cfg->vector); - } - return err; -} - -#ifdef CONFIG_SMP -static void set_msi_irq_affinity(unsigned int irq, const struct cpumask *mask) -{ - struct irq_desc *desc = irq_to_desc(irq); - struct irq_cfg *cfg; - struct msi_msg msg; - unsigned int dest; - - dest = set_desc_affinity(desc, mask); - if (dest == BAD_APICID) - return; - - cfg = desc->chip_data; - - read_msi_msg_desc(desc, &msg); - - msg.data &= ~MSI_DATA_VECTOR_MASK; - msg.data |= MSI_DATA_VECTOR(cfg->vector); - msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK; - msg.address_lo |= MSI_ADDR_DEST_ID(dest); - - write_msi_msg_desc(desc, &msg); -} -#ifdef CONFIG_INTR_REMAP -/* - * Migrate the MSI irq to another cpumask. This migration is - * done in the process context using interrupt-remapping hardware. - */ -static void -ir_set_msi_irq_affinity(unsigned int irq, const struct cpumask *mask) -{ - struct irq_desc *desc = irq_to_desc(irq); - struct irq_cfg *cfg = desc->chip_data; - unsigned int dest; - struct irte irte; - - if (get_irte(irq, &irte)) - return; - - dest = set_desc_affinity(desc, mask); - if (dest == BAD_APICID) - return; - - irte.vector = cfg->vector; - irte.dest_id = IRTE_DEST(dest); - - /* - * atomically update the IRTE with the new destination and vector. - */ - modify_irte(irq, &irte); - - /* - * After this point, all the interrupts will start arriving - * at the new destination. So, time to cleanup the previous - * vector allocation. - */ - if (cfg->move_in_progress) - send_cleanup_vector(cfg); -} - -#endif -#endif /* CONFIG_SMP */ - -/* - * IRQ Chip for MSI PCI/PCI-X/PCI-Express Devices, - * which implement the MSI or MSI-X Capability Structure. - */ -static struct irq_chip msi_chip = { - .name = "PCI-MSI", - .unmask = unmask_msi_irq, - .mask = mask_msi_irq, - .ack = ack_apic_edge, -#ifdef CONFIG_SMP - .set_affinity = set_msi_irq_affinity, -#endif - .retrigger = ioapic_retrigger_irq, -}; - -#ifdef CONFIG_INTR_REMAP -static struct irq_chip msi_ir_chip = { - .name = "IR-PCI-MSI", - .unmask = unmask_msi_irq, - .mask = mask_msi_irq, - .ack = ack_x2apic_edge, -#ifdef CONFIG_SMP - .set_affinity = ir_set_msi_irq_affinity, -#endif - .retrigger = ioapic_retrigger_irq, -}; - -/* - * Map the PCI dev to the corresponding remapping hardware unit - * and allocate 'nvec' consecutive interrupt-remapping table entries - * in it. - */ -static int msi_alloc_irte(struct pci_dev *dev, int irq, int nvec) -{ - struct intel_iommu *iommu; - int index; - - iommu = map_dev_to_ir(dev); - if (!iommu) { - printk(KERN_ERR - "Unable to map PCI %s to iommu\n", pci_name(dev)); - return -ENOENT; - } - - index = alloc_irte(iommu, irq, nvec); - if (index < 0) { - printk(KERN_ERR - "Unable to allocate %d IRTE for PCI %s\n", nvec, - pci_name(dev)); - return -ENOSPC; - } - return index; -} -#endif - -static int setup_msi_irq(struct pci_dev *dev, struct msi_desc *msidesc, int irq) -{ - int ret; - struct msi_msg msg; - - ret = msi_compose_msg(dev, irq, &msg); - if (ret < 0) - return ret; - - set_irq_msi(irq, msidesc); - write_msi_msg(irq, &msg); - -#ifdef CONFIG_INTR_REMAP - if (irq_remapped(irq)) { - struct irq_desc *desc = irq_to_desc(irq); - /* - * irq migration in process context - */ - desc->status |= IRQ_MOVE_PCNTXT; - set_irq_chip_and_handler_name(irq, &msi_ir_chip, handle_edge_irq, "edge"); - } else -#endif - set_irq_chip_and_handler_name(irq, &msi_chip, handle_edge_irq, "edge"); - - dev_printk(KERN_DEBUG, &dev->dev, "irq %d for MSI/MSI-X\n", irq); - - return 0; -} - -int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *msidesc) -{ - unsigned int irq; - int ret; - unsigned int irq_want; - - irq_want = nr_irqs_gsi; - irq = create_irq_nr(irq_want); - if (irq == 0) - return -1; - -#ifdef CONFIG_INTR_REMAP - if (!intr_remapping_enabled) - goto no_ir; - - ret = msi_alloc_irte(dev, irq, 1); - if (ret < 0) - goto error; -no_ir: -#endif - ret = setup_msi_irq(dev, msidesc, irq); - if (ret < 0) { - destroy_irq(irq); - return ret; - } - return 0; - -#ifdef CONFIG_INTR_REMAP -error: - destroy_irq(irq); - return ret; -#endif -} - -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) -{ - unsigned int irq; - int ret, sub_handle; - struct msi_desc *msidesc; - unsigned int irq_want; - -#ifdef CONFIG_INTR_REMAP - struct intel_iommu *iommu = 0; - int index = 0; -#endif - - irq_want = nr_irqs_gsi; - sub_handle = 0; - list_for_each_entry(msidesc, &dev->msi_list, list) { - irq = create_irq_nr(irq_want); - irq_want++; - if (irq == 0) - return -1; -#ifdef CONFIG_INTR_REMAP - if (!intr_remapping_enabled) - goto no_ir; - - if (!sub_handle) { - /* - * allocate the consecutive block of IRTE's - * for 'nvec' - */ - index = msi_alloc_irte(dev, irq, nvec); - if (index < 0) { - ret = index; - goto error; - } - } else { - iommu = map_dev_to_ir(dev); - if (!iommu) { - ret = -ENOENT; - goto error; - } - /* - * setup the mapping between the irq and the IRTE - * base index, the sub_handle pointing to the - * appropriate interrupt remap table entry. - */ - set_irte_irq(irq, iommu, index, sub_handle); - } -no_ir: -#endif - ret = setup_msi_irq(dev, msidesc, irq); - if (ret < 0) - goto error; - sub_handle++; - } - return 0; - -error: - destroy_irq(irq); - return ret; -} - -void arch_teardown_msi_irq(unsigned int irq) -{ - destroy_irq(irq); -} - -#ifdef CONFIG_DMAR -#ifdef CONFIG_SMP -static void dmar_msi_set_affinity(unsigned int irq, const struct cpumask *mask) -{ - struct irq_desc *desc = irq_to_desc(irq); - struct irq_cfg *cfg; - struct msi_msg msg; - unsigned int dest; - - dest = set_desc_affinity(desc, mask); - if (dest == BAD_APICID) - return; - - cfg = desc->chip_data; - - dmar_msi_read(irq, &msg); - - msg.data &= ~MSI_DATA_VECTOR_MASK; - msg.data |= MSI_DATA_VECTOR(cfg->vector); - msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK; - msg.address_lo |= MSI_ADDR_DEST_ID(dest); - - dmar_msi_write(irq, &msg); -} - -#endif /* CONFIG_SMP */ - -struct irq_chip dmar_msi_type = { - .name = "DMAR_MSI", - .unmask = dmar_msi_unmask, - .mask = dmar_msi_mask, - .ack = ack_apic_edge, -#ifdef CONFIG_SMP - .set_affinity = dmar_msi_set_affinity, -#endif - .retrigger = ioapic_retrigger_irq, -}; - -int arch_setup_dmar_msi(unsigned int irq) -{ - int ret; - struct msi_msg msg; - - ret = msi_compose_msg(NULL, irq, &msg); - if (ret < 0) - return ret; - dmar_msi_write(irq, &msg); - set_irq_chip_and_handler_name(irq, &dmar_msi_type, handle_edge_irq, - "edge"); - return 0; -} -#endif - -#ifdef CONFIG_HPET_TIMER - -#ifdef CONFIG_SMP -static void hpet_msi_set_affinity(unsigned int irq, const struct cpumask *mask) -{ - struct irq_desc *desc = irq_to_desc(irq); - struct irq_cfg *cfg; - struct msi_msg msg; - unsigned int dest; - - dest = set_desc_affinity(desc, mask); - if (dest == BAD_APICID) - return; - - cfg = desc->chip_data; - - hpet_msi_read(irq, &msg); - - msg.data &= ~MSI_DATA_VECTOR_MASK; - msg.data |= MSI_DATA_VECTOR(cfg->vector); - msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK; - msg.address_lo |= MSI_ADDR_DEST_ID(dest); - - hpet_msi_write(irq, &msg); -} - -#endif /* CONFIG_SMP */ - -struct irq_chip hpet_msi_type = { - .name = "HPET_MSI", - .unmask = hpet_msi_unmask, - .mask = hpet_msi_mask, - .ack = ack_apic_edge, -#ifdef CONFIG_SMP - .set_affinity = hpet_msi_set_affinity, -#endif - .retrigger = ioapic_retrigger_irq, -}; - -int arch_setup_hpet_msi(unsigned int irq) -{ - int ret; - struct msi_msg msg; - - ret = msi_compose_msg(NULL, irq, &msg); - if (ret < 0) - return ret; - - hpet_msi_write(irq, &msg); - set_irq_chip_and_handler_name(irq, &hpet_msi_type, handle_edge_irq, - "edge"); - - return 0; -} -#endif - -#endif /* CONFIG_PCI_MSI */ -/* - * Hypertransport interrupt support - */ -#ifdef CONFIG_HT_IRQ - -#ifdef CONFIG_SMP - -static void target_ht_irq(unsigned int irq, unsigned int dest, u8 vector) -{ - struct ht_irq_msg msg; - fetch_ht_irq_msg(irq, &msg); - - msg.address_lo &= ~(HT_IRQ_LOW_VECTOR_MASK | HT_IRQ_LOW_DEST_ID_MASK); - msg.address_hi &= ~(HT_IRQ_HIGH_DEST_ID_MASK); - - msg.address_lo |= HT_IRQ_LOW_VECTOR(vector) | HT_IRQ_LOW_DEST_ID(dest); - msg.address_hi |= HT_IRQ_HIGH_DEST_ID(dest); - - write_ht_irq_msg(irq, &msg); -} - -static void set_ht_irq_affinity(unsigned int irq, const struct cpumask *mask) -{ - struct irq_desc *desc = irq_to_desc(irq); - struct irq_cfg *cfg; - unsigned int dest; - - dest = set_desc_affinity(desc, mask); - if (dest == BAD_APICID) - return; - - cfg = desc->chip_data; - - target_ht_irq(irq, dest, cfg->vector); -} - -#endif - -static struct irq_chip ht_irq_chip = { - .name = "PCI-HT", - .mask = mask_ht_irq, - .unmask = unmask_ht_irq, - .ack = ack_apic_edge, -#ifdef CONFIG_SMP - .set_affinity = set_ht_irq_affinity, -#endif - .retrigger = ioapic_retrigger_irq, -}; - -int arch_setup_ht_irq(unsigned int irq, struct pci_dev *dev) -{ - struct irq_cfg *cfg; - int err; - - cfg = irq_cfg(irq); - err = assign_irq_vector(irq, cfg, TARGET_CPUS); - if (!err) { - struct ht_irq_msg msg; - unsigned dest; - - dest = cpu_mask_to_apicid_and(cfg->domain, TARGET_CPUS); - - msg.address_hi = HT_IRQ_HIGH_DEST_ID(dest); - - msg.address_lo = - HT_IRQ_LOW_BASE | - HT_IRQ_LOW_DEST_ID(dest) | - HT_IRQ_LOW_VECTOR(cfg->vector) | - ((INT_DEST_MODE == 0) ? - HT_IRQ_LOW_DM_PHYSICAL : - HT_IRQ_LOW_DM_LOGICAL) | - HT_IRQ_LOW_RQEOI_EDGE | - ((INT_DELIVERY_MODE != dest_LowestPrio) ? - HT_IRQ_LOW_MT_FIXED : - HT_IRQ_LOW_MT_ARBITRATED) | - HT_IRQ_LOW_IRQ_MASKED; - - write_ht_irq_msg(irq, &msg); - - set_irq_chip_and_handler_name(irq, &ht_irq_chip, - handle_edge_irq, "edge"); - - dev_printk(KERN_DEBUG, &dev->dev, "irq %d for HT\n", irq); - } - return err; -} -#endif /* CONFIG_HT_IRQ */ - -#ifdef CONFIG_X86_64 -/* - * Re-target the irq to the specified CPU and enable the specified MMR located - * on the specified blade to allow the sending of MSIs to the specified CPU. - */ -int arch_enable_uv_irq(char *irq_name, unsigned int irq, int cpu, int mmr_blade, - unsigned long mmr_offset) -{ - const struct cpumask *eligible_cpu = cpumask_of(cpu); - struct irq_cfg *cfg; - int mmr_pnode; - unsigned long mmr_value; - struct uv_IO_APIC_route_entry *entry; - unsigned long flags; - int err; - - cfg = irq_cfg(irq); - - err = assign_irq_vector(irq, cfg, eligible_cpu); - if (err != 0) - return err; - - spin_lock_irqsave(&vector_lock, flags); - set_irq_chip_and_handler_name(irq, &uv_irq_chip, handle_percpu_irq, - irq_name); - spin_unlock_irqrestore(&vector_lock, flags); - - mmr_value = 0; - entry = (struct uv_IO_APIC_route_entry *)&mmr_value; - BUG_ON(sizeof(struct uv_IO_APIC_route_entry) != sizeof(unsigned long)); - - entry->vector = cfg->vector; - entry->delivery_mode = INT_DELIVERY_MODE; - entry->dest_mode = INT_DEST_MODE; - entry->polarity = 0; - entry->trigger = 0; - entry->mask = 0; - entry->dest = cpu_mask_to_apicid(eligible_cpu); - - mmr_pnode = uv_blade_to_pnode(mmr_blade); - uv_write_global_mmr64(mmr_pnode, mmr_offset, mmr_value); - - return irq; -} - -/* - * Disable the specified MMR located on the specified blade so that MSIs are - * longer allowed to be sent. - */ -void arch_disable_uv_irq(int mmr_blade, unsigned long mmr_offset) -{ - unsigned long mmr_value; - struct uv_IO_APIC_route_entry *entry; - int mmr_pnode; - - mmr_value = 0; - entry = (struct uv_IO_APIC_route_entry *)&mmr_value; - BUG_ON(sizeof(struct uv_IO_APIC_route_entry) != sizeof(unsigned long)); - - entry->mask = 1; - - mmr_pnode = uv_blade_to_pnode(mmr_blade); - uv_write_global_mmr64(mmr_pnode, mmr_offset, mmr_value); -} -#endif /* CONFIG_X86_64 */ - -int __init io_apic_get_redir_entries (int ioapic) -{ - union IO_APIC_reg_01 reg_01; - unsigned long flags; - - spin_lock_irqsave(&ioapic_lock, flags); - reg_01.raw = io_apic_read(ioapic, 1); - spin_unlock_irqrestore(&ioapic_lock, flags); - - return reg_01.bits.entries; -} - -void __init probe_nr_irqs_gsi(void) -{ - int nr = 0; - - nr = acpi_probe_gsi(); - if (nr > nr_irqs_gsi) { - nr_irqs_gsi = nr; - } else { - /* for acpi=off or acpi is not compiled in */ - int idx; - - nr = 0; - for (idx = 0; idx < nr_ioapics; idx++) - nr += io_apic_get_redir_entries(idx) + 1; - - if (nr > nr_irqs_gsi) - nr_irqs_gsi = nr; - } - - printk(KERN_DEBUG "nr_irqs_gsi: %d\n", nr_irqs_gsi); -} - -/* -------------------------------------------------------------------------- - ACPI-based IOAPIC Configuration - -------------------------------------------------------------------------- */ - -#ifdef CONFIG_ACPI - -#ifdef CONFIG_X86_32 -int __init io_apic_get_unique_id(int ioapic, int apic_id) -{ - union IO_APIC_reg_00 reg_00; - static physid_mask_t apic_id_map = PHYSID_MASK_NONE; - physid_mask_t tmp; - unsigned long flags; - int i = 0; - - /* - * The P4 platform supports up to 256 APIC IDs on two separate APIC - * buses (one for LAPICs, one for IOAPICs), where predecessors only - * supports up to 16 on one shared APIC bus. - * - * TBD: Expand LAPIC/IOAPIC support on P4-class systems to take full - * advantage of new APIC bus architecture. - */ - - if (physids_empty(apic_id_map)) - apic_id_map = ioapic_phys_id_map(phys_cpu_present_map); - - spin_lock_irqsave(&ioapic_lock, flags); - reg_00.raw = io_apic_read(ioapic, 0); - spin_unlock_irqrestore(&ioapic_lock, flags); - - if (apic_id >= get_physical_broadcast()) { - printk(KERN_WARNING "IOAPIC[%d]: Invalid apic_id %d, trying " - "%d\n", ioapic, apic_id, reg_00.bits.ID); - apic_id = reg_00.bits.ID; - } - - /* - * Every APIC in a system must have a unique ID or we get lots of nice - * 'stuck on smp_invalidate_needed IPI wait' messages. - */ - if (check_apicid_used(apic_id_map, apic_id)) { - - for (i = 0; i < get_physical_broadcast(); i++) { - if (!check_apicid_used(apic_id_map, i)) - break; - } - - if (i == get_physical_broadcast()) - panic("Max apic_id exceeded!\n"); - - printk(KERN_WARNING "IOAPIC[%d]: apic_id %d already used, " - "trying %d\n", ioapic, apic_id, i); - - apic_id = i; - } - - tmp = apicid_to_cpu_present(apic_id); - physids_or(apic_id_map, apic_id_map, tmp); - - if (reg_00.bits.ID != apic_id) { - reg_00.bits.ID = apic_id; - - spin_lock_irqsave(&ioapic_lock, flags); - io_apic_write(ioapic, 0, reg_00.raw); - reg_00.raw = io_apic_read(ioapic, 0); - spin_unlock_irqrestore(&ioapic_lock, flags); - - /* Sanity check */ - if (reg_00.bits.ID != apic_id) { - printk("IOAPIC[%d]: Unable to change apic_id!\n", ioapic); - return -1; - } - } - - apic_printk(APIC_VERBOSE, KERN_INFO - "IOAPIC[%d]: Assigned apic_id %d\n", ioapic, apic_id); - - return apic_id; -} - -int __init io_apic_get_version(int ioapic) -{ - union IO_APIC_reg_01 reg_01; - unsigned long flags; - - spin_lock_irqsave(&ioapic_lock, flags); - reg_01.raw = io_apic_read(ioapic, 1); - spin_unlock_irqrestore(&ioapic_lock, flags); - - return reg_01.bits.version; -} -#endif - -int io_apic_set_pci_routing (int ioapic, int pin, int irq, int triggering, int polarity) -{ - struct irq_desc *desc; - struct irq_cfg *cfg; - int cpu = boot_cpu_id; - - if (!IO_APIC_IRQ(irq)) { - apic_printk(APIC_QUIET,KERN_ERR "IOAPIC[%d]: Invalid reference to IRQ 0\n", - ioapic); - return -EINVAL; - } - - desc = irq_to_desc_alloc_cpu(irq, cpu); - if (!desc) { - printk(KERN_INFO "can not get irq_desc %d\n", irq); - return 0; - } - - /* - * IRQs < 16 are already in the irq_2_pin[] map - */ - if (irq >= NR_IRQS_LEGACY) { - cfg = desc->chip_data; - add_pin_to_irq_cpu(cfg, cpu, ioapic, pin); - } - - setup_IO_APIC_irq(ioapic, pin, irq, desc, triggering, polarity); - - return 0; -} - - -int acpi_get_override_irq(int bus_irq, int *trigger, int *polarity) -{ - int i; - - if (skip_ioapic_setup) - return -1; - - for (i = 0; i < mp_irq_entries; i++) - if (mp_irqs[i].mp_irqtype == mp_INT && - mp_irqs[i].mp_srcbusirq == bus_irq) - break; - if (i >= mp_irq_entries) - return -1; - - *trigger = irq_trigger(i); - *polarity = irq_polarity(i); - return 0; -} - -#endif /* CONFIG_ACPI */ - -/* - * This function currently is only a helper for the i386 smp boot process where - * we need to reprogram the ioredtbls to cater for the cpus which have come online - * so mask in all cases should simply be TARGET_CPUS - */ -#ifdef CONFIG_SMP -void __init setup_ioapic_dest(void) -{ - int pin, ioapic, irq, irq_entry; - struct irq_desc *desc; - struct irq_cfg *cfg; - const struct cpumask *mask; - - if (skip_ioapic_setup == 1) - return; - - for (ioapic = 0; ioapic < nr_ioapics; ioapic++) { - for (pin = 0; pin < nr_ioapic_registers[ioapic]; pin++) { - irq_entry = find_irq_entry(ioapic, pin, mp_INT); - if (irq_entry == -1) - continue; - irq = pin_2_irq(irq_entry, ioapic, pin); - - /* setup_IO_APIC_irqs could fail to get vector for some device - * when you have too many devices, because at that time only boot - * cpu is online. - */ - desc = irq_to_desc(irq); - cfg = desc->chip_data; - if (!cfg->vector) { - setup_IO_APIC_irq(ioapic, pin, irq, desc, - irq_trigger(irq_entry), - irq_polarity(irq_entry)); - continue; - - } - - /* - * Honour affinities which have been set in early boot - */ - if (desc->status & - (IRQ_NO_BALANCING | IRQ_AFFINITY_SET)) - mask = &desc->affinity; - else - mask = TARGET_CPUS; - -#ifdef CONFIG_INTR_REMAP - if (intr_remapping_enabled) - set_ir_ioapic_affinity_irq_desc(desc, mask); - else -#endif - set_ioapic_affinity_irq_desc(desc, mask); - } - - } -} -#endif - -#define IOAPIC_RESOURCE_NAME_SIZE 11 - -static struct resource *ioapic_resources; - -static struct resource * __init ioapic_setup_resources(void) -{ - unsigned long n; - struct resource *res; - char *mem; - int i; - - if (nr_ioapics <= 0) - return NULL; - - n = IOAPIC_RESOURCE_NAME_SIZE + sizeof(struct resource); - n *= nr_ioapics; - - mem = alloc_bootmem(n); - res = (void *)mem; - - if (mem != NULL) { - mem += sizeof(struct resource) * nr_ioapics; - - for (i = 0; i < nr_ioapics; i++) { - res[i].name = mem; - res[i].flags = IORESOURCE_MEM | IORESOURCE_BUSY; - sprintf(mem, "IOAPIC %u", i); - mem += IOAPIC_RESOURCE_NAME_SIZE; - } - } - - ioapic_resources = res; - - return res; -} - -void __init ioapic_init_mappings(void) -{ - unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0; - struct resource *ioapic_res; - int i; - - ioapic_res = ioapic_setup_resources(); - for (i = 0; i < nr_ioapics; i++) { - if (smp_found_config) { - ioapic_phys = mp_ioapics[i].mp_apicaddr; -#ifdef CONFIG_X86_32 - if (!ioapic_phys) { - printk(KERN_ERR - "WARNING: bogus zero IO-APIC " - "address found in MPTABLE, " - "disabling IO/APIC support!\n"); - smp_found_config = 0; - skip_ioapic_setup = 1; - goto fake_ioapic_page; - } -#endif - } else { -#ifdef CONFIG_X86_32 -fake_ioapic_page: -#endif - ioapic_phys = (unsigned long) - alloc_bootmem_pages(PAGE_SIZE); - ioapic_phys = __pa(ioapic_phys); - } - set_fixmap_nocache(idx, ioapic_phys); - apic_printk(APIC_VERBOSE, - "mapped IOAPIC to %08lx (%08lx)\n", - __fix_to_virt(idx), ioapic_phys); - idx++; - - if (ioapic_res != NULL) { - ioapic_res->start = ioapic_phys; - ioapic_res->end = ioapic_phys + (4 * 1024) - 1; - ioapic_res++; - } - } -} - -static int __init ioapic_insert_resources(void) -{ - int i; - struct resource *r = ioapic_resources; - - if (!r) { - printk(KERN_ERR - "IO APIC resources could be not be allocated.\n"); - return -1; - } - - for (i = 0; i < nr_ioapics; i++) { - insert_resource(&iomem_resource, r); - r++; - } - - return 0; -} - -/* Insert the IO APIC resources after PCI initialization has occured to handle - * IO APICS that are mapped in on a BAR in PCI space. */ -late_initcall(ioapic_insert_resources); Index: linux-2.6-tip/arch/x86/kernel/ioport.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/ioport.c +++ linux-2.6-tip/arch/x86/kernel/ioport.c @@ -131,9 +131,8 @@ static int do_iopl(unsigned int level, s } #ifdef CONFIG_X86_32 -asmlinkage long sys_iopl(unsigned long regsp) +long sys_iopl(struct pt_regs *regs) { - struct pt_regs *regs = (struct pt_regs *)®sp; unsigned int level = regs->bx; struct thread_struct *t = ¤t->thread; int rc; Index: linux-2.6-tip/arch/x86/kernel/ipi.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/ipi.c +++ /dev/null @@ -1,190 +0,0 @@ -#include -#include -#include - -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include -#include -#include -#include -#include - -#ifdef CONFIG_X86_32 -#include -#include - -/* - * the following functions deal with sending IPIs between CPUs. - * - * We use 'broadcast', CPU->CPU IPIs and self-IPIs too. - */ - -static inline int __prepare_ICR(unsigned int shortcut, int vector) -{ - unsigned int icr = shortcut | APIC_DEST_LOGICAL; - - switch (vector) { - default: - icr |= APIC_DM_FIXED | vector; - break; - case NMI_VECTOR: - icr |= APIC_DM_NMI; - break; - } - return icr; -} - -static inline int __prepare_ICR2(unsigned int mask) -{ - return SET_APIC_DEST_FIELD(mask); -} - -void __send_IPI_shortcut(unsigned int shortcut, int vector) -{ - /* - * Subtle. In the case of the 'never do double writes' workaround - * we have to lock out interrupts to be safe. As we don't care - * of the value read we use an atomic rmw access to avoid costly - * cli/sti. Otherwise we use an even cheaper single atomic write - * to the APIC. - */ - unsigned int cfg; - - /* - * Wait for idle. - */ - apic_wait_icr_idle(); - - /* - * No need to touch the target chip field - */ - cfg = __prepare_ICR(shortcut, vector); - - /* - * Send the IPI. The write to APIC_ICR fires this off. - */ - apic_write(APIC_ICR, cfg); -} - -void send_IPI_self(int vector) -{ - __send_IPI_shortcut(APIC_DEST_SELF, vector); -} - -/* - * This is used to send an IPI with no shorthand notation (the destination is - * specified in bits 56 to 63 of the ICR). - */ -static inline void __send_IPI_dest_field(unsigned long mask, int vector) -{ - unsigned long cfg; - - /* - * Wait for idle. - */ - if (unlikely(vector == NMI_VECTOR)) - safe_apic_wait_icr_idle(); - else - apic_wait_icr_idle(); - - /* - * prepare target chip field - */ - cfg = __prepare_ICR2(mask); - apic_write(APIC_ICR2, cfg); - - /* - * program the ICR - */ - cfg = __prepare_ICR(0, vector); - - /* - * Send the IPI. The write to APIC_ICR fires this off. - */ - apic_write(APIC_ICR, cfg); -} - -/* - * This is only used on smaller machines. - */ -void send_IPI_mask_bitmask(const struct cpumask *cpumask, int vector) -{ - unsigned long mask = cpumask_bits(cpumask)[0]; - unsigned long flags; - - local_irq_save(flags); - WARN_ON(mask & ~cpumask_bits(cpu_online_mask)[0]); - __send_IPI_dest_field(mask, vector); - local_irq_restore(flags); -} - -void send_IPI_mask_sequence(const struct cpumask *mask, int vector) -{ - unsigned long flags; - unsigned int query_cpu; - - /* - * Hack. The clustered APIC addressing mode doesn't allow us to send - * to an arbitrary mask, so I do a unicasts to each CPU instead. This - * should be modified to do 1 message per cluster ID - mbligh - */ - - local_irq_save(flags); - for_each_cpu(query_cpu, mask) - __send_IPI_dest_field(cpu_to_logical_apicid(query_cpu), vector); - local_irq_restore(flags); -} - -void send_IPI_mask_allbutself(const struct cpumask *mask, int vector) -{ - unsigned long flags; - unsigned int query_cpu; - unsigned int this_cpu = smp_processor_id(); - - /* See Hack comment above */ - - local_irq_save(flags); - for_each_cpu(query_cpu, mask) - if (query_cpu != this_cpu) - __send_IPI_dest_field(cpu_to_logical_apicid(query_cpu), - vector); - local_irq_restore(flags); -} - -/* must come after the send_IPI functions above for inlining */ -static int convert_apicid_to_cpu(int apic_id) -{ - int i; - - for_each_possible_cpu(i) { - if (per_cpu(x86_cpu_to_apicid, i) == apic_id) - return i; - } - return -1; -} - -int safe_smp_processor_id(void) -{ - int apicid, cpuid; - - if (!boot_cpu_has(X86_FEATURE_APIC)) - return 0; - - apicid = hard_smp_processor_id(); - if (apicid == BAD_APICID) - return 0; - - cpuid = convert_apicid_to_cpu(apicid); - - return cpuid >= 0 ? cpuid : 0; -} -#endif Index: linux-2.6-tip/arch/x86/kernel/irq.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/irq.c +++ linux-2.6-tip/arch/x86/kernel/irq.c @@ -6,10 +6,12 @@ #include #include #include +#include #include #include #include +#include atomic_t irq_err_count; @@ -36,11 +38,7 @@ void ack_bad_irq(unsigned int irq) #endif } -#ifdef CONFIG_X86_32 -# define irq_stats(x) (&per_cpu(irq_stat, x)) -#else -# define irq_stats(x) cpu_pda(x) -#endif +#define irq_stats(x) (&per_cpu(irq_stat, x)) /* * /proc/interrupts printing: */ @@ -57,6 +55,10 @@ static int show_other_interrupts(struct for_each_online_cpu(j) seq_printf(p, "%10u ", irq_stats(j)->apic_timer_irqs); seq_printf(p, " Local timer interrupts\n"); + seq_printf(p, "CNT: "); + for_each_online_cpu(j) + seq_printf(p, "%10u ", irq_stats(j)->apic_perf_irqs); + seq_printf(p, " Performance counter interrupts\n"); #endif #ifdef CONFIG_SMP seq_printf(p, "RES: "); @@ -164,6 +166,7 @@ u64 arch_irq_stat_cpu(unsigned int cpu) #ifdef CONFIG_X86_LOCAL_APIC sum += irq_stats(cpu)->apic_timer_irqs; + sum += irq_stats(cpu)->apic_perf_irqs; #endif #ifdef CONFIG_SMP sum += irq_stats(cpu)->irq_resched_count; @@ -192,4 +195,40 @@ u64 arch_irq_stat(void) return sum; } + +/* + * do_IRQ handles all normal device IRQ's (the special + * SMP cross-CPU interrupts have their own specific + * handlers). + */ +unsigned int __irq_entry do_IRQ(struct pt_regs *regs) +{ + struct pt_regs *old_regs = set_irq_regs(regs); + + /* high bit used in ret_from_ code */ + unsigned vector = ~regs->orig_ax; + unsigned irq; + + exit_idle(); + irq_enter(); + + irq = __get_cpu_var(vector_irq)[vector]; + + if (!handle_irq(irq, regs)) { +#ifdef CONFIG_X86_64 + if (!disable_apic) + ack_APIC_irq(); +#endif + + if (printk_ratelimit()) + printk(KERN_EMERG "%s: %d.%d No irq handler for vector (irq %d)\n", + __func__, smp_processor_id(), vector, irq); + } + + irq_exit(); + + set_irq_regs(old_regs); + return 1; +} + EXPORT_SYMBOL_GPL(vector_used_by_percpu_irq); Index: linux-2.6-tip/arch/x86/kernel/irq_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/irq_32.c +++ linux-2.6-tip/arch/x86/kernel/irq_32.c @@ -191,33 +191,16 @@ static inline int execute_on_irq_stack(int overflow, struct irq_desc *desc, int irq) { return 0; } #endif -/* - * do_IRQ handles all normal device IRQ's (the special - * SMP cross-CPU interrupts have their own specific - * handlers). - */ -unsigned int do_IRQ(struct pt_regs *regs) +bool handle_irq(unsigned irq, struct pt_regs *regs) { - struct pt_regs *old_regs; - /* high bit used in ret_from_ code */ - int overflow; - unsigned vector = ~regs->orig_ax; struct irq_desc *desc; - unsigned irq; - - - old_regs = set_irq_regs(regs); - irq_enter(); - irq = __get_cpu_var(vector_irq)[vector]; + int overflow; overflow = check_stack_overflow(); desc = irq_to_desc(irq); - if (unlikely(!desc)) { - printk(KERN_EMERG "%s: cannot handle IRQ %d vector %#x cpu %d\n", - __func__, irq, vector, smp_processor_id()); - BUG(); - } + if (unlikely(!desc)) + return false; if (!execute_on_irq_stack(overflow, desc, irq)) { if (unlikely(overflow)) @@ -225,13 +208,10 @@ unsigned int do_IRQ(struct pt_regs *regs desc->handle_irq(irq, desc); } - irq_exit(); - set_irq_regs(old_regs); - return 1; + return true; } #ifdef CONFIG_HOTPLUG_CPU -#include /* A cpu has been removed from cpu_online_mask. Reset irq affinities. */ void fixup_irqs(void) @@ -248,7 +228,7 @@ void fixup_irqs(void) if (irq == 2) continue; - affinity = &desc->affinity; + affinity = desc->affinity; if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids) { printk("Breaking affinity for irq %i\n", irq); affinity = cpu_all_mask; Index: linux-2.6-tip/arch/x86/kernel/irq_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/irq_64.c +++ linux-2.6-tip/arch/x86/kernel/irq_64.c @@ -18,6 +18,13 @@ #include #include #include +#include + +DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat); +EXPORT_PER_CPU_SYMBOL(irq_stat); + +DEFINE_PER_CPU(struct pt_regs *, irq_regs); +EXPORT_PER_CPU_SYMBOL(irq_regs); /* * Probabilistic stack overflow check: @@ -41,42 +48,18 @@ static inline void stack_overflow_check( #endif } -/* - * do_IRQ handles all normal device IRQ's (the special - * SMP cross-CPU interrupts have their own specific - * handlers). - */ -asmlinkage unsigned int __irq_entry do_IRQ(struct pt_regs *regs) +bool handle_irq(unsigned irq, struct pt_regs *regs) { - struct pt_regs *old_regs = set_irq_regs(regs); struct irq_desc *desc; - /* high bit used in ret_from_ code */ - unsigned vector = ~regs->orig_ax; - unsigned irq; - - exit_idle(); - irq_enter(); - irq = __get_cpu_var(vector_irq)[vector]; - stack_overflow_check(regs); desc = irq_to_desc(irq); - if (likely(desc)) - generic_handle_irq_desc(irq, desc); - else { - if (!disable_apic) - ack_APIC_irq(); - - if (printk_ratelimit()) - printk(KERN_EMERG "%s: %d.%d No irq handler for vector\n", - __func__, smp_processor_id(), vector); - } - - irq_exit(); + if (unlikely(!desc)) + return false; - set_irq_regs(old_regs); - return 1; + generic_handle_irq_desc(irq, desc); + return true; } #ifdef CONFIG_HOTPLUG_CPU @@ -100,7 +83,7 @@ void fixup_irqs(void) /* interrupt's are disabled at this point */ spin_lock(&desc->lock); - affinity = &desc->affinity; + affinity = desc->affinity; if (!irq_has_action(irq) || cpumask_equal(affinity, cpu_online_mask)) { spin_unlock(&desc->lock); Index: linux-2.6-tip/arch/x86/kernel/irqinit_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/irqinit_32.c +++ linux-2.6-tip/arch/x86/kernel/irqinit_32.c @@ -18,7 +18,7 @@ #include #include #include -#include +#include #include #include @@ -50,6 +50,7 @@ static irqreturn_t math_error_irq(int cp */ static struct irqaction fpu_irq = { .handler = math_error_irq, + .flags = IRQF_NODELAY, .mask = CPU_MASK_NONE, .name = "fpu", }; @@ -78,6 +79,16 @@ void __init init_ISA_irqs(void) } } +/* + * IRQ2 is cascade interrupt to second interrupt controller + */ +static struct irqaction irq2 = { + .handler = no_action, + .flags = IRQF_NODELAY, + .mask = CPU_MASK_NONE, + .name = "cascade", +}; + DEFINE_PER_CPU(vector_irq_t, vector_irq) = { [0 ... IRQ0_VECTOR - 1] = -1, [IRQ0_VECTOR] = 0, @@ -111,28 +122,8 @@ int vector_used_by_percpu_irq(unsigned i return 0; } -/* Overridden in paravirt.c */ -void init_IRQ(void) __attribute__((weak, alias("native_init_IRQ"))); - -void __init native_init_IRQ(void) +static void __init smp_intr_init(void) { - int i; - - /* all the set up before the call gates are initialised */ - pre_intr_init_hook(); - - /* - * Cover the whole vector space, no vector can escape - * us. (some of these will be overridden and become - * 'special' SMP interrupts) - */ - for (i = FIRST_EXTERNAL_VECTOR; i < NR_VECTORS; i++) { - /* SYSCALL_VECTOR was reserved in trap_init. */ - if (i != SYSCALL_VECTOR) - set_intr_gate(i, interrupt[i-FIRST_EXTERNAL_VECTOR]); - } - - #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_SMP) /* * The reschedule interrupt is a CPU-to-CPU reschedule-helper @@ -140,8 +131,15 @@ void __init native_init_IRQ(void) */ alloc_intr_gate(RESCHEDULE_VECTOR, reschedule_interrupt); - /* IPI for invalidation */ - alloc_intr_gate(INVALIDATE_TLB_VECTOR, invalidate_interrupt); + /* IPIs for invalidation */ + alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+0, invalidate_interrupt0); + alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+1, invalidate_interrupt1); + alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+2, invalidate_interrupt2); + alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+3, invalidate_interrupt3); + alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+4, invalidate_interrupt4); + alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+5, invalidate_interrupt5); + alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+6, invalidate_interrupt6); + alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+7, invalidate_interrupt7); /* IPI for generic function call */ alloc_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt); @@ -154,6 +152,11 @@ void __init native_init_IRQ(void) set_intr_gate(IRQ_MOVE_CLEANUP_VECTOR, irq_move_cleanup_interrupt); set_bit(IRQ_MOVE_CLEANUP_VECTOR, used_vectors); #endif +} + +static void __init apic_intr_init(void) +{ + smp_intr_init(); #ifdef CONFIG_X86_LOCAL_APIC /* self generated IPI for local APIC timer */ @@ -162,17 +165,49 @@ void __init native_init_IRQ(void) /* IPI vectors for APIC spurious and error interrupts */ alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt); alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt); -#endif +# ifdef CONFIG_PERF_COUNTERS + alloc_intr_gate(LOCAL_PERF_VECTOR, perf_counter_interrupt); +# endif -#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86_MCE_P4THERMAL) +# ifdef CONFIG_X86_MCE_P4THERMAL /* thermal monitor LVT interrupt */ alloc_intr_gate(THERMAL_APIC_VECTOR, thermal_interrupt); +# endif #endif +} + +/* Overridden in paravirt.c */ +void init_IRQ(void) __attribute__((weak, alias("native_init_IRQ"))); + +void __init native_init_IRQ(void) +{ + int i; + + /* Execute any quirks before the call gates are initialised: */ + x86_quirk_pre_intr_init(); - /* setup after call gates are initialised (usually add in - * the architecture specific gates) + apic_intr_init(); + + /* + * Cover the whole vector space, no vector can escape + * us. (some of these will be overridden and become + * 'special' SMP interrupts) + */ + for (i = 0; i < (NR_VECTORS - FIRST_EXTERNAL_VECTOR); i++) { + int vector = FIRST_EXTERNAL_VECTOR + i; + /* SYSCALL_VECTOR was reserved in trap_init. */ + if (!test_bit(vector, used_vectors)) + set_intr_gate(vector, interrupt[i]); + } + + if (!acpi_ioapic) + setup_irq(2, &irq2); + + /* + * Call quirks after call gates are initialised (usually add in + * the architecture specific gates): */ - intr_init_hook(); + x86_quirk_intr_init(); /* * External FPU? Set up irq13 if so, for Index: linux-2.6-tip/arch/x86/kernel/irqinit_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/irqinit_64.c +++ linux-2.6-tip/arch/x86/kernel/irqinit_64.c @@ -150,6 +150,11 @@ static void __init apic_intr_init(void) /* IPI vectors for APIC spurious and error interrupts */ alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt); alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt); + + /* Performance monitoring interrupt: */ +#ifdef CONFIG_PERF_COUNTERS + alloc_intr_gate(LOCAL_PERF_VECTOR, perf_counter_interrupt); +#endif } void __init native_init_IRQ(void) @@ -157,6 +162,9 @@ void __init native_init_IRQ(void) int i; init_ISA_irqs(); + + apic_intr_init(); + /* * Cover the whole vector space, no vector can escape * us. (some of these will be overridden and become @@ -164,12 +172,10 @@ void __init native_init_IRQ(void) */ for (i = 0; i < (NR_VECTORS - FIRST_EXTERNAL_VECTOR); i++) { int vector = FIRST_EXTERNAL_VECTOR + i; - if (vector != IA32_SYSCALL_VECTOR) + if (!test_bit(vector, used_vectors)) set_intr_gate(vector, interrupt[i]); } - apic_intr_init(); - if (!acpi_ioapic) setup_irq(2, &irq2); } Index: linux-2.6-tip/arch/x86/kernel/kgdb.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/kgdb.c +++ linux-2.6-tip/arch/x86/kernel/kgdb.c @@ -46,7 +46,7 @@ #include #include -#include +#include /* * Put the error code here just in case the user cares: @@ -347,7 +347,7 @@ void kgdb_post_primary_code(struct pt_re */ void kgdb_roundup_cpus(unsigned long flags) { - send_IPI_allbutself(APIC_DM_NMI); + apic->send_IPI_allbutself(APIC_DM_NMI); } #endif Index: linux-2.6-tip/arch/x86/kernel/kprobes.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/kprobes.c +++ linux-2.6-tip/arch/x86/kernel/kprobes.c @@ -446,12 +446,12 @@ void __kprobes arch_prepare_kretprobe(st static void __kprobes setup_singlestep(struct kprobe *p, struct pt_regs *regs, struct kprobe_ctlblk *kcb) { -#if !defined(CONFIG_PREEMPT) || defined(CONFIG_FREEZER) +#if !defined(CONFIG_PREEMPT) || defined(CONFIG_PM) if (p->ainsn.boostable == 1 && !p->post_handler) { /* Boost up -- we can execute copied instructions directly */ reset_current_kprobe(); regs->ip = (unsigned long)p->ainsn.insn; - preempt_enable_no_resched(); + preempt_enable(); return; } #endif @@ -477,7 +477,7 @@ static int __kprobes reenter_kprobe(stru arch_disarm_kprobe(p); regs->ip = (unsigned long)p->addr; reset_current_kprobe(); - preempt_enable_no_resched(); + preempt_enable(); break; #endif case KPROBE_HIT_ACTIVE: @@ -573,7 +573,7 @@ static int __kprobes kprobe_handler(stru } } /* else: not a kprobe fault; let the kernel handle it */ - preempt_enable_no_resched(); + preempt_enable(); return 0; } @@ -872,7 +872,7 @@ static int __kprobes post_kprobe_handler } reset_current_kprobe(); out: - preempt_enable_no_resched(); + preempt_enable(); /* * if somebody else is singlestepping across a probe point, flags @@ -906,7 +906,7 @@ int __kprobes kprobe_fault_handler(struc restore_previous_kprobe(kcb); else reset_current_kprobe(); - preempt_enable_no_resched(); + preempt_enable(); break; case KPROBE_HIT_ACTIVE: case KPROBE_HIT_SSDONE: @@ -1047,7 +1047,7 @@ int __kprobes longjmp_break_handler(stru memcpy((kprobe_opcode_t *)(kcb->jprobe_saved_sp), kcb->jprobes_stack, MIN_STACK_SIZE(kcb->jprobe_saved_sp)); - preempt_enable_no_resched(); + preempt_enable(); return 1; } return 0; Index: linux-2.6-tip/arch/x86/kernel/kvmclock.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/kvmclock.c +++ linux-2.6-tip/arch/x86/kernel/kvmclock.c @@ -19,7 +19,6 @@ #include #include #include -#include #include #include #include Index: linux-2.6-tip/arch/x86/kernel/machine_kexec_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/machine_kexec_32.c +++ linux-2.6-tip/arch/x86/kernel/machine_kexec_32.c @@ -121,7 +121,7 @@ static void machine_kexec_page_table_set static void machine_kexec_prepare_page_tables(struct kimage *image) { void *control_page; - pmd_t *pmd = 0; + pmd_t *pmd = NULL; control_page = page_address(image->control_code_page); #ifdef CONFIG_X86_PAE Index: linux-2.6-tip/arch/x86/kernel/machine_kexec_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/machine_kexec_64.c +++ linux-2.6-tip/arch/x86/kernel/machine_kexec_64.c @@ -18,15 +18,6 @@ #include #include -#define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE))) -static u64 kexec_pgd[512] PAGE_ALIGNED; -static u64 kexec_pud0[512] PAGE_ALIGNED; -static u64 kexec_pmd0[512] PAGE_ALIGNED; -static u64 kexec_pte0[512] PAGE_ALIGNED; -static u64 kexec_pud1[512] PAGE_ALIGNED; -static u64 kexec_pmd1[512] PAGE_ALIGNED; -static u64 kexec_pte1[512] PAGE_ALIGNED; - static void init_level2_page(pmd_t *level2p, unsigned long addr) { unsigned long end_addr; @@ -107,12 +98,65 @@ out: return result; } +static void free_transition_pgtable(struct kimage *image) +{ + free_page((unsigned long)image->arch.pud); + free_page((unsigned long)image->arch.pmd); + free_page((unsigned long)image->arch.pte); +} + +static int init_transition_pgtable(struct kimage *image, pgd_t *pgd) +{ + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + unsigned long vaddr, paddr; + int result = -ENOMEM; + + vaddr = (unsigned long)relocate_kernel; + paddr = __pa(page_address(image->control_code_page)+PAGE_SIZE); + pgd += pgd_index(vaddr); + if (!pgd_present(*pgd)) { + pud = (pud_t *)get_zeroed_page(GFP_KERNEL); + if (!pud) + goto err; + image->arch.pud = pud; + set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE)); + } + pud = pud_offset(pgd, vaddr); + if (!pud_present(*pud)) { + pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL); + if (!pmd) + goto err; + image->arch.pmd = pmd; + set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE)); + } + pmd = pmd_offset(pud, vaddr); + if (!pmd_present(*pmd)) { + pte = (pte_t *)get_zeroed_page(GFP_KERNEL); + if (!pte) + goto err; + image->arch.pte = pte; + set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE)); + } + pte = pte_offset_kernel(pmd, vaddr); + set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, PAGE_KERNEL_EXEC)); + return 0; +err: + free_transition_pgtable(image); + return result; +} + static int init_pgtable(struct kimage *image, unsigned long start_pgtable) { pgd_t *level4p; + int result; level4p = (pgd_t *)__va(start_pgtable); - return init_level4_page(image, level4p, 0, max_pfn << PAGE_SHIFT); + result = init_level4_page(image, level4p, 0, max_pfn << PAGE_SHIFT); + if (result) + return result; + return init_transition_pgtable(image, level4p); } static void set_idt(void *newidt, u16 limit) @@ -174,7 +218,7 @@ int machine_kexec_prepare(struct kimage void machine_kexec_cleanup(struct kimage *image) { - return; + free_transition_pgtable(image); } /* @@ -195,22 +239,6 @@ void machine_kexec(struct kimage *image) memcpy(control_page, relocate_kernel, PAGE_SIZE); page_list[PA_CONTROL_PAGE] = virt_to_phys(control_page); - page_list[VA_CONTROL_PAGE] = (unsigned long)relocate_kernel; - page_list[PA_PGD] = virt_to_phys(&kexec_pgd); - page_list[VA_PGD] = (unsigned long)kexec_pgd; - page_list[PA_PUD_0] = virt_to_phys(&kexec_pud0); - page_list[VA_PUD_0] = (unsigned long)kexec_pud0; - page_list[PA_PMD_0] = virt_to_phys(&kexec_pmd0); - page_list[VA_PMD_0] = (unsigned long)kexec_pmd0; - page_list[PA_PTE_0] = virt_to_phys(&kexec_pte0); - page_list[VA_PTE_0] = (unsigned long)kexec_pte0; - page_list[PA_PUD_1] = virt_to_phys(&kexec_pud1); - page_list[VA_PUD_1] = (unsigned long)kexec_pud1; - page_list[PA_PMD_1] = virt_to_phys(&kexec_pmd1); - page_list[VA_PMD_1] = (unsigned long)kexec_pmd1; - page_list[PA_PTE_1] = virt_to_phys(&kexec_pte1); - page_list[VA_PTE_1] = (unsigned long)kexec_pte1; - page_list[PA_TABLE_PAGE] = (unsigned long)__pa(page_address(image->control_code_page)); Index: linux-2.6-tip/arch/x86/kernel/mca_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/mca_32.c +++ linux-2.6-tip/arch/x86/kernel/mca_32.c @@ -51,7 +51,6 @@ #include #include #include -#include static unsigned char which_scsi; @@ -474,6 +473,4 @@ void __kprobes mca_handle_nmi(void) * adapter was responsible for the error. */ bus_for_each_dev(&mca_bus_type, NULL, NULL, mca_handle_nmi_callback); - - mca_nmi_hook(); -} /* mca_handle_nmi */ +} Index: linux-2.6-tip/arch/x86/kernel/microcode_intel.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/microcode_intel.c +++ linux-2.6-tip/arch/x86/kernel/microcode_intel.c @@ -87,9 +87,9 @@ #include #include #include +#include #include -#include #include #include @@ -150,7 +150,7 @@ struct extended_sigtable { #define exttable_size(et) ((et)->count * EXT_SIGNATURE_SIZE + EXT_HEADER_SIZE) /* serialize access to the physical write to MSR 0x79 */ -static DEFINE_SPINLOCK(microcode_update_lock); +static DEFINE_RAW_SPINLOCK(microcode_update_lock); static int collect_cpu_info(int cpu_num, struct cpu_signature *csig) { @@ -196,7 +196,7 @@ static inline int update_match_cpu(struc return (!sigmatch(sig, csig->sig, pf, csig->pf)) ? 0 : 1; } -static inline int +static inline int update_match_revision(struct microcode_header_intel *mc_header, int rev) { return (mc_header->rev <= rev) ? 0 : 1; @@ -442,8 +442,8 @@ static int request_microcode_fw(int cpu, return ret; } - ret = generic_load_microcode(cpu, (void*)firmware->data, firmware->size, - &get_ucode_fw); + ret = generic_load_microcode(cpu, (void *)firmware->data, + firmware->size, &get_ucode_fw); release_firmware(firmware); @@ -460,7 +460,7 @@ static int request_microcode_user(int cp /* We should bind the task to the CPU */ BUG_ON(cpu != raw_smp_processor_id()); - return generic_load_microcode(cpu, (void*)buf, size, &get_ucode_user); + return generic_load_microcode(cpu, (void *)buf, size, &get_ucode_user); } static void microcode_fini_cpu(int cpu) Index: linux-2.6-tip/arch/x86/kernel/module_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/module_32.c +++ linux-2.6-tip/arch/x86/kernel/module_32.c @@ -42,7 +42,7 @@ void module_free(struct module *mod, voi { vfree(module_region); /* FIXME: If module_region == mod->init_region, trim exception - table entries. */ + table entries. */ } /* We don't need anything special. */ @@ -113,13 +113,13 @@ int module_finalize(const Elf_Ehdr *hdr, *para = NULL; char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset; - for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) { + for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) { if (!strcmp(".text", secstrings + s->sh_name)) text = s; if (!strcmp(".altinstructions", secstrings + s->sh_name)) alt = s; if (!strcmp(".smp_locks", secstrings + s->sh_name)) - locks= s; + locks = s; if (!strcmp(".parainstructions", secstrings + s->sh_name)) para = s; } Index: linux-2.6-tip/arch/x86/kernel/module_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/module_64.c +++ linux-2.6-tip/arch/x86/kernel/module_64.c @@ -30,14 +30,14 @@ #include #include -#define DEBUGP(fmt...) +#define DEBUGP(fmt...) #ifndef CONFIG_UML void module_free(struct module *mod, void *module_region) { vfree(module_region); /* FIXME: If module_region == mod->init_region, trim exception - table entries. */ + table entries. */ } void *module_alloc(unsigned long size) @@ -77,7 +77,7 @@ int apply_relocate_add(Elf64_Shdr *sechd Elf64_Rela *rel = (void *)sechdrs[relsec].sh_addr; Elf64_Sym *sym; void *loc; - u64 val; + u64 val; DEBUGP("Applying relocate section %u to %u\n", relsec, sechdrs[relsec].sh_info); @@ -91,11 +91,11 @@ int apply_relocate_add(Elf64_Shdr *sechd sym = (Elf64_Sym *)sechdrs[symindex].sh_addr + ELF64_R_SYM(rel[i].r_info); - DEBUGP("type %d st_value %Lx r_addend %Lx loc %Lx\n", - (int)ELF64_R_TYPE(rel[i].r_info), - sym->st_value, rel[i].r_addend, (u64)loc); + DEBUGP("type %d st_value %Lx r_addend %Lx loc %Lx\n", + (int)ELF64_R_TYPE(rel[i].r_info), + sym->st_value, rel[i].r_addend, (u64)loc); - val = sym->st_value + rel[i].r_addend; + val = sym->st_value + rel[i].r_addend; switch (ELF64_R_TYPE(rel[i].r_info)) { case R_X86_64_NONE: @@ -113,16 +113,16 @@ int apply_relocate_add(Elf64_Shdr *sechd if ((s64)val != *(s32 *)loc) goto overflow; break; - case R_X86_64_PC32: + case R_X86_64_PC32: val -= (u64)loc; *(u32 *)loc = val; #if 0 if ((s64)val != *(s32 *)loc) - goto overflow; + goto overflow; #endif break; default: - printk(KERN_ERR "module %s: Unknown rela relocation: %Lu\n", + printk(KERN_ERR "module %s: Unknown rela relocation: %llu\n", me->name, ELF64_R_TYPE(rel[i].r_info)); return -ENOEXEC; } @@ -130,7 +130,7 @@ int apply_relocate_add(Elf64_Shdr *sechd return 0; overflow: - printk(KERN_ERR "overflow in relocation type %d val %Lx\n", + printk(KERN_ERR "overflow in relocation type %d val %Lx\n", (int)ELF64_R_TYPE(rel[i].r_info), val); printk(KERN_ERR "`%s' likely not compiled with -mcmodel=kernel\n", me->name); @@ -143,13 +143,13 @@ int apply_relocate(Elf_Shdr *sechdrs, unsigned int relsec, struct module *me) { - printk("non add relocation not supported\n"); + printk(KERN_ERR "non add relocation not supported\n"); return -ENOSYS; -} +} int module_finalize(const Elf_Ehdr *hdr, - const Elf_Shdr *sechdrs, - struct module *me) + const Elf_Shdr *sechdrs, + struct module *me) { const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL, *para = NULL; @@ -161,7 +161,7 @@ int module_finalize(const Elf_Ehdr *hdr, if (!strcmp(".altinstructions", secstrings + s->sh_name)) alt = s; if (!strcmp(".smp_locks", secstrings + s->sh_name)) - locks= s; + locks = s; if (!strcmp(".parainstructions", secstrings + s->sh_name)) para = s; } Index: linux-2.6-tip/arch/x86/kernel/mpparse.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/mpparse.c +++ linux-2.6-tip/arch/x86/kernel/mpparse.c @@ -3,7 +3,7 @@ * compliant MP-table parsing routines. * * (c) 1995 Alan Cox, Building #3 - * (c) 1998, 1999, 2000 Ingo Molnar + * (c) 1998, 1999, 2000, 2009 Ingo Molnar * (c) 2008 Alexey Starikovskiy */ @@ -29,12 +29,7 @@ #include #include -#include -#ifdef CONFIG_X86_32 -#include -#include -#endif - +#include /* * Checksum an MP configuration block. */ @@ -144,11 +139,11 @@ static void __init MP_ioapic_info(struct if (bad_ioapic(m->apicaddr)) return; - mp_ioapics[nr_ioapics].mp_apicaddr = m->apicaddr; - mp_ioapics[nr_ioapics].mp_apicid = m->apicid; - mp_ioapics[nr_ioapics].mp_type = m->type; - mp_ioapics[nr_ioapics].mp_apicver = m->apicver; - mp_ioapics[nr_ioapics].mp_flags = m->flags; + mp_ioapics[nr_ioapics].apicaddr = m->apicaddr; + mp_ioapics[nr_ioapics].apicid = m->apicid; + mp_ioapics[nr_ioapics].type = m->type; + mp_ioapics[nr_ioapics].apicver = m->apicver; + mp_ioapics[nr_ioapics].flags = m->flags; nr_ioapics++; } @@ -160,55 +155,55 @@ static void print_MP_intsrc_info(struct m->srcbusirq, m->dstapic, m->dstirq); } -static void __init print_mp_irq_info(struct mp_config_intsrc *mp_irq) +static void __init print_mp_irq_info(struct mpc_intsrc *mp_irq) { apic_printk(APIC_VERBOSE, "Int: type %d, pol %d, trig %d, bus %02x," " IRQ %02x, APIC ID %x, APIC INT %02x\n", - mp_irq->mp_irqtype, mp_irq->mp_irqflag & 3, - (mp_irq->mp_irqflag >> 2) & 3, mp_irq->mp_srcbus, - mp_irq->mp_srcbusirq, mp_irq->mp_dstapic, mp_irq->mp_dstirq); + mp_irq->irqtype, mp_irq->irqflag & 3, + (mp_irq->irqflag >> 2) & 3, mp_irq->srcbus, + mp_irq->srcbusirq, mp_irq->dstapic, mp_irq->dstirq); } static void __init assign_to_mp_irq(struct mpc_intsrc *m, - struct mp_config_intsrc *mp_irq) + struct mpc_intsrc *mp_irq) { - mp_irq->mp_dstapic = m->dstapic; - mp_irq->mp_type = m->type; - mp_irq->mp_irqtype = m->irqtype; - mp_irq->mp_irqflag = m->irqflag; - mp_irq->mp_srcbus = m->srcbus; - mp_irq->mp_srcbusirq = m->srcbusirq; - mp_irq->mp_dstirq = m->dstirq; + mp_irq->dstapic = m->dstapic; + mp_irq->type = m->type; + mp_irq->irqtype = m->irqtype; + mp_irq->irqflag = m->irqflag; + mp_irq->srcbus = m->srcbus; + mp_irq->srcbusirq = m->srcbusirq; + mp_irq->dstirq = m->dstirq; } -static void __init assign_to_mpc_intsrc(struct mp_config_intsrc *mp_irq, +static void __init assign_to_mpc_intsrc(struct mpc_intsrc *mp_irq, struct mpc_intsrc *m) { - m->dstapic = mp_irq->mp_dstapic; - m->type = mp_irq->mp_type; - m->irqtype = mp_irq->mp_irqtype; - m->irqflag = mp_irq->mp_irqflag; - m->srcbus = mp_irq->mp_srcbus; - m->srcbusirq = mp_irq->mp_srcbusirq; - m->dstirq = mp_irq->mp_dstirq; + m->dstapic = mp_irq->dstapic; + m->type = mp_irq->type; + m->irqtype = mp_irq->irqtype; + m->irqflag = mp_irq->irqflag; + m->srcbus = mp_irq->srcbus; + m->srcbusirq = mp_irq->srcbusirq; + m->dstirq = mp_irq->dstirq; } -static int __init mp_irq_mpc_intsrc_cmp(struct mp_config_intsrc *mp_irq, +static int __init mp_irq_mpc_intsrc_cmp(struct mpc_intsrc *mp_irq, struct mpc_intsrc *m) { - if (mp_irq->mp_dstapic != m->dstapic) + if (mp_irq->dstapic != m->dstapic) return 1; - if (mp_irq->mp_type != m->type) + if (mp_irq->type != m->type) return 2; - if (mp_irq->mp_irqtype != m->irqtype) + if (mp_irq->irqtype != m->irqtype) return 3; - if (mp_irq->mp_irqflag != m->irqflag) + if (mp_irq->irqflag != m->irqflag) return 4; - if (mp_irq->mp_srcbus != m->srcbus) + if (mp_irq->srcbus != m->srcbus) return 5; - if (mp_irq->mp_srcbusirq != m->srcbusirq) + if (mp_irq->srcbusirq != m->srcbusirq) return 6; - if (mp_irq->mp_dstirq != m->dstirq) + if (mp_irq->dstirq != m->dstirq) return 7; return 0; @@ -292,16 +287,7 @@ static int __init smp_read_mpc(struct mp return 0; #ifdef CONFIG_X86_32 - /* - * need to make sure summit and es7000's mps_oem_check is safe to be - * called early via genericarch 's mps_oem_check - */ - if (early) { -#ifdef CONFIG_X86_NUMAQ - numaq_mps_oem_check(mpc, oem, str); -#endif - } else - mps_oem_check(mpc, oem, str); + generic_mps_oem_check(mpc, oem, str); #endif /* save the local APIC address, it might be non-default */ if (!acpi_lapic) @@ -386,13 +372,13 @@ static int __init smp_read_mpc(struct mp (*x86_quirks->mpc_record)++; } -#ifdef CONFIG_X86_GENERICARCH - generic_bigsmp_probe(); +#ifdef CONFIG_X86_BIGSMP + generic_bigsmp_probe(); #endif -#ifdef CONFIG_X86_32 - setup_apic_routing(); -#endif + if (apic->setup_apic_routing) + apic->setup_apic_routing(); + if (!num_processors) printk(KERN_ERR "MPTABLE: no processors registered!\n"); return num_processors; @@ -417,7 +403,7 @@ static void __init construct_default_ioi intsrc.type = MP_INTSRC; intsrc.irqflag = 0; /* conforming */ intsrc.srcbus = 0; - intsrc.dstapic = mp_ioapics[0].mp_apicid; + intsrc.dstapic = mp_ioapics[0].apicid; intsrc.irqtype = mp_INT; @@ -570,14 +556,14 @@ static inline void __init construct_defa } } -static struct intel_mp_floating *mpf_found; +static struct mpf_intel *mpf_found; /* * Scan the memory blocks for an SMP configuration block. */ static void __init __get_smp_config(unsigned int early) { - struct intel_mp_floating *mpf = mpf_found; + struct mpf_intel *mpf = mpf_found; if (!mpf) return; @@ -598,9 +584,9 @@ static void __init __get_smp_config(unsi } printk(KERN_INFO "Intel MultiProcessor Specification v1.%d\n", - mpf->mpf_specification); + mpf->specification); #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86_32) - if (mpf->mpf_feature2 & (1 << 7)) { + if (mpf->feature2 & (1 << 7)) { printk(KERN_INFO " IMCR and PIC compatibility mode.\n"); pic_mode = 1; } else { @@ -611,7 +597,7 @@ static void __init __get_smp_config(unsi /* * Now see if we need to read further. */ - if (mpf->mpf_feature1 != 0) { + if (mpf->feature1 != 0) { if (early) { /* * local APIC has default address @@ -621,16 +607,16 @@ static void __init __get_smp_config(unsi } printk(KERN_INFO "Default MP configuration #%d\n", - mpf->mpf_feature1); - construct_default_ISA_mptable(mpf->mpf_feature1); + mpf->feature1); + construct_default_ISA_mptable(mpf->feature1); - } else if (mpf->mpf_physptr) { + } else if (mpf->physptr) { /* * Read the physical hardware table. Anything here will * override the defaults. */ - if (!smp_read_mpc(phys_to_virt(mpf->mpf_physptr), early)) { + if (!smp_read_mpc(phys_to_virt(mpf->physptr), early)) { #ifdef CONFIG_X86_LOCAL_APIC smp_found_config = 0; #endif @@ -688,32 +674,32 @@ static int __init smp_scan_config(unsign unsigned reserve) { unsigned int *bp = phys_to_virt(base); - struct intel_mp_floating *mpf; + struct mpf_intel *mpf; apic_printk(APIC_VERBOSE, "Scan SMP from %p for %ld bytes.\n", bp, length); BUILD_BUG_ON(sizeof(*mpf) != 16); while (length > 0) { - mpf = (struct intel_mp_floating *)bp; + mpf = (struct mpf_intel *)bp; if ((*bp == SMP_MAGIC_IDENT) && - (mpf->mpf_length == 1) && + (mpf->length == 1) && !mpf_checksum((unsigned char *)bp, 16) && - ((mpf->mpf_specification == 1) - || (mpf->mpf_specification == 4))) { + ((mpf->specification == 1) + || (mpf->specification == 4))) { #ifdef CONFIG_X86_LOCAL_APIC smp_found_config = 1; #endif mpf_found = mpf; - printk(KERN_INFO "found SMP MP-table at [%p] %08lx\n", - mpf, virt_to_phys(mpf)); + printk(KERN_INFO "found SMP MP-table at [%p] %llx\n", + mpf, (u64)virt_to_phys(mpf)); if (!reserve) return 1; reserve_bootmem_generic(virt_to_phys(mpf), PAGE_SIZE, BOOTMEM_DEFAULT); - if (mpf->mpf_physptr) { + if (mpf->physptr) { unsigned long size = PAGE_SIZE; #ifdef CONFIG_X86_32 /* @@ -722,15 +708,24 @@ static int __init smp_scan_config(unsign * the bottom is mapped now. * PC-9800's MPC table places on the very last * of physical memory; so that simply reserving - * PAGE_SIZE from mpg->mpf_physptr yields BUG() + * PAGE_SIZE from mpf->physptr yields BUG() * in reserve_bootmem. + * also need to make sure physptr is below than + * max_low_pfn + * we don't need reserve the area above max_low_pfn */ unsigned long end = max_low_pfn * PAGE_SIZE; - if (mpf->mpf_physptr + size > end) - size = end - mpf->mpf_physptr; -#endif - reserve_bootmem_generic(mpf->mpf_physptr, size, + + if (mpf->physptr < end) { + if (mpf->physptr + size > end) + size = end - mpf->physptr; + reserve_bootmem_generic(mpf->physptr, size, + BOOTMEM_DEFAULT); + } +#else + reserve_bootmem_generic(mpf->physptr, size, BOOTMEM_DEFAULT); +#endif } return 1; @@ -809,15 +804,15 @@ static int __init get_MP_intsrc_index(s /* not legacy */ for (i = 0; i < mp_irq_entries; i++) { - if (mp_irqs[i].mp_irqtype != mp_INT) + if (mp_irqs[i].irqtype != mp_INT) continue; - if (mp_irqs[i].mp_irqflag != 0x0f) + if (mp_irqs[i].irqflag != 0x0f) continue; - if (mp_irqs[i].mp_srcbus != m->srcbus) + if (mp_irqs[i].srcbus != m->srcbus) continue; - if (mp_irqs[i].mp_srcbusirq != m->srcbusirq) + if (mp_irqs[i].srcbusirq != m->srcbusirq) continue; if (irq_used[i]) { /* already claimed */ @@ -922,10 +917,10 @@ static int __init replace_intsrc_all(st if (irq_used[i]) continue; - if (mp_irqs[i].mp_irqtype != mp_INT) + if (mp_irqs[i].irqtype != mp_INT) continue; - if (mp_irqs[i].mp_irqflag != 0x0f) + if (mp_irqs[i].irqflag != 0x0f) continue; if (nr_m_spare > 0) { @@ -1001,7 +996,7 @@ static int __init update_mp_table(void) { char str[16]; char oem[10]; - struct intel_mp_floating *mpf; + struct mpf_intel *mpf; struct mpc_table *mpc, *mpc_new; if (!enable_update_mptable) @@ -1014,19 +1009,19 @@ static int __init update_mp_table(void) /* * Now see if we need to go further. */ - if (mpf->mpf_feature1 != 0) + if (mpf->feature1 != 0) return 0; - if (!mpf->mpf_physptr) + if (!mpf->physptr) return 0; - mpc = phys_to_virt(mpf->mpf_physptr); + mpc = phys_to_virt(mpf->physptr); if (!smp_check_mpc(mpc, oem, str)) return 0; - printk(KERN_INFO "mpf: %lx\n", virt_to_phys(mpf)); - printk(KERN_INFO "mpf_physptr: %x\n", mpf->mpf_physptr); + printk(KERN_INFO "mpf: %llx\n", (u64)virt_to_phys(mpf)); + printk(KERN_INFO "physptr: %x\n", mpf->physptr); if (mpc_new_phys && mpc->length > mpc_new_length) { mpc_new_phys = 0; @@ -1047,23 +1042,23 @@ static int __init update_mp_table(void) } printk(KERN_INFO "use in-positon replacing\n"); } else { - mpf->mpf_physptr = mpc_new_phys; + mpf->physptr = mpc_new_phys; mpc_new = phys_to_virt(mpc_new_phys); memcpy(mpc_new, mpc, mpc->length); mpc = mpc_new; /* check if we can modify that */ - if (mpc_new_phys - mpf->mpf_physptr) { - struct intel_mp_floating *mpf_new; + if (mpc_new_phys - mpf->physptr) { + struct mpf_intel *mpf_new; /* steal 16 bytes from [0, 1k) */ printk(KERN_INFO "mpf new: %x\n", 0x400 - 16); mpf_new = phys_to_virt(0x400 - 16); memcpy(mpf_new, mpf, 16); mpf = mpf_new; - mpf->mpf_physptr = mpc_new_phys; + mpf->physptr = mpc_new_phys; } - mpf->mpf_checksum = 0; - mpf->mpf_checksum -= mpf_checksum((unsigned char *)mpf, 16); - printk(KERN_INFO "mpf_physptr new: %x\n", mpf->mpf_physptr); + mpf->checksum = 0; + mpf->checksum -= mpf_checksum((unsigned char *)mpf, 16); + printk(KERN_INFO "physptr new: %x\n", mpf->physptr); } /* Index: linux-2.6-tip/arch/x86/kernel/msr.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/msr.c +++ linux-2.6-tip/arch/x86/kernel/msr.c @@ -35,10 +35,10 @@ #include #include #include +#include #include #include -#include #include static struct class *msr_class; Index: linux-2.6-tip/arch/x86/kernel/nmi.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/nmi.c +++ /dev/null @@ -1,572 +0,0 @@ -/* - * NMI watchdog support on APIC systems - * - * Started by Ingo Molnar - * - * Fixes: - * Mikael Pettersson : AMD K7 support for local APIC NMI watchdog. - * Mikael Pettersson : Power Management for local APIC NMI watchdog. - * Mikael Pettersson : Pentium 4 support for local APIC NMI watchdog. - * Pavel Machek and - * Mikael Pettersson : PM converted to driver model. Disable/enable API. - */ - -#include - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include -#include -#include - -#include - -#include - -int unknown_nmi_panic; -int nmi_watchdog_enabled; - -static cpumask_t backtrace_mask = CPU_MASK_NONE; - -/* nmi_active: - * >0: the lapic NMI watchdog is active, but can be disabled - * <0: the lapic NMI watchdog has not been set up, and cannot - * be enabled - * 0: the lapic NMI watchdog is disabled, but can be enabled - */ -atomic_t nmi_active = ATOMIC_INIT(0); /* oprofile uses this */ -EXPORT_SYMBOL(nmi_active); - -unsigned int nmi_watchdog = NMI_NONE; -EXPORT_SYMBOL(nmi_watchdog); - -static int panic_on_timeout; - -static unsigned int nmi_hz = HZ; -static DEFINE_PER_CPU(short, wd_enabled); -static int endflag __initdata; - -static inline unsigned int get_nmi_count(int cpu) -{ -#ifdef CONFIG_X86_64 - return cpu_pda(cpu)->__nmi_count; -#else - return nmi_count(cpu); -#endif -} - -static inline int mce_in_progress(void) -{ -#if defined(CONFIG_X86_64) && defined(CONFIG_X86_MCE) - return atomic_read(&mce_entry) > 0; -#endif - return 0; -} - -/* - * Take the local apic timer and PIT/HPET into account. We don't - * know which one is active, when we have highres/dyntick on - */ -static inline unsigned int get_timer_irqs(int cpu) -{ -#ifdef CONFIG_X86_64 - return read_pda(apic_timer_irqs) + read_pda(irq0_irqs); -#else - return per_cpu(irq_stat, cpu).apic_timer_irqs + - per_cpu(irq_stat, cpu).irq0_irqs; -#endif -} - -#ifdef CONFIG_SMP -/* - * The performance counters used by NMI_LOCAL_APIC don't trigger when - * the CPU is idle. To make sure the NMI watchdog really ticks on all - * CPUs during the test make them busy. - */ -static __init void nmi_cpu_busy(void *data) -{ - local_irq_enable_in_hardirq(); - /* - * Intentionally don't use cpu_relax here. This is - * to make sure that the performance counter really ticks, - * even if there is a simulator or similar that catches the - * pause instruction. On a real HT machine this is fine because - * all other CPUs are busy with "useless" delay loops and don't - * care if they get somewhat less cycles. - */ - while (endflag == 0) - mb(); -} -#endif - -static void report_broken_nmi(int cpu, int *prev_nmi_count) -{ - printk(KERN_CONT "\n"); - - printk(KERN_WARNING - "WARNING: CPU#%d: NMI appears to be stuck (%d->%d)!\n", - cpu, prev_nmi_count[cpu], get_nmi_count(cpu)); - - printk(KERN_WARNING - "Please report this to bugzilla.kernel.org,\n"); - printk(KERN_WARNING - "and attach the output of the 'dmesg' command.\n"); - - per_cpu(wd_enabled, cpu) = 0; - atomic_dec(&nmi_active); -} - -static void __acpi_nmi_disable(void *__unused) -{ - apic_write(APIC_LVT0, APIC_DM_NMI | APIC_LVT_MASKED); -} - -int __init check_nmi_watchdog(void) -{ - unsigned int *prev_nmi_count; - int cpu; - - if (!nmi_watchdog_active() || !atomic_read(&nmi_active)) - return 0; - - prev_nmi_count = kmalloc(nr_cpu_ids * sizeof(int), GFP_KERNEL); - if (!prev_nmi_count) - goto error; - - printk(KERN_INFO "Testing NMI watchdog ... "); - -#ifdef CONFIG_SMP - if (nmi_watchdog == NMI_LOCAL_APIC) - smp_call_function(nmi_cpu_busy, (void *)&endflag, 0); -#endif - - for_each_possible_cpu(cpu) - prev_nmi_count[cpu] = get_nmi_count(cpu); - local_irq_enable(); - mdelay((20 * 1000) / nmi_hz); /* wait 20 ticks */ - - for_each_online_cpu(cpu) { - if (!per_cpu(wd_enabled, cpu)) - continue; - if (get_nmi_count(cpu) - prev_nmi_count[cpu] <= 5) - report_broken_nmi(cpu, prev_nmi_count); - } - endflag = 1; - if (!atomic_read(&nmi_active)) { - kfree(prev_nmi_count); - atomic_set(&nmi_active, -1); - goto error; - } - printk("OK.\n"); - - /* - * now that we know it works we can reduce NMI frequency to - * something more reasonable; makes a difference in some configs - */ - if (nmi_watchdog == NMI_LOCAL_APIC) - nmi_hz = lapic_adjust_nmi_hz(1); - - kfree(prev_nmi_count); - return 0; -error: - if (nmi_watchdog == NMI_IO_APIC) { - if (!timer_through_8259) - disable_8259A_irq(0); - on_each_cpu(__acpi_nmi_disable, NULL, 1); - } - -#ifdef CONFIG_X86_32 - timer_ack = 0; -#endif - return -1; -} - -static int __init setup_nmi_watchdog(char *str) -{ - unsigned int nmi; - - if (!strncmp(str, "panic", 5)) { - panic_on_timeout = 1; - str = strchr(str, ','); - if (!str) - return 1; - ++str; - } - - if (!strncmp(str, "lapic", 5)) - nmi_watchdog = NMI_LOCAL_APIC; - else if (!strncmp(str, "ioapic", 6)) - nmi_watchdog = NMI_IO_APIC; - else { - get_option(&str, &nmi); - if (nmi >= NMI_INVALID) - return 0; - nmi_watchdog = nmi; - } - - return 1; -} -__setup("nmi_watchdog=", setup_nmi_watchdog); - -/* - * Suspend/resume support - */ -#ifdef CONFIG_PM - -static int nmi_pm_active; /* nmi_active before suspend */ - -static int lapic_nmi_suspend(struct sys_device *dev, pm_message_t state) -{ - /* only CPU0 goes here, other CPUs should be offline */ - nmi_pm_active = atomic_read(&nmi_active); - stop_apic_nmi_watchdog(NULL); - BUG_ON(atomic_read(&nmi_active) != 0); - return 0; -} - -static int lapic_nmi_resume(struct sys_device *dev) -{ - /* only CPU0 goes here, other CPUs should be offline */ - if (nmi_pm_active > 0) { - setup_apic_nmi_watchdog(NULL); - touch_nmi_watchdog(); - } - return 0; -} - -static struct sysdev_class nmi_sysclass = { - .name = "lapic_nmi", - .resume = lapic_nmi_resume, - .suspend = lapic_nmi_suspend, -}; - -static struct sys_device device_lapic_nmi = { - .id = 0, - .cls = &nmi_sysclass, -}; - -static int __init init_lapic_nmi_sysfs(void) -{ - int error; - - /* - * should really be a BUG_ON but b/c this is an - * init call, it just doesn't work. -dcz - */ - if (nmi_watchdog != NMI_LOCAL_APIC) - return 0; - - if (atomic_read(&nmi_active) < 0) - return 0; - - error = sysdev_class_register(&nmi_sysclass); - if (!error) - error = sysdev_register(&device_lapic_nmi); - return error; -} - -/* must come after the local APIC's device_initcall() */ -late_initcall(init_lapic_nmi_sysfs); - -#endif /* CONFIG_PM */ - -static void __acpi_nmi_enable(void *__unused) -{ - apic_write(APIC_LVT0, APIC_DM_NMI); -} - -/* - * Enable timer based NMIs on all CPUs: - */ -void acpi_nmi_enable(void) -{ - if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC) - on_each_cpu(__acpi_nmi_enable, NULL, 1); -} - -/* - * Disable timer based NMIs on all CPUs: - */ -void acpi_nmi_disable(void) -{ - if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC) - on_each_cpu(__acpi_nmi_disable, NULL, 1); -} - -/* - * This function is called as soon the LAPIC NMI watchdog driver has everything - * in place and it's ready to check if the NMIs belong to the NMI watchdog - */ -void cpu_nmi_set_wd_enabled(void) -{ - __get_cpu_var(wd_enabled) = 1; -} - -void setup_apic_nmi_watchdog(void *unused) -{ - if (__get_cpu_var(wd_enabled)) - return; - - /* cheap hack to support suspend/resume */ - /* if cpu0 is not active neither should the other cpus */ - if (smp_processor_id() != 0 && atomic_read(&nmi_active) <= 0) - return; - - switch (nmi_watchdog) { - case NMI_LOCAL_APIC: - if (lapic_watchdog_init(nmi_hz) < 0) { - __get_cpu_var(wd_enabled) = 0; - return; - } - /* FALL THROUGH */ - case NMI_IO_APIC: - __get_cpu_var(wd_enabled) = 1; - atomic_inc(&nmi_active); - } -} - -void stop_apic_nmi_watchdog(void *unused) -{ - /* only support LOCAL and IO APICs for now */ - if (!nmi_watchdog_active()) - return; - if (__get_cpu_var(wd_enabled) == 0) - return; - if (nmi_watchdog == NMI_LOCAL_APIC) - lapic_watchdog_stop(); - else - __acpi_nmi_disable(NULL); - __get_cpu_var(wd_enabled) = 0; - atomic_dec(&nmi_active); -} - -/* - * the best way to detect whether a CPU has a 'hard lockup' problem - * is to check it's local APIC timer IRQ counts. If they are not - * changing then that CPU has some problem. - * - * as these watchdog NMI IRQs are generated on every CPU, we only - * have to check the current processor. - * - * since NMIs don't listen to _any_ locks, we have to be extremely - * careful not to rely on unsafe variables. The printk might lock - * up though, so we have to break up any console locks first ... - * [when there will be more tty-related locks, break them up here too!] - */ - -static DEFINE_PER_CPU(unsigned, last_irq_sum); -static DEFINE_PER_CPU(local_t, alert_counter); -static DEFINE_PER_CPU(int, nmi_touch); - -void touch_nmi_watchdog(void) -{ - if (nmi_watchdog_active()) { - unsigned cpu; - - /* - * Tell other CPUs to reset their alert counters. We cannot - * do it ourselves because the alert count increase is not - * atomic. - */ - for_each_present_cpu(cpu) { - if (per_cpu(nmi_touch, cpu) != 1) - per_cpu(nmi_touch, cpu) = 1; - } - } - - /* - * Tickle the softlockup detector too: - */ - touch_softlockup_watchdog(); -} -EXPORT_SYMBOL(touch_nmi_watchdog); - -notrace __kprobes int -nmi_watchdog_tick(struct pt_regs *regs, unsigned reason) -{ - /* - * Since current_thread_info()-> is always on the stack, and we - * always switch the stack NMI-atomically, it's safe to use - * smp_processor_id(). - */ - unsigned int sum; - int touched = 0; - int cpu = smp_processor_id(); - int rc = 0; - - /* check for other users first */ - if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT) - == NOTIFY_STOP) { - rc = 1; - touched = 1; - } - - sum = get_timer_irqs(cpu); - - if (__get_cpu_var(nmi_touch)) { - __get_cpu_var(nmi_touch) = 0; - touched = 1; - } - - if (cpu_isset(cpu, backtrace_mask)) { - static DEFINE_SPINLOCK(lock); /* Serialise the printks */ - - spin_lock(&lock); - printk(KERN_WARNING "NMI backtrace for cpu %d\n", cpu); - dump_stack(); - spin_unlock(&lock); - cpu_clear(cpu, backtrace_mask); - } - - /* Could check oops_in_progress here too, but it's safer not to */ - if (mce_in_progress()) - touched = 1; - - /* if the none of the timers isn't firing, this cpu isn't doing much */ - if (!touched && __get_cpu_var(last_irq_sum) == sum) { - /* - * Ayiee, looks like this CPU is stuck ... - * wait a few IRQs (5 seconds) before doing the oops ... - */ - local_inc(&__get_cpu_var(alert_counter)); - if (local_read(&__get_cpu_var(alert_counter)) == 5 * nmi_hz) - /* - * die_nmi will return ONLY if NOTIFY_STOP happens.. - */ - die_nmi("BUG: NMI Watchdog detected LOCKUP", - regs, panic_on_timeout); - } else { - __get_cpu_var(last_irq_sum) = sum; - local_set(&__get_cpu_var(alert_counter), 0); - } - - /* see if the nmi watchdog went off */ - if (!__get_cpu_var(wd_enabled)) - return rc; - switch (nmi_watchdog) { - case NMI_LOCAL_APIC: - rc |= lapic_wd_event(nmi_hz); - break; - case NMI_IO_APIC: - /* - * don't know how to accurately check for this. - * just assume it was a watchdog timer interrupt - * This matches the old behaviour. - */ - rc = 1; - break; - } - return rc; -} - -#ifdef CONFIG_SYSCTL - -static void enable_ioapic_nmi_watchdog_single(void *unused) -{ - __get_cpu_var(wd_enabled) = 1; - atomic_inc(&nmi_active); - __acpi_nmi_enable(NULL); -} - -static void enable_ioapic_nmi_watchdog(void) -{ - on_each_cpu(enable_ioapic_nmi_watchdog_single, NULL, 1); - touch_nmi_watchdog(); -} - -static void disable_ioapic_nmi_watchdog(void) -{ - on_each_cpu(stop_apic_nmi_watchdog, NULL, 1); -} - -static int __init setup_unknown_nmi_panic(char *str) -{ - unknown_nmi_panic = 1; - return 1; -} -__setup("unknown_nmi_panic", setup_unknown_nmi_panic); - -static int unknown_nmi_panic_callback(struct pt_regs *regs, int cpu) -{ - unsigned char reason = get_nmi_reason(); - char buf[64]; - - sprintf(buf, "NMI received for unknown reason %02x\n", reason); - die_nmi(buf, regs, 1); /* Always panic here */ - return 0; -} - -/* - * proc handler for /proc/sys/kernel/nmi - */ -int proc_nmi_enabled(struct ctl_table *table, int write, struct file *file, - void __user *buffer, size_t *length, loff_t *ppos) -{ - int old_state; - - nmi_watchdog_enabled = (atomic_read(&nmi_active) > 0) ? 1 : 0; - old_state = nmi_watchdog_enabled; - proc_dointvec(table, write, file, buffer, length, ppos); - if (!!old_state == !!nmi_watchdog_enabled) - return 0; - - if (atomic_read(&nmi_active) < 0 || !nmi_watchdog_active()) { - printk(KERN_WARNING - "NMI watchdog is permanently disabled\n"); - return -EIO; - } - - if (nmi_watchdog == NMI_LOCAL_APIC) { - if (nmi_watchdog_enabled) - enable_lapic_nmi_watchdog(); - else - disable_lapic_nmi_watchdog(); - } else if (nmi_watchdog == NMI_IO_APIC) { - if (nmi_watchdog_enabled) - enable_ioapic_nmi_watchdog(); - else - disable_ioapic_nmi_watchdog(); - } else { - printk(KERN_WARNING - "NMI watchdog doesn't know what hardware to touch\n"); - return -EIO; - } - return 0; -} - -#endif /* CONFIG_SYSCTL */ - -int do_nmi_callback(struct pt_regs *regs, int cpu) -{ -#ifdef CONFIG_SYSCTL - if (unknown_nmi_panic) - return unknown_nmi_panic_callback(regs, cpu); -#endif - return 0; -} - -void __trigger_all_cpu_backtrace(void) -{ - int i; - - backtrace_mask = cpu_online_map; - /* Wait for up to 10 seconds for all CPUs to do the backtrace */ - for (i = 0; i < 10 * 1000; i++) { - if (cpus_empty(backtrace_mask)) - break; - mdelay(1); - } -} Index: linux-2.6-tip/arch/x86/kernel/numaq_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/numaq_32.c +++ /dev/null @@ -1,293 +0,0 @@ -/* - * Written by: Patricia Gaughen, IBM Corporation - * - * Copyright (C) 2002, IBM Corp. - * - * All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or - * NON INFRINGEMENT. See the GNU General Public License for more - * details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. - * - * Send feedback to - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#define MB_TO_PAGES(addr) ((addr) << (20 - PAGE_SHIFT)) - -/* - * Function: smp_dump_qct() - * - * Description: gets memory layout from the quad config table. This - * function also updates node_online_map with the nodes (quads) present. - */ -static void __init smp_dump_qct(void) -{ - int node; - struct eachquadmem *eq; - struct sys_cfg_data *scd = - (struct sys_cfg_data *)__va(SYS_CFG_DATA_PRIV_ADDR); - - nodes_clear(node_online_map); - for_each_node(node) { - if (scd->quads_present31_0 & (1 << node)) { - node_set_online(node); - eq = &scd->eq[node]; - /* Convert to pages */ - node_start_pfn[node] = MB_TO_PAGES( - eq->hi_shrd_mem_start - eq->priv_mem_size); - node_end_pfn[node] = MB_TO_PAGES( - eq->hi_shrd_mem_start + eq->hi_shrd_mem_size); - - e820_register_active_regions(node, node_start_pfn[node], - node_end_pfn[node]); - memory_present(node, - node_start_pfn[node], node_end_pfn[node]); - node_remap_size[node] = node_memmap_size_bytes(node, - node_start_pfn[node], - node_end_pfn[node]); - } - } -} - - -void __cpuinit numaq_tsc_disable(void) -{ - if (!found_numaq) - return; - - if (num_online_nodes() > 1) { - printk(KERN_DEBUG "NUMAQ: disabling TSC\n"); - setup_clear_cpu_cap(X86_FEATURE_TSC); - } -} - -static int __init numaq_pre_time_init(void) -{ - numaq_tsc_disable(); - return 0; -} - -int found_numaq; -/* - * Have to match translation table entries to main table entries by counter - * hence the mpc_record variable .... can't see a less disgusting way of - * doing this .... - */ -struct mpc_config_translation { - unsigned char mpc_type; - unsigned char trans_len; - unsigned char trans_type; - unsigned char trans_quad; - unsigned char trans_global; - unsigned char trans_local; - unsigned short trans_reserved; -}; - -/* x86_quirks member */ -static int mpc_record; -static struct mpc_config_translation *translation_table[MAX_MPC_ENTRY] - __cpuinitdata; - -static inline int generate_logical_apicid(int quad, int phys_apicid) -{ - return (quad << 4) + (phys_apicid ? phys_apicid << 1 : 1); -} - -/* x86_quirks member */ -static int mpc_apic_id(struct mpc_cpu *m) -{ - int quad = translation_table[mpc_record]->trans_quad; - int logical_apicid = generate_logical_apicid(quad, m->apicid); - - printk(KERN_DEBUG "Processor #%d %u:%u APIC version %d (quad %d, apic %d)\n", - m->apicid, (m->cpufeature & CPU_FAMILY_MASK) >> 8, - (m->cpufeature & CPU_MODEL_MASK) >> 4, - m->apicver, quad, logical_apicid); - return logical_apicid; -} - -int mp_bus_id_to_node[MAX_MP_BUSSES]; - -int mp_bus_id_to_local[MAX_MP_BUSSES]; - -/* x86_quirks member */ -static void mpc_oem_bus_info(struct mpc_bus *m, char *name) -{ - int quad = translation_table[mpc_record]->trans_quad; - int local = translation_table[mpc_record]->trans_local; - - mp_bus_id_to_node[m->busid] = quad; - mp_bus_id_to_local[m->busid] = local; - printk(KERN_INFO "Bus #%d is %s (node %d)\n", - m->busid, name, quad); -} - -int quad_local_to_mp_bus_id [NR_CPUS/4][4]; - -/* x86_quirks member */ -static void mpc_oem_pci_bus(struct mpc_bus *m) -{ - int quad = translation_table[mpc_record]->trans_quad; - int local = translation_table[mpc_record]->trans_local; - - quad_local_to_mp_bus_id[quad][local] = m->busid; -} - -static void __init MP_translation_info(struct mpc_config_translation *m) -{ - printk(KERN_INFO - "Translation: record %d, type %d, quad %d, global %d, local %d\n", - mpc_record, m->trans_type, m->trans_quad, m->trans_global, - m->trans_local); - - if (mpc_record >= MAX_MPC_ENTRY) - printk(KERN_ERR "MAX_MPC_ENTRY exceeded!\n"); - else - translation_table[mpc_record] = m; /* stash this for later */ - if (m->trans_quad < MAX_NUMNODES && !node_online(m->trans_quad)) - node_set_online(m->trans_quad); -} - -static int __init mpf_checksum(unsigned char *mp, int len) -{ - int sum = 0; - - while (len--) - sum += *mp++; - - return sum & 0xFF; -} - -/* - * Read/parse the MPC oem tables - */ - -static void __init smp_read_mpc_oem(struct mpc_oemtable *oemtable, - unsigned short oemsize) -{ - int count = sizeof(*oemtable); /* the header size */ - unsigned char *oemptr = ((unsigned char *)oemtable) + count; - - mpc_record = 0; - printk(KERN_INFO "Found an OEM MPC table at %8p - parsing it ... \n", - oemtable); - if (memcmp(oemtable->signature, MPC_OEM_SIGNATURE, 4)) { - printk(KERN_WARNING - "SMP mpc oemtable: bad signature [%c%c%c%c]!\n", - oemtable->signature[0], oemtable->signature[1], - oemtable->signature[2], oemtable->signature[3]); - return; - } - if (mpf_checksum((unsigned char *)oemtable, oemtable->length)) { - printk(KERN_WARNING "SMP oem mptable: checksum error!\n"); - return; - } - while (count < oemtable->length) { - switch (*oemptr) { - case MP_TRANSLATION: - { - struct mpc_config_translation *m = - (struct mpc_config_translation *)oemptr; - MP_translation_info(m); - oemptr += sizeof(*m); - count += sizeof(*m); - ++mpc_record; - break; - } - default: - { - printk(KERN_WARNING - "Unrecognised OEM table entry type! - %d\n", - (int)*oemptr); - return; - } - } - } -} - -static int __init numaq_setup_ioapic_ids(void) -{ - /* so can skip it */ - return 1; -} - -static int __init numaq_update_genapic(void) -{ - genapic->wakeup_cpu = wakeup_secondary_cpu_via_nmi; - - return 0; -} - -static struct x86_quirks numaq_x86_quirks __initdata = { - .arch_pre_time_init = numaq_pre_time_init, - .arch_time_init = NULL, - .arch_pre_intr_init = NULL, - .arch_memory_setup = NULL, - .arch_intr_init = NULL, - .arch_trap_init = NULL, - .mach_get_smp_config = NULL, - .mach_find_smp_config = NULL, - .mpc_record = &mpc_record, - .mpc_apic_id = mpc_apic_id, - .mpc_oem_bus_info = mpc_oem_bus_info, - .mpc_oem_pci_bus = mpc_oem_pci_bus, - .smp_read_mpc_oem = smp_read_mpc_oem, - .setup_ioapic_ids = numaq_setup_ioapic_ids, - .update_genapic = numaq_update_genapic, -}; - -void numaq_mps_oem_check(struct mpc_table *mpc, char *oem, char *productid) -{ - if (strncmp(oem, "IBM NUMA", 8)) - printk("Warning! Not a NUMA-Q system!\n"); - else - found_numaq = 1; -} - -static __init void early_check_numaq(void) -{ - /* - * Find possible boot-time SMP configuration: - */ - early_find_smp_config(); - /* - * get boot-time SMP configuration: - */ - if (smp_found_config) - early_get_smp_config(); - - if (found_numaq) - x86_quirks = &numaq_x86_quirks; -} - -int __init get_memcfg_numaq(void) -{ - early_check_numaq(); - if (!found_numaq) - return 0; - smp_dump_qct(); - return 1; -} Index: linux-2.6-tip/arch/x86/kernel/paravirt-spinlocks.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/paravirt-spinlocks.c +++ linux-2.6-tip/arch/x86/kernel/paravirt-spinlocks.c @@ -8,7 +8,7 @@ #include static inline void -default_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags) +default_spin_lock_flags(__raw_spinlock_t *lock, unsigned long flags) { __raw_spin_lock(lock); } @@ -26,13 +26,3 @@ struct pv_lock_ops pv_lock_ops = { }; EXPORT_SYMBOL(pv_lock_ops); -void __init paravirt_use_bytelocks(void) -{ -#ifdef CONFIG_SMP - pv_lock_ops.spin_is_locked = __byte_spin_is_locked; - pv_lock_ops.spin_is_contended = __byte_spin_is_contended; - pv_lock_ops.spin_lock = __byte_spin_lock; - pv_lock_ops.spin_trylock = __byte_spin_trylock; - pv_lock_ops.spin_unlock = __byte_spin_unlock; -#endif -} Index: linux-2.6-tip/arch/x86/kernel/paravirt.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/paravirt.c +++ linux-2.6-tip/arch/x86/kernel/paravirt.c @@ -28,7 +28,6 @@ #include #include #include -#include #include #include #include @@ -44,6 +43,17 @@ void _paravirt_nop(void) { } +/* identity function, which can be inlined */ +u32 _paravirt_ident_32(u32 x) +{ + return x; +} + +u64 _paravirt_ident_64(u64 x) +{ + return x; +} + static void __init default_banner(void) { printk(KERN_INFO "Booting paravirtualized kernel on %s\n", @@ -138,9 +148,16 @@ unsigned paravirt_patch_default(u8 type, if (opfunc == NULL) /* If there's no function, patch it with a ud2a (BUG) */ ret = paravirt_patch_insns(insnbuf, len, ud2a, ud2a+sizeof(ud2a)); - else if (opfunc == paravirt_nop) + else if (opfunc == _paravirt_nop) /* If the operation is a nop, then nop the callsite */ ret = paravirt_patch_nop(); + + /* identity functions just return their single argument */ + else if (opfunc == _paravirt_ident_32) + ret = paravirt_patch_ident_32(insnbuf, len); + else if (opfunc == _paravirt_ident_64) + ret = paravirt_patch_ident_64(insnbuf, len); + else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) || type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit) || type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret32) || @@ -318,10 +335,10 @@ struct pv_time_ops pv_time_ops = { struct pv_irq_ops pv_irq_ops = { .init_IRQ = native_init_IRQ, - .save_fl = native_save_fl, - .restore_fl = native_restore_fl, - .irq_disable = native_irq_disable, - .irq_enable = native_irq_enable, + .save_fl = __PV_IS_CALLEE_SAVE(native_save_fl), + .restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl), + .irq_disable = __PV_IS_CALLEE_SAVE(native_irq_disable), + .irq_enable = __PV_IS_CALLEE_SAVE(native_irq_enable), .safe_halt = native_safe_halt, .halt = native_halt, #ifdef CONFIG_X86_64 @@ -399,6 +416,28 @@ struct pv_apic_ops pv_apic_ops = { #endif }; +#if defined(CONFIG_X86_32) && !defined(CONFIG_X86_PAE) +/* 32-bit pagetable entries */ +#define PTE_IDENT __PV_IS_CALLEE_SAVE(_paravirt_ident_32) +#else +/* 64-bit pagetable entries */ +#define PTE_IDENT __PV_IS_CALLEE_SAVE(_paravirt_ident_64) +#endif + +#ifdef CONFIG_HIGHPTE +/* + * kmap_atomic() might be an inline or a macro: + */ +static void *kmap_atomic_func(struct page *page, enum km_type idx) +{ + return kmap_atomic(page, idx); +} +static void *kmap_atomic_direct_func(struct page *page, enum km_type idx) +{ + return kmap_atomic_direct(page, idx); +} +#endif + struct pv_mmu_ops pv_mmu_ops = { #ifndef CONFIG_X86_64 .pagetable_setup_start = native_pagetable_setup_start, @@ -439,7 +478,8 @@ struct pv_mmu_ops pv_mmu_ops = { .ptep_modify_prot_commit = __ptep_modify_prot_commit, #ifdef CONFIG_HIGHPTE - .kmap_atomic_pte = kmap_atomic, + .kmap_atomic_pte = kmap_atomic_func, + .kmap_atomic_pte_direct = kmap_atomic_direct_func, #endif #if PAGETABLE_LEVELS >= 3 @@ -450,22 +490,23 @@ struct pv_mmu_ops pv_mmu_ops = { .pmd_clear = native_pmd_clear, #endif .set_pud = native_set_pud, - .pmd_val = native_pmd_val, - .make_pmd = native_make_pmd, + + .pmd_val = PTE_IDENT, + .make_pmd = PTE_IDENT, #if PAGETABLE_LEVELS == 4 - .pud_val = native_pud_val, - .make_pud = native_make_pud, + .pud_val = PTE_IDENT, + .make_pud = PTE_IDENT, + .set_pgd = native_set_pgd, #endif #endif /* PAGETABLE_LEVELS >= 3 */ - .pte_val = native_pte_val, - .pte_flags = native_pte_flags, - .pgd_val = native_pgd_val, + .pte_val = PTE_IDENT, + .pgd_val = PTE_IDENT, - .make_pte = native_make_pte, - .make_pgd = native_make_pgd, + .make_pte = PTE_IDENT, + .make_pgd = PTE_IDENT, .dup_mmap = paravirt_nop, .exit_mmap = paravirt_nop, Index: linux-2.6-tip/arch/x86/kernel/paravirt_patch_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/paravirt_patch_32.c +++ linux-2.6-tip/arch/x86/kernel/paravirt_patch_32.c @@ -12,6 +12,18 @@ DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %c DEF_NATIVE(pv_cpu_ops, clts, "clts"); DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc"); +unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len) +{ + /* arg in %eax, return in %eax */ + return 0; +} + +unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len) +{ + /* arg in %edx:%eax, return in %edx:%eax */ + return 0; +} + unsigned native_patch(u8 type, u16 clobbers, void *ibuf, unsigned long addr, unsigned len) { Index: linux-2.6-tip/arch/x86/kernel/paravirt_patch_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/paravirt_patch_64.c +++ linux-2.6-tip/arch/x86/kernel/paravirt_patch_64.c @@ -19,6 +19,21 @@ DEF_NATIVE(pv_cpu_ops, usergs_sysret64, DEF_NATIVE(pv_cpu_ops, usergs_sysret32, "swapgs; sysretl"); DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs"); +DEF_NATIVE(, mov32, "mov %edi, %eax"); +DEF_NATIVE(, mov64, "mov %rdi, %rax"); + +unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len) +{ + return paravirt_patch_insns(insnbuf, len, + start__mov32, end__mov32); +} + +unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len) +{ + return paravirt_patch_insns(insnbuf, len, + start__mov64, end__mov64); +} + unsigned native_patch(u8 type, u16 clobbers, void *ibuf, unsigned long addr, unsigned len) { Index: linux-2.6-tip/arch/x86/kernel/pci-calgary_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/pci-calgary_64.c +++ linux-2.6-tip/arch/x86/kernel/pci-calgary_64.c @@ -380,8 +380,9 @@ static inline struct iommu_table *find_i return tbl; } -static void calgary_unmap_sg(struct device *dev, - struct scatterlist *sglist, int nelems, int direction) +static void calgary_unmap_sg(struct device *dev, struct scatterlist *sglist, + int nelems,enum dma_data_direction dir, + struct dma_attrs *attrs) { struct iommu_table *tbl = find_iommu_table(dev); struct scatterlist *s; @@ -404,7 +405,8 @@ static void calgary_unmap_sg(struct devi } static int calgary_map_sg(struct device *dev, struct scatterlist *sg, - int nelems, int direction) + int nelems, enum dma_data_direction dir, + struct dma_attrs *attrs) { struct iommu_table *tbl = find_iommu_table(dev); struct scatterlist *s; @@ -429,15 +431,14 @@ static int calgary_map_sg(struct device s->dma_address = (entry << PAGE_SHIFT) | s->offset; /* insert into HW table */ - tce_build(tbl, entry, npages, vaddr & PAGE_MASK, - direction); + tce_build(tbl, entry, npages, vaddr & PAGE_MASK, dir); s->dma_length = s->length; } return nelems; error: - calgary_unmap_sg(dev, sg, nelems, direction); + calgary_unmap_sg(dev, sg, nelems, dir, NULL); for_each_sg(sg, s, nelems, i) { sg->dma_address = bad_dma_address; sg->dma_length = 0; @@ -445,10 +446,12 @@ error: return 0; } -static dma_addr_t calgary_map_single(struct device *dev, phys_addr_t paddr, - size_t size, int direction) +static dma_addr_t calgary_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction dir, + struct dma_attrs *attrs) { - void *vaddr = phys_to_virt(paddr); + void *vaddr = page_address(page) + offset; unsigned long uaddr; unsigned int npages; struct iommu_table *tbl = find_iommu_table(dev); @@ -456,17 +459,18 @@ static dma_addr_t calgary_map_single(str uaddr = (unsigned long)vaddr; npages = iommu_num_pages(uaddr, size, PAGE_SIZE); - return iommu_alloc(dev, tbl, vaddr, npages, direction); + return iommu_alloc(dev, tbl, vaddr, npages, dir); } -static void calgary_unmap_single(struct device *dev, dma_addr_t dma_handle, - size_t size, int direction) +static void calgary_unmap_page(struct device *dev, dma_addr_t dma_addr, + size_t size, enum dma_data_direction dir, + struct dma_attrs *attrs) { struct iommu_table *tbl = find_iommu_table(dev); unsigned int npages; - npages = iommu_num_pages(dma_handle, size, PAGE_SIZE); - iommu_free(tbl, dma_handle, npages); + npages = iommu_num_pages(dma_addr, size, PAGE_SIZE); + iommu_free(tbl, dma_addr, npages); } static void* calgary_alloc_coherent(struct device *dev, size_t size, @@ -515,13 +519,13 @@ static void calgary_free_coherent(struct free_pages((unsigned long)vaddr, get_order(size)); } -static struct dma_mapping_ops calgary_dma_ops = { +static struct dma_map_ops calgary_dma_ops = { .alloc_coherent = calgary_alloc_coherent, .free_coherent = calgary_free_coherent, - .map_single = calgary_map_single, - .unmap_single = calgary_unmap_single, .map_sg = calgary_map_sg, .unmap_sg = calgary_unmap_sg, + .map_page = calgary_map_page, + .unmap_page = calgary_unmap_page, }; static inline void __iomem * busno_to_bbar(unsigned char num) Index: linux-2.6-tip/arch/x86/kernel/pci-dma.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/pci-dma.c +++ linux-2.6-tip/arch/x86/kernel/pci-dma.c @@ -12,7 +12,7 @@ static int forbid_dac __read_mostly; -struct dma_mapping_ops *dma_ops; +struct dma_map_ops *dma_ops; EXPORT_SYMBOL(dma_ops); static int iommu_sac_force __read_mostly; @@ -224,7 +224,7 @@ early_param("iommu", iommu_setup); int dma_supported(struct device *dev, u64 mask) { - struct dma_mapping_ops *ops = get_dma_ops(dev); + struct dma_map_ops *ops = get_dma_ops(dev); #ifdef CONFIG_PCI if (mask > 0xffffffff && forbid_dac > 0) { Index: linux-2.6-tip/arch/x86/kernel/pci-gart_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/pci-gart_64.c +++ linux-2.6-tip/arch/x86/kernel/pci-gart_64.c @@ -255,10 +255,13 @@ static dma_addr_t dma_map_area(struct de } /* Map a single area into the IOMMU */ -static dma_addr_t -gart_map_single(struct device *dev, phys_addr_t paddr, size_t size, int dir) +static dma_addr_t gart_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction dir, + struct dma_attrs *attrs) { unsigned long bus; + phys_addr_t paddr = page_to_phys(page) + offset; if (!dev) dev = &x86_dma_fallback_dev; @@ -275,8 +278,9 @@ gart_map_single(struct device *dev, phys /* * Free a DMA mapping. */ -static void gart_unmap_single(struct device *dev, dma_addr_t dma_addr, - size_t size, int direction) +static void gart_unmap_page(struct device *dev, dma_addr_t dma_addr, + size_t size, enum dma_data_direction dir, + struct dma_attrs *attrs) { unsigned long iommu_page; int npages; @@ -298,8 +302,8 @@ static void gart_unmap_single(struct dev /* * Wrapper for pci_unmap_single working with scatterlists. */ -static void -gart_unmap_sg(struct device *dev, struct scatterlist *sg, int nents, int dir) +static void gart_unmap_sg(struct device *dev, struct scatterlist *sg, int nents, + enum dma_data_direction dir, struct dma_attrs *attrs) { struct scatterlist *s; int i; @@ -307,7 +311,7 @@ gart_unmap_sg(struct device *dev, struct for_each_sg(sg, s, nents, i) { if (!s->dma_length || !s->length) break; - gart_unmap_single(dev, s->dma_address, s->dma_length, dir); + gart_unmap_page(dev, s->dma_address, s->dma_length, dir, NULL); } } @@ -329,7 +333,7 @@ static int dma_map_sg_nonforce(struct de addr = dma_map_area(dev, addr, s->length, dir, 0); if (addr == bad_dma_address) { if (i > 0) - gart_unmap_sg(dev, sg, i, dir); + gart_unmap_sg(dev, sg, i, dir, NULL); nents = 0; sg[0].dma_length = 0; break; @@ -400,8 +404,8 @@ dma_map_cont(struct device *dev, struct * DMA map all entries in a scatterlist. * Merge chunks that have page aligned sizes into a continuous mapping. */ -static int -gart_map_sg(struct device *dev, struct scatterlist *sg, int nents, int dir) +static int gart_map_sg(struct device *dev, struct scatterlist *sg, int nents, + enum dma_data_direction dir, struct dma_attrs *attrs) { struct scatterlist *s, *ps, *start_sg, *sgmap; int need = 0, nextneed, i, out, start; @@ -468,7 +472,7 @@ gart_map_sg(struct device *dev, struct s error: flush_gart(); - gart_unmap_sg(dev, sg, out, dir); + gart_unmap_sg(dev, sg, out, dir, NULL); /* When it was forced or merged try again in a dumb way */ if (force_iommu || iommu_merge) { @@ -521,7 +525,7 @@ static void gart_free_coherent(struct device *dev, size_t size, void *vaddr, dma_addr_t dma_addr) { - gart_unmap_single(dev, dma_addr, size, DMA_BIDIRECTIONAL); + gart_unmap_page(dev, dma_addr, size, DMA_BIDIRECTIONAL, NULL); free_pages((unsigned long)vaddr, get_order(size)); } @@ -707,11 +711,11 @@ static __init int init_k8_gatt(struct ag return -1; } -static struct dma_mapping_ops gart_dma_ops = { - .map_single = gart_map_single, - .unmap_single = gart_unmap_single, +static struct dma_map_ops gart_dma_ops = { .map_sg = gart_map_sg, .unmap_sg = gart_unmap_sg, + .map_page = gart_map_page, + .unmap_page = gart_unmap_page, .alloc_coherent = gart_alloc_coherent, .free_coherent = gart_free_coherent, }; Index: linux-2.6-tip/arch/x86/kernel/pci-nommu.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/pci-nommu.c +++ linux-2.6-tip/arch/x86/kernel/pci-nommu.c @@ -25,19 +25,19 @@ check_addr(char *name, struct device *hw return 1; } -static dma_addr_t -nommu_map_single(struct device *hwdev, phys_addr_t paddr, size_t size, - int direction) +static dma_addr_t nommu_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction dir, + struct dma_attrs *attrs) { - dma_addr_t bus = paddr; + dma_addr_t bus = page_to_phys(page) + offset; WARN_ON(size == 0); - if (!check_addr("map_single", hwdev, bus, size)) - return bad_dma_address; + if (!check_addr("map_single", dev, bus, size)) + return bad_dma_address; flush_write_buffers(); return bus; } - /* Map a set of buffers described by scatterlist in streaming * mode for DMA. This is the scatter-gather version of the * above pci_map_single interface. Here the scatter gather list @@ -54,7 +54,8 @@ nommu_map_single(struct device *hwdev, p * the same here. */ static int nommu_map_sg(struct device *hwdev, struct scatterlist *sg, - int nents, int direction) + int nents, enum dma_data_direction dir, + struct dma_attrs *attrs) { struct scatterlist *s; int i; @@ -78,11 +79,11 @@ static void nommu_free_coherent(struct d free_pages((unsigned long)vaddr, get_order(size)); } -struct dma_mapping_ops nommu_dma_ops = { +struct dma_map_ops nommu_dma_ops = { .alloc_coherent = dma_generic_alloc_coherent, .free_coherent = nommu_free_coherent, - .map_single = nommu_map_single, .map_sg = nommu_map_sg, + .map_page = nommu_map_page, .is_phys = 1, }; Index: linux-2.6-tip/arch/x86/kernel/pci-swiotlb.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/kernel/pci-swiotlb.c @@ -0,0 +1,84 @@ +/* Glue code to lib/swiotlb.c */ + +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +int swiotlb __read_mostly; + +void * __init swiotlb_alloc_boot(size_t size, unsigned long nslabs) +{ + return alloc_bootmem_low_pages(size); +} + +void *swiotlb_alloc(unsigned order, unsigned long nslabs) +{ + return (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN, order); +} + +dma_addr_t swiotlb_phys_to_bus(struct device *hwdev, phys_addr_t paddr) +{ + return paddr; +} + +phys_addr_t swiotlb_bus_to_phys(dma_addr_t baddr) +{ + return baddr; +} + +int __weak swiotlb_arch_range_needs_mapping(phys_addr_t paddr, size_t size) +{ + return 0; +} + +static void *x86_swiotlb_alloc_coherent(struct device *hwdev, size_t size, + dma_addr_t *dma_handle, gfp_t flags) +{ + void *vaddr; + + vaddr = dma_generic_alloc_coherent(hwdev, size, dma_handle, flags); + if (vaddr) + return vaddr; + + return swiotlb_alloc_coherent(hwdev, size, dma_handle, flags); +} + +struct dma_map_ops swiotlb_dma_ops = { + .mapping_error = swiotlb_dma_mapping_error, + .alloc_coherent = x86_swiotlb_alloc_coherent, + .free_coherent = swiotlb_free_coherent, + .sync_single_for_cpu = swiotlb_sync_single_for_cpu, + .sync_single_for_device = swiotlb_sync_single_for_device, + .sync_single_range_for_cpu = swiotlb_sync_single_range_for_cpu, + .sync_single_range_for_device = swiotlb_sync_single_range_for_device, + .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu, + .sync_sg_for_device = swiotlb_sync_sg_for_device, + .map_sg = swiotlb_map_sg_attrs, + .unmap_sg = swiotlb_unmap_sg_attrs, + .map_page = swiotlb_map_page, + .unmap_page = swiotlb_unmap_page, + .dma_supported = NULL, +}; + +void __init pci_swiotlb_init(void) +{ + /* don't initialize swiotlb if iommu=off (no_iommu=1) */ +#ifdef CONFIG_X86_64 + if (!iommu_detected && !no_iommu && max_pfn > MAX_DMA32_PFN) + swiotlb = 1; +#endif + if (swiotlb_force) + swiotlb = 1; + if (swiotlb) { + printk(KERN_INFO "PCI-DMA: Using software bounce buffering for IO (SWIOTLB)\n"); + swiotlb_init(); + dma_ops = &swiotlb_dma_ops; + } +} Index: linux-2.6-tip/arch/x86/kernel/pci-swiotlb_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/pci-swiotlb_64.c +++ /dev/null @@ -1,91 +0,0 @@ -/* Glue code to lib/swiotlb.c */ - -#include -#include -#include -#include -#include -#include - -#include -#include -#include - -int swiotlb __read_mostly; - -void * __init swiotlb_alloc_boot(size_t size, unsigned long nslabs) -{ - return alloc_bootmem_low_pages(size); -} - -void *swiotlb_alloc(unsigned order, unsigned long nslabs) -{ - return (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN, order); -} - -dma_addr_t swiotlb_phys_to_bus(struct device *hwdev, phys_addr_t paddr) -{ - return paddr; -} - -phys_addr_t swiotlb_bus_to_phys(dma_addr_t baddr) -{ - return baddr; -} - -int __weak swiotlb_arch_range_needs_mapping(void *ptr, size_t size) -{ - return 0; -} - -static dma_addr_t -swiotlb_map_single_phys(struct device *hwdev, phys_addr_t paddr, size_t size, - int direction) -{ - return swiotlb_map_single(hwdev, phys_to_virt(paddr), size, direction); -} - -static void *x86_swiotlb_alloc_coherent(struct device *hwdev, size_t size, - dma_addr_t *dma_handle, gfp_t flags) -{ - void *vaddr; - - vaddr = dma_generic_alloc_coherent(hwdev, size, dma_handle, flags); - if (vaddr) - return vaddr; - - return swiotlb_alloc_coherent(hwdev, size, dma_handle, flags); -} - -struct dma_mapping_ops swiotlb_dma_ops = { - .mapping_error = swiotlb_dma_mapping_error, - .alloc_coherent = x86_swiotlb_alloc_coherent, - .free_coherent = swiotlb_free_coherent, - .map_single = swiotlb_map_single_phys, - .unmap_single = swiotlb_unmap_single, - .sync_single_for_cpu = swiotlb_sync_single_for_cpu, - .sync_single_for_device = swiotlb_sync_single_for_device, - .sync_single_range_for_cpu = swiotlb_sync_single_range_for_cpu, - .sync_single_range_for_device = swiotlb_sync_single_range_for_device, - .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu, - .sync_sg_for_device = swiotlb_sync_sg_for_device, - .map_sg = swiotlb_map_sg, - .unmap_sg = swiotlb_unmap_sg, - .dma_supported = NULL, -}; - -void __init pci_swiotlb_init(void) -{ - /* don't initialize swiotlb if iommu=off (no_iommu=1) */ -#ifdef CONFIG_X86_64 - if (!iommu_detected && !no_iommu && max_pfn > MAX_DMA32_PFN) - swiotlb = 1; -#endif - if (swiotlb_force) - swiotlb = 1; - if (swiotlb) { - printk(KERN_INFO "PCI-DMA: Using software bounce buffering for IO (SWIOTLB)\n"); - swiotlb_init(); - dma_ops = &swiotlb_dma_ops; - } -} Index: linux-2.6-tip/arch/x86/kernel/probe_roms_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/probe_roms_32.c +++ linux-2.6-tip/arch/x86/kernel/probe_roms_32.c @@ -18,7 +18,7 @@ #include #include #include -#include +#include static struct resource system_rom_resource = { .name = "System ROM", Index: linux-2.6-tip/arch/x86/kernel/process.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/process.c +++ linux-2.6-tip/arch/x86/kernel/process.c @@ -8,7 +8,7 @@ #include #include #include -#include +#include #include #include @@ -19,6 +19,9 @@ EXPORT_SYMBOL(idle_nomwait); struct kmem_cache *task_xstate_cachep; +DEFINE_TRACE(power_start); +DEFINE_TRACE(power_end); + int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src) { *dst = *src; @@ -52,7 +55,7 @@ void arch_task_cache_init(void) task_xstate_cachep = kmem_cache_create("task_xstate", xstate_size, __alignof__(union thread_xstate), - SLAB_PANIC, NULL); + SLAB_PANIC | SLAB_NOTRACK, NULL); } /* @@ -350,7 +353,7 @@ static void c1e_idle(void) void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c) { -#ifdef CONFIG_X86_SMP +#ifdef CONFIG_SMP if (pm_idle == poll_idle && smp_num_siblings > 1) { printk(KERN_WARNING "WARNING: polling idle and HT enabled," " performance may degrade.\n"); Index: linux-2.6-tip/arch/x86/kernel/process_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/process_32.c +++ linux-2.6-tip/arch/x86/kernel/process_32.c @@ -9,8 +9,7 @@ * This file handles the architecture-dependent parts of process handling.. */ -#include - +#include #include #include #include @@ -66,9 +65,6 @@ asmlinkage void ret_from_fork(void) __as DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task; EXPORT_PER_CPU_SYMBOL(current_task); -DEFINE_PER_CPU(int, cpu_number); -EXPORT_PER_CPU_SYMBOL(cpu_number); - /* * Return saved PC of a blocked thread. */ @@ -94,6 +90,15 @@ void cpu_idle(void) { int cpu = smp_processor_id(); + /* + * If we're the non-boot CPU, nothing set the stack canary up + * for us. CPU0 already has it initialized but no harm in + * doing it again. This is a good place for updating it, as + * we wont ever return from this function (so the invalid + * canaries already on the stack wont ever trigger). + */ + boot_init_stack_canary(); + current_thread_info()->status |= TS_POLLING; /* endless idle loop with no priority at all */ @@ -101,23 +106,23 @@ void cpu_idle(void) tick_nohz_stop_sched_tick(1); while (!need_resched()) { - check_pgt_cache(); rmb(); if (cpu_is_offline(cpu)) play_dead(); local_irq_disable(); - __get_cpu_var(irq_stat).idle_timestamp = jiffies; /* Don't trace irqs off for idle */ stop_critical_timings(); pm_idle(); start_critical_timings(); } + local_irq_disable(); tick_nohz_restart_sched_tick(); - preempt_enable_no_resched(); - schedule(); + __preempt_enable_no_resched(); + __schedule(); preempt_disable(); + local_irq_enable(); } } @@ -132,7 +137,7 @@ void __show_regs(struct pt_regs *regs, i if (user_mode_vm(regs)) { sp = regs->sp; ss = regs->ss & 0xffff; - savesegment(gs, gs); + gs = get_user_gs(regs); } else { sp = (unsigned long) (®s->sp); savesegment(ss, ss); @@ -159,8 +164,10 @@ void __show_regs(struct pt_regs *regs, i regs->ax, regs->bx, regs->cx, regs->dx); printk("ESI: %08lx EDI: %08lx EBP: %08lx ESP: %08lx\n", regs->si, regs->di, regs->bp, sp); - printk(" DS: %04x ES: %04x FS: %04x GS: %04x SS: %04x\n", - (u16)regs->ds, (u16)regs->es, (u16)regs->fs, gs, ss); + printk(" DS: %04x ES: %04x FS: %04x GS: %04x SS: %04x" + " preempt:%08x\n", + (u16)regs->ds, (u16)regs->es, (u16)regs->fs, gs, ss, + preempt_count()); if (!all) return; @@ -213,6 +220,7 @@ int kernel_thread(int (*fn)(void *), voi regs.ds = __USER_DS; regs.es = __USER_DS; regs.fs = __KERNEL_PERCPU; + regs.gs = __KERNEL_STACK_CANARY; regs.orig_ax = -1; regs.ip = (unsigned long) kernel_thread_helper; regs.cs = __KERNEL_CS | get_kernel_rpl(); @@ -232,15 +240,23 @@ void exit_thread(void) if (unlikely(test_thread_flag(TIF_IO_BITMAP))) { struct task_struct *tsk = current; struct thread_struct *t = &tsk->thread; - int cpu = get_cpu(); - struct tss_struct *tss = &per_cpu(init_tss, cpu); + void *io_bitmap_ptr = t->io_bitmap_ptr; + int cpu; + struct tss_struct *tss; - kfree(t->io_bitmap_ptr); + /* + * On PREEMPT_RT we must not call kfree() with + * preemption disabled, so we first zap the pointer: + */ t->io_bitmap_ptr = NULL; + kfree(io_bitmap_ptr); + clear_thread_flag(TIF_IO_BITMAP); /* * Careful, clear this in the TSS too: */ + cpu = get_cpu(); + tss = &per_cpu(init_tss, cpu); memset(tss->io_bitmap, 0xff, tss->io_bitmap_max); t->io_bitmap_max = 0; tss->io_bitmap_owner = NULL; @@ -305,7 +321,7 @@ int copy_thread(int nr, unsigned long cl p->thread.ip = (unsigned long) ret_from_fork; - savesegment(gs, p->thread.gs); + task_user_gs(p) = get_user_gs(regs); tsk = current; if (unlikely(test_tsk_thread_flag(tsk, TIF_IO_BITMAP))) { @@ -343,7 +359,7 @@ int copy_thread(int nr, unsigned long cl void start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp) { - __asm__("movl %0, %%gs" : : "r"(0)); + set_user_gs(regs, 0); regs->fs = 0; set_fs(USER_DS); regs->ds = __USER_DS; @@ -540,7 +556,7 @@ __switch_to(struct task_struct *prev_p, * used %fs or %gs (it does not today), or if the kernel is * running inside of a hypervisor layer. */ - savesegment(gs, prev->gs); + lazy_save_gs(prev->gs); /* * Load the per-thread Thread-Local Storage descriptor. @@ -586,31 +602,31 @@ __switch_to(struct task_struct *prev_p, * Restore %gs if needed (which is common) */ if (prev->gs | next->gs) - loadsegment(gs, next->gs); + lazy_load_gs(next->gs); - x86_write_percpu(current_task, next_p); + percpu_write(current_task, next_p); return prev_p; } -asmlinkage int sys_fork(struct pt_regs regs) +int sys_fork(struct pt_regs *regs) { - return do_fork(SIGCHLD, regs.sp, ®s, 0, NULL, NULL); + return do_fork(SIGCHLD, regs->sp, regs, 0, NULL, NULL); } -asmlinkage int sys_clone(struct pt_regs regs) +int sys_clone(struct pt_regs *regs) { unsigned long clone_flags; unsigned long newsp; int __user *parent_tidptr, *child_tidptr; - clone_flags = regs.bx; - newsp = regs.cx; - parent_tidptr = (int __user *)regs.dx; - child_tidptr = (int __user *)regs.di; + clone_flags = regs->bx; + newsp = regs->cx; + parent_tidptr = (int __user *)regs->dx; + child_tidptr = (int __user *)regs->di; if (!newsp) - newsp = regs.sp; - return do_fork(clone_flags, newsp, ®s, 0, parent_tidptr, child_tidptr); + newsp = regs->sp; + return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr); } /* @@ -623,27 +639,27 @@ asmlinkage int sys_clone(struct pt_regs * do not have enough call-clobbered registers to hold all * the information you need. */ -asmlinkage int sys_vfork(struct pt_regs regs) +int sys_vfork(struct pt_regs *regs) { - return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.sp, ®s, 0, NULL, NULL); + return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs->sp, regs, 0, NULL, NULL); } /* * sys_execve() executes a new program. */ -asmlinkage int sys_execve(struct pt_regs regs) +int sys_execve(struct pt_regs *regs) { int error; char *filename; - filename = getname((char __user *) regs.bx); + filename = getname((char __user *) regs->bx); error = PTR_ERR(filename); if (IS_ERR(filename)) goto out; error = do_execve(filename, - (char __user * __user *) regs.cx, - (char __user * __user *) regs.dx, - ®s); + (char __user * __user *) regs->cx, + (char __user * __user *) regs->dx, + regs); if (error == 0) { /* Make sure we don't return using sysenter.. */ set_thread_flag(TIF_IRET); Index: linux-2.6-tip/arch/x86/kernel/process_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/process_64.c +++ linux-2.6-tip/arch/x86/kernel/process_64.c @@ -16,6 +16,7 @@ #include +#include #include #include #include @@ -47,7 +48,6 @@ #include #include #include -#include #include #include #include @@ -58,6 +58,12 @@ asmlinkage extern void ret_from_fork(void); +DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task; +EXPORT_PER_CPU_SYMBOL(current_task); + +DEFINE_PER_CPU(unsigned long, old_rsp); +static DEFINE_PER_CPU(unsigned char, is_idle); + unsigned long kernel_thread_flags = CLONE_VM | CLONE_UNTRACED; static ATOMIC_NOTIFIER_HEAD(idle_notifier); @@ -76,13 +82,13 @@ EXPORT_SYMBOL_GPL(idle_notifier_unregist void enter_idle(void) { - write_pda(isidle, 1); + percpu_write(is_idle, 1); atomic_notifier_call_chain(&idle_notifier, IDLE_START, NULL); } static void __exit_idle(void) { - if (test_and_clear_bit_pda(0, isidle) == 0) + if (x86_test_and_clear_bit_percpu(0, is_idle) == 0) return; atomic_notifier_call_chain(&idle_notifier, IDLE_END, NULL); } @@ -112,6 +118,16 @@ static inline void play_dead(void) void cpu_idle(void) { current_thread_info()->status |= TS_POLLING; + + /* + * If we're the non-boot CPU, nothing set the stack canary up + * for us. CPU0 already has it initialized but no harm in + * doing it again. This is a good place for updating it, as + * we wont ever return from this function (so the invalid + * canaries already on the stack wont ever trigger). + */ + boot_init_stack_canary(); + /* endless idle loop with no priority at all */ while (1) { tick_nohz_stop_sched_tick(1); @@ -139,9 +155,11 @@ void cpu_idle(void) } tick_nohz_restart_sched_tick(); - preempt_enable_no_resched(); - schedule(); + local_irq_disable(); + __preempt_enable_no_resched(); + __schedule(); preempt_disable(); + local_irq_enable(); } } @@ -230,7 +248,7 @@ void exit_thread(void) struct thread_struct *t = &me->thread; if (me->thread.io_bitmap_ptr) { - struct tss_struct *tss = &per_cpu(init_tss, get_cpu()); + struct tss_struct *tss; kfree(t->io_bitmap_ptr); t->io_bitmap_ptr = NULL; @@ -238,6 +256,7 @@ void exit_thread(void) /* * Careful, clear this in the TSS too: */ + tss = &per_cpu(init_tss, get_cpu()); memset(tss->io_bitmap, 0xff, t->io_bitmap_max); t->io_bitmap_max = 0; put_cpu(); @@ -397,7 +416,7 @@ start_thread(struct pt_regs *regs, unsig load_gs_index(0); regs->ip = new_ip; regs->sp = new_sp; - write_pda(oldrsp, new_sp); + percpu_write(old_rsp, new_sp); regs->cs = __USER_CS; regs->ss = __USER_DS; regs->flags = 0x200; @@ -618,21 +637,13 @@ __switch_to(struct task_struct *prev_p, /* * Switch the PDA and FPU contexts. */ - prev->usersp = read_pda(oldrsp); - write_pda(oldrsp, next->usersp); - write_pda(pcurrent, next_p); + prev->usersp = percpu_read(old_rsp); + percpu_write(old_rsp, next->usersp); + percpu_write(current_task, next_p); - write_pda(kernelstack, + percpu_write(kernel_stack, (unsigned long)task_stack_page(next_p) + - THREAD_SIZE - PDA_STACKOFFSET); -#ifdef CONFIG_CC_STACKPROTECTOR - write_pda(stack_canary, next_p->stack_canary); - /* - * Build time only check to make sure the stack_canary is at - * offset 40 in the pda; this is a gcc ABI requirement - */ - BUILD_BUG_ON(offsetof(struct x8664_pda, stack_canary) != 40); -#endif + THREAD_SIZE - KERNEL_STACK_OFFSET); /* * Now maybe reload the debug registers and handle I/O bitmaps Index: linux-2.6-tip/arch/x86/kernel/ptrace.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/ptrace.c +++ linux-2.6-tip/arch/x86/kernel/ptrace.c @@ -75,10 +75,7 @@ static inline bool invalid_selector(u16 static unsigned long *pt_regs_access(struct pt_regs *regs, unsigned long regno) { BUILD_BUG_ON(offsetof(struct pt_regs, bx) != 0); - regno >>= 2; - if (regno > FS) - --regno; - return ®s->bx + regno; + return ®s->bx + (regno >> 2); } static u16 get_segment_reg(struct task_struct *task, unsigned long offset) @@ -90,9 +87,10 @@ static u16 get_segment_reg(struct task_s if (offset != offsetof(struct user_regs_struct, gs)) retval = *pt_regs_access(task_pt_regs(task), offset); else { - retval = task->thread.gs; if (task == current) - savesegment(gs, retval); + retval = get_user_gs(task_pt_regs(task)); + else + retval = task_user_gs(task); } return retval; } @@ -126,13 +124,10 @@ static int set_segment_reg(struct task_s break; case offsetof(struct user_regs_struct, gs): - task->thread.gs = value; if (task == current) - /* - * The user-mode %gs is not affected by - * kernel entry, so we must update the CPU. - */ - loadsegment(gs, value); + set_user_gs(task_pt_regs(task), value); + else + task_user_gs(task) = value; } return 0; @@ -273,7 +268,7 @@ static unsigned long debugreg_addr_limit if (test_tsk_thread_flag(task, TIF_IA32)) return IA32_PAGE_OFFSET - 3; #endif - return TASK_SIZE64 - 7; + return TASK_SIZE_MAX - 7; } #endif /* CONFIG_X86_32 */ Index: linux-2.6-tip/arch/x86/kernel/quirks.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/quirks.c +++ linux-2.6-tip/arch/x86/kernel/quirks.c @@ -172,7 +172,8 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_I ich_force_enable_hpet); DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH9_7, ich_force_enable_hpet); - +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x3a16, /* ICH10 */ + ich_force_enable_hpet); static struct pci_dev *cached_dev; Index: linux-2.6-tip/arch/x86/kernel/reboot.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/reboot.c +++ linux-2.6-tip/arch/x86/kernel/reboot.c @@ -14,6 +14,7 @@ #include #include #include +#include #ifdef CONFIG_X86_32 # include @@ -23,8 +24,6 @@ # include #endif -#include - /* * Power off function, if any */ @@ -650,7 +649,7 @@ static int crash_nmi_callback(struct not static void smp_send_nmi_allbutself(void) { - send_IPI_allbutself(NMI_VECTOR); + apic->send_IPI_allbutself(NMI_VECTOR); } static struct notifier_block crash_nmi_nb = { Index: linux-2.6-tip/arch/x86/kernel/relocate_kernel_32.S =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/relocate_kernel_32.S +++ linux-2.6-tip/arch/x86/kernel/relocate_kernel_32.S @@ -7,7 +7,7 @@ */ #include -#include +#include #include #include Index: linux-2.6-tip/arch/x86/kernel/relocate_kernel_64.S =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/relocate_kernel_64.S +++ linux-2.6-tip/arch/x86/kernel/relocate_kernel_64.S @@ -7,10 +7,10 @@ */ #include -#include +#include #include #include -#include +#include /* * Must be relocatable PIC code callable as a C function @@ -29,122 +29,6 @@ relocate_kernel: * %rdx start address */ - /* map the control page at its virtual address */ - - movq $0x0000ff8000000000, %r10 /* mask */ - mov $(39 - 3), %cl /* bits to shift */ - movq PTR(VA_CONTROL_PAGE)(%rsi), %r11 /* address to map */ - - movq %r11, %r9 - andq %r10, %r9 - shrq %cl, %r9 - - movq PTR(VA_PGD)(%rsi), %r8 - addq %r8, %r9 - movq PTR(PA_PUD_0)(%rsi), %r8 - orq $PAGE_ATTR, %r8 - movq %r8, (%r9) - - shrq $9, %r10 - sub $9, %cl - - movq %r11, %r9 - andq %r10, %r9 - shrq %cl, %r9 - - movq PTR(VA_PUD_0)(%rsi), %r8 - addq %r8, %r9 - movq PTR(PA_PMD_0)(%rsi), %r8 - orq $PAGE_ATTR, %r8 - movq %r8, (%r9) - - shrq $9, %r10 - sub $9, %cl - - movq %r11, %r9 - andq %r10, %r9 - shrq %cl, %r9 - - movq PTR(VA_PMD_0)(%rsi), %r8 - addq %r8, %r9 - movq PTR(PA_PTE_0)(%rsi), %r8 - orq $PAGE_ATTR, %r8 - movq %r8, (%r9) - - shrq $9, %r10 - sub $9, %cl - - movq %r11, %r9 - andq %r10, %r9 - shrq %cl, %r9 - - movq PTR(VA_PTE_0)(%rsi), %r8 - addq %r8, %r9 - movq PTR(PA_CONTROL_PAGE)(%rsi), %r8 - orq $PAGE_ATTR, %r8 - movq %r8, (%r9) - - /* identity map the control page at its physical address */ - - movq $0x0000ff8000000000, %r10 /* mask */ - mov $(39 - 3), %cl /* bits to shift */ - movq PTR(PA_CONTROL_PAGE)(%rsi), %r11 /* address to map */ - - movq %r11, %r9 - andq %r10, %r9 - shrq %cl, %r9 - - movq PTR(VA_PGD)(%rsi), %r8 - addq %r8, %r9 - movq PTR(PA_PUD_1)(%rsi), %r8 - orq $PAGE_ATTR, %r8 - movq %r8, (%r9) - - shrq $9, %r10 - sub $9, %cl - - movq %r11, %r9 - andq %r10, %r9 - shrq %cl, %r9 - - movq PTR(VA_PUD_1)(%rsi), %r8 - addq %r8, %r9 - movq PTR(PA_PMD_1)(%rsi), %r8 - orq $PAGE_ATTR, %r8 - movq %r8, (%r9) - - shrq $9, %r10 - sub $9, %cl - - movq %r11, %r9 - andq %r10, %r9 - shrq %cl, %r9 - - movq PTR(VA_PMD_1)(%rsi), %r8 - addq %r8, %r9 - movq PTR(PA_PTE_1)(%rsi), %r8 - orq $PAGE_ATTR, %r8 - movq %r8, (%r9) - - shrq $9, %r10 - sub $9, %cl - - movq %r11, %r9 - andq %r10, %r9 - shrq %cl, %r9 - - movq PTR(VA_PTE_1)(%rsi), %r8 - addq %r8, %r9 - movq PTR(PA_CONTROL_PAGE)(%rsi), %r8 - orq $PAGE_ATTR, %r8 - movq %r8, (%r9) - -relocate_new_kernel: - /* %rdi indirection_page - * %rsi page_list - * %rdx start address - */ - /* zero out flags, and disable interrupts */ pushq $0 popfq @@ -156,9 +40,8 @@ relocate_new_kernel: /* get physical address of page table now too */ movq PTR(PA_TABLE_PAGE)(%rsi), %rcx - /* switch to new set of page tables */ - movq PTR(PA_PGD)(%rsi), %r9 - movq %r9, %cr3 + /* Switch to the identity mapped page tables */ + movq %rcx, %cr3 /* setup a new stack at the end of the physical control page */ lea PAGE_SIZE(%r8), %rsp @@ -194,9 +77,7 @@ identity_mapped: jmp 1f 1: - /* Switch to the identity mapped page tables, - * and flush the TLB. - */ + /* Flush the TLB (needed?) */ movq %rcx, %cr3 /* Do the copies */ Index: linux-2.6-tip/arch/x86/kernel/scx200_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/scx200_32.c +++ linux-2.6-tip/arch/x86/kernel/scx200_32.c @@ -78,8 +78,10 @@ static int __devinit scx200_probe(struct if (scx200_cb_probe(SCx200_CB_BASE_FIXED)) { scx200_cb_base = SCx200_CB_BASE_FIXED; } else { - pci_read_config_dword(pdev, SCx200_CBA_SCRATCH, &base); - if (scx200_cb_probe(base)) { + int err; + + err = pci_read_config_dword(pdev, SCx200_CBA_SCRATCH, &base); + if (!err && scx200_cb_probe(base)) { scx200_cb_base = base; } else { printk(KERN_WARNING NAME ": Configuration Block not found\n"); Index: linux-2.6-tip/arch/x86/kernel/setup.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/setup.c +++ linux-2.6-tip/arch/x86/kernel/setup.c @@ -74,14 +74,15 @@ #include #include #include -#include #include +#include +#include #include #include #include #include #include -#include +#include #include #include #include @@ -89,7 +90,7 @@ #include #include -#include +#include #include #include #include @@ -97,7 +98,6 @@ #include #include -#include #include #include @@ -112,6 +112,20 @@ #define ARCH_SETUP #endif +unsigned int boot_cpu_id __read_mostly; + +#ifdef CONFIG_X86_64 +int default_cpu_present_to_apicid(int mps_cpu) +{ + return __default_cpu_present_to_apicid(mps_cpu); +} + +int default_check_phys_apicid_present(int boot_cpu_physical_apicid) +{ + return __default_check_phys_apicid_present(boot_cpu_physical_apicid); +} +#endif + #ifndef CONFIG_DEBUG_BOOT_PARAMS struct boot_params __initdata boot_params; #else @@ -586,19 +600,18 @@ static int __init setup_elfcorehdr(char early_param("elfcorehdr", setup_elfcorehdr); #endif -static int __init default_update_genapic(void) +static int __init default_update_apic(void) { -#ifdef CONFIG_X86_SMP -# if defined(CONFIG_X86_GENERICARCH) || defined(CONFIG_X86_64) - genapic->wakeup_cpu = wakeup_secondary_cpu_via_init; -# endif +#ifdef CONFIG_SMP + if (!apic->wakeup_cpu) + apic->wakeup_cpu = wakeup_secondary_cpu_via_init; #endif return 0; } static struct x86_quirks default_x86_quirks __initdata = { - .update_genapic = default_update_genapic, + .update_apic = default_update_apic, }; struct x86_quirks *x86_quirks __initdata = &default_x86_quirks; @@ -656,7 +669,6 @@ void __init setup_arch(char **cmdline_p) #ifdef CONFIG_X86_32 memcpy(&boot_cpu_data, &new_cpu_data, sizeof(new_cpu_data)); visws_early_detect(); - pre_setup_arch_hook(); #else printk(KERN_INFO "Command line: %s\n", boot_command_line); #endif @@ -823,8 +835,7 @@ void __init setup_arch(char **cmdline_p) #else num_physpages = max_pfn; - if (cpu_has_x2apic) - check_x2apic(); + check_x2apic(); /* How many end-of-memory variables you have, grandma! */ /* need this before calling reserve_initrd */ @@ -892,12 +903,11 @@ void __init setup_arch(char **cmdline_p) */ acpi_reserve_bootmem(); #endif -#ifdef CONFIG_X86_FIND_SMP_CONFIG /* * Find and reserve possible boot-time SMP configuration: */ find_smp_config(); -#endif + reserve_crashkernel(); #ifdef CONFIG_X86_64 @@ -924,9 +934,7 @@ void __init setup_arch(char **cmdline_p) map_vsyscall(); #endif -#ifdef CONFIG_X86_GENERICARCH generic_apic_probe(); -#endif early_quirks(); @@ -977,4 +985,95 @@ void __init setup_arch(char **cmdline_p) #endif } +#ifdef CONFIG_X86_32 + +/** + * x86_quirk_pre_intr_init - initialisation prior to setting up interrupt vectors + * + * Description: + * Perform any necessary interrupt initialisation prior to setting up + * the "ordinary" interrupt call gates. For legacy reasons, the ISA + * interrupts should be initialised here if the machine emulates a PC + * in any way. + **/ +void __init x86_quirk_pre_intr_init(void) +{ + if (x86_quirks->arch_pre_intr_init) { + if (x86_quirks->arch_pre_intr_init()) + return; + } + init_ISA_irqs(); +} + +/** + * x86_quirk_intr_init - post gate setup interrupt initialisation + * + * Description: + * Fill in any interrupts that may have been left out by the general + * init_IRQ() routine. interrupts having to do with the machine rather + * than the devices on the I/O bus (like APIC interrupts in intel MP + * systems) are started here. + **/ +void __init x86_quirk_intr_init(void) +{ + if (x86_quirks->arch_intr_init) { + if (x86_quirks->arch_intr_init()) + return; + } +} + +/** + * x86_quirk_trap_init - initialise system specific traps + * + * Description: + * Called as the final act of trap_init(). Used in VISWS to initialise + * the various board specific APIC traps. + **/ +void __init x86_quirk_trap_init(void) +{ + if (x86_quirks->arch_trap_init) { + if (x86_quirks->arch_trap_init()) + return; + } +} + +static struct irqaction irq0 = { + .handler = timer_interrupt, + .flags = IRQF_DISABLED | IRQF_NOBALANCING | IRQF_IRQPOLL | IRQF_TIMER, + .mask = CPU_MASK_NONE, + .name = "timer" +}; +/** + * x86_quirk_pre_time_init - do any specific initialisations before. + * + **/ +void __init x86_quirk_pre_time_init(void) +{ + if (x86_quirks->arch_pre_time_init) + x86_quirks->arch_pre_time_init(); +} + +/** + * x86_quirk_time_init - do any specific initialisations for the system timer. + * + * Description: + * Must plug the system timer interrupt source at HZ into the IRQ listed + * in irq_vectors.h:TIMER_IRQ + **/ +void __init x86_quirk_time_init(void) +{ + if (x86_quirks->arch_time_init) { + /* + * A nonzero return code does not mean failure, it means + * that the architecture quirk does not want any + * generic (timer) setup to be performed after this: + */ + if (x86_quirks->arch_time_init()) + return; + } + + irq0.mask = cpumask_of_cpu(0); + setup_irq(0, &irq0); +} +#endif /* CONFIG_X86_32 */ Index: linux-2.6-tip/arch/x86/kernel/setup_percpu.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/setup_percpu.c +++ linux-2.6-tip/arch/x86/kernel/setup_percpu.c @@ -13,145 +13,47 @@ #include #include #include +#include +#include +#include +#include -#ifdef CONFIG_X86_LOCAL_APIC -unsigned int num_processors; -unsigned disabled_cpus __cpuinitdata; -/* Processor that is doing the boot up */ -unsigned int boot_cpu_physical_apicid = -1U; -EXPORT_SYMBOL(boot_cpu_physical_apicid); -unsigned int max_physical_apicid; - -/* Bitmask of physically existing CPUs */ -physid_mask_t phys_cpu_present_map; +#ifdef CONFIG_DEBUG_PER_CPU_MAPS +# define DBG(x...) printk(KERN_DEBUG x) +#else +# define DBG(x...) #endif -/* map cpu index to physical APIC ID */ -DEFINE_EARLY_PER_CPU(u16, x86_cpu_to_apicid, BAD_APICID); -DEFINE_EARLY_PER_CPU(u16, x86_bios_cpu_apicid, BAD_APICID); -EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_apicid); -EXPORT_EARLY_PER_CPU_SYMBOL(x86_bios_cpu_apicid); - -#if defined(CONFIG_NUMA) && defined(CONFIG_X86_64) -#define X86_64_NUMA 1 - -/* map cpu index to node index */ -DEFINE_EARLY_PER_CPU(int, x86_cpu_to_node_map, NUMA_NO_NODE); -EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_node_map); - -/* which logical CPUs are on which nodes */ -cpumask_t *node_to_cpumask_map; -EXPORT_SYMBOL(node_to_cpumask_map); - -/* setup node_to_cpumask_map */ -static void __init setup_node_to_cpumask_map(void); +DEFINE_PER_CPU(int, cpu_number); +EXPORT_PER_CPU_SYMBOL(cpu_number); +#ifdef CONFIG_X86_64 +#define BOOT_PERCPU_OFFSET ((unsigned long)__per_cpu_load) #else -static inline void setup_node_to_cpumask_map(void) { } +#define BOOT_PERCPU_OFFSET 0 #endif -#if defined(CONFIG_HAVE_SETUP_PER_CPU_AREA) && defined(CONFIG_X86_SMP) -/* - * Copy data used in early init routines from the initial arrays to the - * per cpu data areas. These arrays then become expendable and the - * *_early_ptr's are zeroed indicating that the static arrays are gone. - */ -static void __init setup_per_cpu_maps(void) -{ - int cpu; +DEFINE_PER_CPU(unsigned long, this_cpu_off) = BOOT_PERCPU_OFFSET; +EXPORT_PER_CPU_SYMBOL(this_cpu_off); - for_each_possible_cpu(cpu) { - per_cpu(x86_cpu_to_apicid, cpu) = - early_per_cpu_map(x86_cpu_to_apicid, cpu); - per_cpu(x86_bios_cpu_apicid, cpu) = - early_per_cpu_map(x86_bios_cpu_apicid, cpu); -#ifdef X86_64_NUMA - per_cpu(x86_cpu_to_node_map, cpu) = - early_per_cpu_map(x86_cpu_to_node_map, cpu); -#endif - } - - /* indicate the early static arrays will soon be gone */ - early_per_cpu_ptr(x86_cpu_to_apicid) = NULL; - early_per_cpu_ptr(x86_bios_cpu_apicid) = NULL; -#ifdef X86_64_NUMA - early_per_cpu_ptr(x86_cpu_to_node_map) = NULL; -#endif -} - -#ifdef CONFIG_X86_32 -/* - * Great future not-so-futuristic plan: make i386 and x86_64 do it - * the same way - */ -unsigned long __per_cpu_offset[NR_CPUS] __read_mostly; +unsigned long __per_cpu_offset[NR_CPUS] __read_mostly = { + [0 ... NR_CPUS-1] = BOOT_PERCPU_OFFSET, +}; EXPORT_SYMBOL(__per_cpu_offset); -static inline void setup_cpu_pda_map(void) { } - -#elif !defined(CONFIG_SMP) -static inline void setup_cpu_pda_map(void) { } - -#else /* CONFIG_SMP && CONFIG_X86_64 */ -/* - * Allocate cpu_pda pointer table and array via alloc_bootmem. - */ -static void __init setup_cpu_pda_map(void) +static inline void setup_percpu_segment(int cpu) { - char *pda; - struct x8664_pda **new_cpu_pda; - unsigned long size; - int cpu; - - size = roundup(sizeof(struct x8664_pda), cache_line_size()); - - /* allocate cpu_pda array and pointer table */ - { - unsigned long tsize = nr_cpu_ids * sizeof(void *); - unsigned long asize = size * (nr_cpu_ids - 1); - - tsize = roundup(tsize, cache_line_size()); - new_cpu_pda = alloc_bootmem(tsize + asize); - pda = (char *)new_cpu_pda + tsize; - } - - /* initialize pointer table to static pda's */ - for_each_possible_cpu(cpu) { - if (cpu == 0) { - /* leave boot cpu pda in place */ - new_cpu_pda[0] = cpu_pda(0); - continue; - } - new_cpu_pda[cpu] = (struct x8664_pda *)pda; - new_cpu_pda[cpu]->in_bootmem = 1; - pda += size; - } - - /* point to new pointer table */ - _cpu_pda = new_cpu_pda; -} - -#endif /* CONFIG_SMP && CONFIG_X86_64 */ - -#ifdef CONFIG_X86_64 - -/* correctly size the local cpu masks */ -static void __init setup_cpu_local_masks(void) -{ - alloc_bootmem_cpumask_var(&cpu_initialized_mask); - alloc_bootmem_cpumask_var(&cpu_callin_mask); - alloc_bootmem_cpumask_var(&cpu_callout_mask); - alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask); -} - -#else /* CONFIG_X86_32 */ +#ifdef CONFIG_X86_32 + struct desc_struct gdt; -static inline void setup_cpu_local_masks(void) -{ + pack_descriptor(&gdt, per_cpu_offset(cpu), 0xFFFFF, + 0x2 | DESCTYPE_S, 0x8); + gdt.s = 1; + write_gdt_entry(get_cpu_gdt_table(cpu), + GDT_ENTRY_PERCPU, &gdt, DESCTYPE_S); +#endif } -#endif /* CONFIG_X86_32 */ - /* * Great future plan: * Declare PDA itself and support (irqstack,tss,pgd) as per cpu data. @@ -159,18 +61,12 @@ static inline void setup_cpu_local_masks */ void __init setup_per_cpu_areas(void) { - ssize_t size, old_size; + ssize_t size; char *ptr; int cpu; - unsigned long align = 1; - - /* Setup cpu_pda map */ - setup_cpu_pda_map(); /* Copy section for each CPU (we discard the original) */ - old_size = PERCPU_ENOUGH_ROOM; - align = max_t(unsigned long, PAGE_SIZE, align); - size = roundup(old_size, align); + size = roundup(PERCPU_ENOUGH_ROOM, PAGE_SIZE); pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n", NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids); @@ -179,30 +75,68 @@ void __init setup_per_cpu_areas(void) for_each_possible_cpu(cpu) { #ifndef CONFIG_NEED_MULTIPLE_NODES - ptr = __alloc_bootmem(size, align, - __pa(MAX_DMA_ADDRESS)); + ptr = alloc_bootmem_pages(size); #else int node = early_cpu_to_node(cpu); if (!node_online(node) || !NODE_DATA(node)) { - ptr = __alloc_bootmem(size, align, - __pa(MAX_DMA_ADDRESS)); + ptr = alloc_bootmem_pages(size); pr_info("cpu %d has no node %d or node-local memory\n", cpu, node); pr_debug("per cpu data for cpu%d at %016lx\n", cpu, __pa(ptr)); } else { - ptr = __alloc_bootmem_node(NODE_DATA(node), size, align, - __pa(MAX_DMA_ADDRESS)); + ptr = alloc_bootmem_pages_node(NODE_DATA(node), size); pr_debug("per cpu data for cpu%d on node%d at %016lx\n", cpu, node, __pa(ptr)); } #endif + + memcpy(ptr, __per_cpu_load, __per_cpu_end - __per_cpu_start); per_cpu_offset(cpu) = ptr - __per_cpu_start; - memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start); + per_cpu(this_cpu_off, cpu) = per_cpu_offset(cpu); + per_cpu(cpu_number, cpu) = cpu; + setup_percpu_segment(cpu); + setup_stack_canary_segment(cpu); + /* + * Copy data used in early init routines from the + * initial arrays to the per cpu data areas. These + * arrays then become expendable and the *_early_ptr's + * are zeroed indicating that the static arrays are + * gone. + */ +#ifdef CONFIG_X86_LOCAL_APIC + per_cpu(x86_cpu_to_apicid, cpu) = + early_per_cpu_map(x86_cpu_to_apicid, cpu); + per_cpu(x86_bios_cpu_apicid, cpu) = + early_per_cpu_map(x86_bios_cpu_apicid, cpu); +#endif +#ifdef CONFIG_X86_64 + per_cpu(irq_stack_ptr, cpu) = + per_cpu(irq_stack_union.irq_stack, cpu) + + IRQ_STACK_SIZE - 64; +#ifdef CONFIG_NUMA + per_cpu(x86_cpu_to_node_map, cpu) = + early_per_cpu_map(x86_cpu_to_node_map, cpu); +#endif +#endif + /* + * Up to this point, the boot CPU has been using .data.init + * area. Reload any changed state for the boot CPU. + */ + if (cpu == boot_cpu_id) + switch_to_new_gdt(cpu); + + DBG("PERCPU: cpu %4d %p\n", cpu, ptr); } - /* Setup percpu data maps */ - setup_per_cpu_maps(); + /* indicate the early static arrays will soon be gone */ +#ifdef CONFIG_X86_LOCAL_APIC + early_per_cpu_ptr(x86_cpu_to_apicid) = NULL; + early_per_cpu_ptr(x86_bios_cpu_apicid) = NULL; +#endif +#if defined(CONFIG_X86_64) && defined(CONFIG_NUMA) + early_per_cpu_ptr(x86_cpu_to_node_map) = NULL; +#endif /* Setup node to cpumask map */ setup_node_to_cpumask_map(); @@ -210,199 +144,3 @@ void __init setup_per_cpu_areas(void) /* Setup cpu initialized, callin, callout masks */ setup_cpu_local_masks(); } - -#endif - -#ifdef X86_64_NUMA - -/* - * Allocate node_to_cpumask_map based on number of available nodes - * Requires node_possible_map to be valid. - * - * Note: node_to_cpumask() is not valid until after this is done. - */ -static void __init setup_node_to_cpumask_map(void) -{ - unsigned int node, num = 0; - cpumask_t *map; - - /* setup nr_node_ids if not done yet */ - if (nr_node_ids == MAX_NUMNODES) { - for_each_node_mask(node, node_possible_map) - num = node; - nr_node_ids = num + 1; - } - - /* allocate the map */ - map = alloc_bootmem_low(nr_node_ids * sizeof(cpumask_t)); - - pr_debug("Node to cpumask map at %p for %d nodes\n", - map, nr_node_ids); - - /* node_to_cpumask() will now work */ - node_to_cpumask_map = map; -} - -void __cpuinit numa_set_node(int cpu, int node) -{ - int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); - - if (cpu_pda(cpu) && node != NUMA_NO_NODE) - cpu_pda(cpu)->nodenumber = node; - - if (cpu_to_node_map) - cpu_to_node_map[cpu] = node; - - else if (per_cpu_offset(cpu)) - per_cpu(x86_cpu_to_node_map, cpu) = node; - - else - pr_debug("Setting node for non-present cpu %d\n", cpu); -} - -void __cpuinit numa_clear_node(int cpu) -{ - numa_set_node(cpu, NUMA_NO_NODE); -} - -#ifndef CONFIG_DEBUG_PER_CPU_MAPS - -void __cpuinit numa_add_cpu(int cpu) -{ - cpu_set(cpu, node_to_cpumask_map[early_cpu_to_node(cpu)]); -} - -void __cpuinit numa_remove_cpu(int cpu) -{ - cpu_clear(cpu, node_to_cpumask_map[cpu_to_node(cpu)]); -} - -#else /* CONFIG_DEBUG_PER_CPU_MAPS */ - -/* - * --------- debug versions of the numa functions --------- - */ -static void __cpuinit numa_set_cpumask(int cpu, int enable) -{ - int node = cpu_to_node(cpu); - cpumask_t *mask; - char buf[64]; - - if (node_to_cpumask_map == NULL) { - printk(KERN_ERR "node_to_cpumask_map NULL\n"); - dump_stack(); - return; - } - - mask = &node_to_cpumask_map[node]; - if (enable) - cpu_set(cpu, *mask); - else - cpu_clear(cpu, *mask); - - cpulist_scnprintf(buf, sizeof(buf), mask); - printk(KERN_DEBUG "%s cpu %d node %d: mask now %s\n", - enable ? "numa_add_cpu" : "numa_remove_cpu", cpu, node, buf); -} - -void __cpuinit numa_add_cpu(int cpu) -{ - numa_set_cpumask(cpu, 1); -} - -void __cpuinit numa_remove_cpu(int cpu) -{ - numa_set_cpumask(cpu, 0); -} - -int cpu_to_node(int cpu) -{ - if (early_per_cpu_ptr(x86_cpu_to_node_map)) { - printk(KERN_WARNING - "cpu_to_node(%d): usage too early!\n", cpu); - dump_stack(); - return early_per_cpu_ptr(x86_cpu_to_node_map)[cpu]; - } - return per_cpu(x86_cpu_to_node_map, cpu); -} -EXPORT_SYMBOL(cpu_to_node); - -/* - * Same function as cpu_to_node() but used if called before the - * per_cpu areas are setup. - */ -int early_cpu_to_node(int cpu) -{ - if (early_per_cpu_ptr(x86_cpu_to_node_map)) - return early_per_cpu_ptr(x86_cpu_to_node_map)[cpu]; - - if (!per_cpu_offset(cpu)) { - printk(KERN_WARNING - "early_cpu_to_node(%d): no per_cpu area!\n", cpu); - dump_stack(); - return NUMA_NO_NODE; - } - return per_cpu(x86_cpu_to_node_map, cpu); -} - - -/* empty cpumask */ -static const cpumask_t cpu_mask_none; - -/* - * Returns a pointer to the bitmask of CPUs on Node 'node'. - */ -const cpumask_t *cpumask_of_node(int node) -{ - if (node_to_cpumask_map == NULL) { - printk(KERN_WARNING - "cpumask_of_node(%d): no node_to_cpumask_map!\n", - node); - dump_stack(); - return (const cpumask_t *)&cpu_online_map; - } - if (node >= nr_node_ids) { - printk(KERN_WARNING - "cpumask_of_node(%d): node > nr_node_ids(%d)\n", - node, nr_node_ids); - dump_stack(); - return &cpu_mask_none; - } - return &node_to_cpumask_map[node]; -} -EXPORT_SYMBOL(cpumask_of_node); - -/* - * Returns a bitmask of CPUs on Node 'node'. - * - * Side note: this function creates the returned cpumask on the stack - * so with a high NR_CPUS count, excessive stack space is used. The - * node_to_cpumask_ptr function should be used whenever possible. - */ -cpumask_t node_to_cpumask(int node) -{ - if (node_to_cpumask_map == NULL) { - printk(KERN_WARNING - "node_to_cpumask(%d): no node_to_cpumask_map!\n", node); - dump_stack(); - return cpu_online_map; - } - if (node >= nr_node_ids) { - printk(KERN_WARNING - "node_to_cpumask(%d): node > nr_node_ids(%d)\n", - node, nr_node_ids); - dump_stack(); - return cpu_mask_none; - } - return node_to_cpumask_map[node]; -} -EXPORT_SYMBOL(node_to_cpumask); - -/* - * --------- end of debug versions of the numa functions --------- - */ - -#endif /* CONFIG_DEBUG_PER_CPU_MAPS */ - -#endif /* X86_64_NUMA */ - Index: linux-2.6-tip/arch/x86/kernel/signal.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/signal.c +++ linux-2.6-tip/arch/x86/kernel/signal.c @@ -6,7 +6,7 @@ * 2000-06-20 Pentium III FXSR, SSE support by Gareth Hughes * 2000-2002 x86-64 support by Andi Kleen */ - +#include #include #include #include @@ -50,27 +50,23 @@ # define FIX_EFLAGS __FIX_EFLAGS #endif -#define COPY(x) { \ - err |= __get_user(regs->x, &sc->x); \ -} - -#define COPY_SEG(seg) { \ - unsigned short tmp; \ - err |= __get_user(tmp, &sc->seg); \ - regs->seg = tmp; \ -} - -#define COPY_SEG_CPL3(seg) { \ - unsigned short tmp; \ - err |= __get_user(tmp, &sc->seg); \ - regs->seg = tmp | 3; \ -} - -#define GET_SEG(seg) { \ - unsigned short tmp; \ - err |= __get_user(tmp, &sc->seg); \ - loadsegment(seg, tmp); \ -} +#define COPY(x) do { \ + get_user_ex(regs->x, &sc->x); \ +} while (0) + +#define GET_SEG(seg) ({ \ + unsigned short tmp; \ + get_user_ex(tmp, &sc->seg); \ + tmp; \ +}) + +#define COPY_SEG(seg) do { \ + regs->seg = GET_SEG(seg); \ +} while (0) + +#define COPY_SEG_CPL3(seg) do { \ + regs->seg = GET_SEG(seg) | 3; \ +} while (0) static int restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc, @@ -83,45 +79,49 @@ restore_sigcontext(struct pt_regs *regs, /* Always make any pending restarted system calls return -EINTR */ current_thread_info()->restart_block.fn = do_no_restart_syscall; + get_user_try { + #ifdef CONFIG_X86_32 - GET_SEG(gs); - COPY_SEG(fs); - COPY_SEG(es); - COPY_SEG(ds); + set_user_gs(regs, GET_SEG(gs)); + COPY_SEG(fs); + COPY_SEG(es); + COPY_SEG(ds); #endif /* CONFIG_X86_32 */ - COPY(di); COPY(si); COPY(bp); COPY(sp); COPY(bx); - COPY(dx); COPY(cx); COPY(ip); + COPY(di); COPY(si); COPY(bp); COPY(sp); COPY(bx); + COPY(dx); COPY(cx); COPY(ip); #ifdef CONFIG_X86_64 - COPY(r8); - COPY(r9); - COPY(r10); - COPY(r11); - COPY(r12); - COPY(r13); - COPY(r14); - COPY(r15); + COPY(r8); + COPY(r9); + COPY(r10); + COPY(r11); + COPY(r12); + COPY(r13); + COPY(r14); + COPY(r15); #endif /* CONFIG_X86_64 */ #ifdef CONFIG_X86_32 - COPY_SEG_CPL3(cs); - COPY_SEG_CPL3(ss); + COPY_SEG_CPL3(cs); + COPY_SEG_CPL3(ss); #else /* !CONFIG_X86_32 */ - /* Kernel saves and restores only the CS segment register on signals, - * which is the bare minimum needed to allow mixed 32/64-bit code. - * App's signal handler can save/restore other segments if needed. */ - COPY_SEG_CPL3(cs); + /* Kernel saves and restores only the CS segment register on signals, + * which is the bare minimum needed to allow mixed 32/64-bit code. + * App's signal handler can save/restore other segments if needed. */ + COPY_SEG_CPL3(cs); #endif /* CONFIG_X86_32 */ - err |= __get_user(tmpflags, &sc->flags); - regs->flags = (regs->flags & ~FIX_EFLAGS) | (tmpflags & FIX_EFLAGS); - regs->orig_ax = -1; /* disable syscall checks */ + get_user_ex(tmpflags, &sc->flags); + regs->flags = (regs->flags & ~FIX_EFLAGS) | (tmpflags & FIX_EFLAGS); + regs->orig_ax = -1; /* disable syscall checks */ + + get_user_ex(buf, &sc->fpstate); + err |= restore_i387_xstate(buf); - err |= __get_user(buf, &sc->fpstate); - err |= restore_i387_xstate(buf); + get_user_ex(*pax, &sc->ax); + } get_user_catch(err); - err |= __get_user(*pax, &sc->ax); return err; } @@ -131,57 +131,55 @@ setup_sigcontext(struct sigcontext __use { int err = 0; -#ifdef CONFIG_X86_32 - { - unsigned int tmp; + put_user_try { - savesegment(gs, tmp); - err |= __put_user(tmp, (unsigned int __user *)&sc->gs); - } - err |= __put_user(regs->fs, (unsigned int __user *)&sc->fs); - err |= __put_user(regs->es, (unsigned int __user *)&sc->es); - err |= __put_user(regs->ds, (unsigned int __user *)&sc->ds); -#endif /* CONFIG_X86_32 */ - - err |= __put_user(regs->di, &sc->di); - err |= __put_user(regs->si, &sc->si); - err |= __put_user(regs->bp, &sc->bp); - err |= __put_user(regs->sp, &sc->sp); - err |= __put_user(regs->bx, &sc->bx); - err |= __put_user(regs->dx, &sc->dx); - err |= __put_user(regs->cx, &sc->cx); - err |= __put_user(regs->ax, &sc->ax); +#ifdef CONFIG_X86_32 + put_user_ex(get_user_gs(regs), (unsigned int __user *)&sc->gs); + put_user_ex(regs->fs, (unsigned int __user *)&sc->fs); + put_user_ex(regs->es, (unsigned int __user *)&sc->es); + put_user_ex(regs->ds, (unsigned int __user *)&sc->ds); +#endif /* CONFIG_X86_32 */ + + put_user_ex(regs->di, &sc->di); + put_user_ex(regs->si, &sc->si); + put_user_ex(regs->bp, &sc->bp); + put_user_ex(regs->sp, &sc->sp); + put_user_ex(regs->bx, &sc->bx); + put_user_ex(regs->dx, &sc->dx); + put_user_ex(regs->cx, &sc->cx); + put_user_ex(regs->ax, &sc->ax); #ifdef CONFIG_X86_64 - err |= __put_user(regs->r8, &sc->r8); - err |= __put_user(regs->r9, &sc->r9); - err |= __put_user(regs->r10, &sc->r10); - err |= __put_user(regs->r11, &sc->r11); - err |= __put_user(regs->r12, &sc->r12); - err |= __put_user(regs->r13, &sc->r13); - err |= __put_user(regs->r14, &sc->r14); - err |= __put_user(regs->r15, &sc->r15); + put_user_ex(regs->r8, &sc->r8); + put_user_ex(regs->r9, &sc->r9); + put_user_ex(regs->r10, &sc->r10); + put_user_ex(regs->r11, &sc->r11); + put_user_ex(regs->r12, &sc->r12); + put_user_ex(regs->r13, &sc->r13); + put_user_ex(regs->r14, &sc->r14); + put_user_ex(regs->r15, &sc->r15); #endif /* CONFIG_X86_64 */ - err |= __put_user(current->thread.trap_no, &sc->trapno); - err |= __put_user(current->thread.error_code, &sc->err); - err |= __put_user(regs->ip, &sc->ip); -#ifdef CONFIG_X86_32 - err |= __put_user(regs->cs, (unsigned int __user *)&sc->cs); - err |= __put_user(regs->flags, &sc->flags); - err |= __put_user(regs->sp, &sc->sp_at_signal); - err |= __put_user(regs->ss, (unsigned int __user *)&sc->ss); + put_user_ex(current->thread.trap_no, &sc->trapno); + put_user_ex(current->thread.error_code, &sc->err); + put_user_ex(regs->ip, &sc->ip); +#ifdef CONFIG_X86_32 + put_user_ex(regs->cs, (unsigned int __user *)&sc->cs); + put_user_ex(regs->flags, &sc->flags); + put_user_ex(regs->sp, &sc->sp_at_signal); + put_user_ex(regs->ss, (unsigned int __user *)&sc->ss); #else /* !CONFIG_X86_32 */ - err |= __put_user(regs->flags, &sc->flags); - err |= __put_user(regs->cs, &sc->cs); - err |= __put_user(0, &sc->gs); - err |= __put_user(0, &sc->fs); + put_user_ex(regs->flags, &sc->flags); + put_user_ex(regs->cs, &sc->cs); + put_user_ex(0, &sc->gs); + put_user_ex(0, &sc->fs); #endif /* CONFIG_X86_32 */ - err |= __put_user(fpstate, &sc->fpstate); - - /* non-iBCS2 extensions.. */ - err |= __put_user(mask, &sc->oldmask); - err |= __put_user(current->thread.cr2, &sc->cr2); + put_user_ex(fpstate, &sc->fpstate); + + /* non-iBCS2 extensions.. */ + put_user_ex(mask, &sc->oldmask); + put_user_ex(current->thread.cr2, &sc->cr2); + } put_user_catch(err); return err; } @@ -336,43 +334,41 @@ static int __setup_rt_frame(int sig, str if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) return -EFAULT; - err |= __put_user(sig, &frame->sig); - err |= __put_user(&frame->info, &frame->pinfo); - err |= __put_user(&frame->uc, &frame->puc); - err |= copy_siginfo_to_user(&frame->info, info); - if (err) - return -EFAULT; - - /* Create the ucontext. */ - if (cpu_has_xsave) - err |= __put_user(UC_FP_XSTATE, &frame->uc.uc_flags); - else - err |= __put_user(0, &frame->uc.uc_flags); - err |= __put_user(0, &frame->uc.uc_link); - err |= __put_user(current->sas_ss_sp, &frame->uc.uc_stack.ss_sp); - err |= __put_user(sas_ss_flags(regs->sp), - &frame->uc.uc_stack.ss_flags); - err |= __put_user(current->sas_ss_size, &frame->uc.uc_stack.ss_size); - err |= setup_sigcontext(&frame->uc.uc_mcontext, fpstate, - regs, set->sig[0]); - err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set)); - if (err) - return -EFAULT; - - /* Set up to return from userspace. */ - restorer = VDSO32_SYMBOL(current->mm->context.vdso, rt_sigreturn); - if (ka->sa.sa_flags & SA_RESTORER) - restorer = ka->sa.sa_restorer; - err |= __put_user(restorer, &frame->pretcode); + put_user_try { + put_user_ex(sig, &frame->sig); + put_user_ex(&frame->info, &frame->pinfo); + put_user_ex(&frame->uc, &frame->puc); + err |= copy_siginfo_to_user(&frame->info, info); + + /* Create the ucontext. */ + if (cpu_has_xsave) + put_user_ex(UC_FP_XSTATE, &frame->uc.uc_flags); + else + put_user_ex(0, &frame->uc.uc_flags); + put_user_ex(0, &frame->uc.uc_link); + put_user_ex(current->sas_ss_sp, &frame->uc.uc_stack.ss_sp); + put_user_ex(sas_ss_flags(regs->sp), + &frame->uc.uc_stack.ss_flags); + put_user_ex(current->sas_ss_size, &frame->uc.uc_stack.ss_size); + err |= setup_sigcontext(&frame->uc.uc_mcontext, fpstate, + regs, set->sig[0]); + err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set)); + + /* Set up to return from userspace. */ + restorer = VDSO32_SYMBOL(current->mm->context.vdso, rt_sigreturn); + if (ka->sa.sa_flags & SA_RESTORER) + restorer = ka->sa.sa_restorer; + put_user_ex(restorer, &frame->pretcode); - /* - * This is movl $__NR_rt_sigreturn, %ax ; int $0x80 - * - * WE DO NOT USE IT ANY MORE! It's only left here for historical - * reasons and because gdb uses it as a signature to notice - * signal handler stack frames. - */ - err |= __put_user(*((u64 *)&rt_retcode), (u64 *)frame->retcode); + /* + * This is movl $__NR_rt_sigreturn, %ax ; int $0x80 + * + * WE DO NOT USE IT ANY MORE! It's only left here for historical + * reasons and because gdb uses it as a signature to notice + * signal handler stack frames. + */ + put_user_ex(*((u64 *)&rt_retcode), (u64 *)frame->retcode); + } put_user_catch(err); if (err) return -EFAULT; @@ -436,28 +432,30 @@ static int __setup_rt_frame(int sig, str return -EFAULT; } - /* Create the ucontext. */ - if (cpu_has_xsave) - err |= __put_user(UC_FP_XSTATE, &frame->uc.uc_flags); - else - err |= __put_user(0, &frame->uc.uc_flags); - err |= __put_user(0, &frame->uc.uc_link); - err |= __put_user(me->sas_ss_sp, &frame->uc.uc_stack.ss_sp); - err |= __put_user(sas_ss_flags(regs->sp), - &frame->uc.uc_stack.ss_flags); - err |= __put_user(me->sas_ss_size, &frame->uc.uc_stack.ss_size); - err |= setup_sigcontext(&frame->uc.uc_mcontext, fp, regs, set->sig[0]); - err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set)); - - /* Set up to return from userspace. If provided, use a stub - already in userspace. */ - /* x86-64 should always use SA_RESTORER. */ - if (ka->sa.sa_flags & SA_RESTORER) { - err |= __put_user(ka->sa.sa_restorer, &frame->pretcode); - } else { - /* could use a vstub here */ - return -EFAULT; - } + put_user_try { + /* Create the ucontext. */ + if (cpu_has_xsave) + put_user_ex(UC_FP_XSTATE, &frame->uc.uc_flags); + else + put_user_ex(0, &frame->uc.uc_flags); + put_user_ex(0, &frame->uc.uc_link); + put_user_ex(me->sas_ss_sp, &frame->uc.uc_stack.ss_sp); + put_user_ex(sas_ss_flags(regs->sp), + &frame->uc.uc_stack.ss_flags); + put_user_ex(me->sas_ss_size, &frame->uc.uc_stack.ss_size); + err |= setup_sigcontext(&frame->uc.uc_mcontext, fp, regs, set->sig[0]); + err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set)); + + /* Set up to return from userspace. If provided, use a stub + already in userspace. */ + /* x86-64 should always use SA_RESTORER. */ + if (ka->sa.sa_flags & SA_RESTORER) { + put_user_ex(ka->sa.sa_restorer, &frame->pretcode); + } else { + /* could use a vstub here */ + err |= -EFAULT; + } + } put_user_catch(err); if (err) return -EFAULT; @@ -509,31 +507,41 @@ sys_sigaction(int sig, const struct old_ struct old_sigaction __user *oact) { struct k_sigaction new_ka, old_ka; - int ret; + int ret = 0; if (act) { old_sigset_t mask; - if (!access_ok(VERIFY_READ, act, sizeof(*act)) || - __get_user(new_ka.sa.sa_handler, &act->sa_handler) || - __get_user(new_ka.sa.sa_restorer, &act->sa_restorer)) + if (!access_ok(VERIFY_READ, act, sizeof(*act))) return -EFAULT; - __get_user(new_ka.sa.sa_flags, &act->sa_flags); - __get_user(mask, &act->sa_mask); + get_user_try { + get_user_ex(new_ka.sa.sa_handler, &act->sa_handler); + get_user_ex(new_ka.sa.sa_flags, &act->sa_flags); + get_user_ex(mask, &act->sa_mask); + get_user_ex(new_ka.sa.sa_restorer, &act->sa_restorer); + } get_user_catch(ret); + + if (ret) + return -EFAULT; siginitset(&new_ka.sa.sa_mask, mask); } ret = do_sigaction(sig, act ? &new_ka : NULL, oact ? &old_ka : NULL); if (!ret && oact) { - if (!access_ok(VERIFY_WRITE, oact, sizeof(*oact)) || - __put_user(old_ka.sa.sa_handler, &oact->sa_handler) || - __put_user(old_ka.sa.sa_restorer, &oact->sa_restorer)) + if (!access_ok(VERIFY_WRITE, oact, sizeof(*oact))) return -EFAULT; - __put_user(old_ka.sa.sa_flags, &oact->sa_flags); - __put_user(old_ka.sa.sa_mask.sig[0], &oact->sa_mask); + put_user_try { + put_user_ex(old_ka.sa.sa_handler, &oact->sa_handler); + put_user_ex(old_ka.sa.sa_flags, &oact->sa_flags); + put_user_ex(old_ka.sa.sa_mask.sig[0], &oact->sa_mask); + put_user_ex(old_ka.sa.sa_restorer, &oact->sa_restorer); + } put_user_catch(ret); + + if (ret) + return -EFAULT; } return ret; @@ -541,14 +549,9 @@ sys_sigaction(int sig, const struct old_ #endif /* CONFIG_X86_32 */ #ifdef CONFIG_X86_32 -asmlinkage int sys_sigaltstack(unsigned long bx) +int sys_sigaltstack(struct pt_regs *regs) { - /* - * This is needed to make gcc realize it doesn't own the - * "struct pt_regs" - */ - struct pt_regs *regs = (struct pt_regs *)&bx; - const stack_t __user *uss = (const stack_t __user *)bx; + const stack_t __user *uss = (const stack_t __user *)regs->bx; stack_t __user *uoss = (stack_t __user *)regs->cx; return do_sigaltstack(uss, uoss, regs->sp); @@ -566,14 +569,12 @@ sys_sigaltstack(const stack_t __user *us * Do a signal return; undo the signal stack. */ #ifdef CONFIG_X86_32 -asmlinkage unsigned long sys_sigreturn(unsigned long __unused) +unsigned long sys_sigreturn(struct pt_regs *regs) { struct sigframe __user *frame; - struct pt_regs *regs; unsigned long ax; sigset_t set; - regs = (struct pt_regs *) &__unused; frame = (struct sigframe __user *)(regs->sp - 8); if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) @@ -600,7 +601,7 @@ badframe: } #endif /* CONFIG_X86_32 */ -static long do_rt_sigreturn(struct pt_regs *regs) +long sys_rt_sigreturn(struct pt_regs *regs) { struct rt_sigframe __user *frame; unsigned long ax; @@ -631,25 +632,6 @@ badframe: return 0; } -#ifdef CONFIG_X86_32 -/* - * Note: do not pass in pt_regs directly as with tail-call optimization - * GCC will incorrectly stomp on the caller's frame and corrupt user-space - * register state: - */ -asmlinkage int sys_rt_sigreturn(unsigned long __unused) -{ - struct pt_regs *regs = (struct pt_regs *)&__unused; - - return do_rt_sigreturn(regs); -} -#else /* !CONFIG_X86_32 */ -asmlinkage long sys_rt_sigreturn(struct pt_regs *regs) -{ - return do_rt_sigreturn(regs); -} -#endif /* CONFIG_X86_32 */ - /* * OK, we're invoking a handler: */ @@ -804,6 +786,13 @@ static void do_signal(struct pt_regs *re int signr; sigset_t *oldset; +#ifdef CONFIG_PREEMPT_RT + /* + * Fully-preemptible kernel does not need interrupts disabled: + */ + local_irq_enable(); + preempt_check_resched(); +#endif /* * We want the common case to go fast, which is why we may in certain * cases get here from kernel mode. Just return without doing anything @@ -893,6 +882,11 @@ do_notify_resume(struct pt_regs *regs, v tracehook_notify_resume(regs); } + if (thread_info_flags & _TIF_PERF_COUNTERS) { + clear_thread_flag(TIF_PERF_COUNTERS); + perf_counter_notify(regs); + } + #ifdef CONFIG_X86_32 clear_thread_flag(TIF_IRET); #endif /* CONFIG_X86_32 */ Index: linux-2.6-tip/arch/x86/kernel/smp.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/smp.c +++ linux-2.6-tip/arch/x86/kernel/smp.c @@ -2,7 +2,7 @@ * Intel SMP support routines. * * (c) 1995 Alan Cox, Building #3 - * (c) 1998-99, 2000 Ingo Molnar + * (c) 1998-99, 2000, 2009 Ingo Molnar * (c) 2002,2003 Andi Kleen, SuSE Labs. * * i386 and x86_64 integration by Glauber Costa @@ -26,8 +26,7 @@ #include #include #include -#include -#include +#include /* * Some notes on x86 processor bugs affecting SMP operation: * @@ -118,12 +117,22 @@ static void native_smp_send_reschedule(i WARN_ON(1); return; } - send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR); + apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR); +} + +/* + * this function sends a 'reschedule' IPI to all other CPUs. + * This is used when RT tasks are starving and other CPUs + * might be able to run them: + */ +void smp_send_reschedule_allbutself(void) +{ + apic->send_IPI_allbutself(RESCHEDULE_VECTOR); } void native_send_call_func_single_ipi(int cpu) { - send_IPI_mask(cpumask_of(cpu), CALL_FUNCTION_SINGLE_VECTOR); + apic->send_IPI_mask(cpumask_of(cpu), CALL_FUNCTION_SINGLE_VECTOR); } void native_send_call_func_ipi(const struct cpumask *mask) @@ -131,7 +140,7 @@ void native_send_call_func_ipi(const str cpumask_var_t allbutself; if (!alloc_cpumask_var(&allbutself, GFP_ATOMIC)) { - send_IPI_mask(mask, CALL_FUNCTION_VECTOR); + apic->send_IPI_mask(mask, CALL_FUNCTION_VECTOR); return; } @@ -140,9 +149,9 @@ void native_send_call_func_ipi(const str if (cpumask_equal(mask, allbutself) && cpumask_equal(cpu_online_mask, cpu_callout_mask)) - send_IPI_allbutself(CALL_FUNCTION_VECTOR); + apic->send_IPI_allbutself(CALL_FUNCTION_VECTOR); else - send_IPI_mask(mask, CALL_FUNCTION_VECTOR); + apic->send_IPI_mask(mask, CALL_FUNCTION_VECTOR); free_cpumask_var(allbutself); } Index: linux-2.6-tip/arch/x86/kernel/smpboot.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/smpboot.c +++ linux-2.6-tip/arch/x86/kernel/smpboot.c @@ -2,7 +2,7 @@ * x86 SMP booting functions * * (c) 1995 Alan Cox, Building #3 - * (c) 1998, 1999, 2000 Ingo Molnar + * (c) 1998, 1999, 2000, 2009 Ingo Molnar * Copyright 2001 Andi Kleen, SuSE Labs. * * Much of the core SMP work is based on previous work by Thomas Radke, to @@ -53,7 +53,6 @@ #include #include #include -#include #include #include #include @@ -61,13 +60,12 @@ #include #include #include -#include +#include #include +#include #include -#include -#include -#include +#include #ifdef CONFIG_X86_32 u8 apicid_2_node[MAX_APICID]; @@ -163,7 +161,7 @@ static void map_cpu_to_logical_apicid(vo { int cpu = smp_processor_id(); int apicid = logical_smp_processor_id(); - int node = apicid_to_node(apicid); + int node = apic->apicid_to_node(apicid); if (!node_online(node)) node = first_online_node; @@ -196,7 +194,8 @@ static void __cpuinit smp_callin(void) * our local APIC. We have to wait for the IPI or we'll * lock up on an APIC access. */ - wait_for_init_deassert(&init_deasserted); + if (apic->wait_for_init_deassert) + apic->wait_for_init_deassert(&init_deasserted); /* * (This works even if the APIC is not enabled.) @@ -243,7 +242,8 @@ static void __cpuinit smp_callin(void) */ pr_debug("CALLIN, before setup_local_APIC().\n"); - smp_callin_clear_local_apic(); + if (apic->smp_callin_clear_local_apic) + apic->smp_callin_clear_local_apic(); setup_local_APIC(); end_local_APIC_setup(); map_cpu_to_logical_apicid(); @@ -583,7 +583,7 @@ wakeup_secondary_cpu_via_nmi(int logical /* Target chip */ /* Boot on the stack */ /* Kick the second */ - apic_icr_write(APIC_DM_NMI | APIC_DEST_LOGICAL, logical_apicid); + apic_icr_write(APIC_DM_NMI | apic->dest_logical, logical_apicid); pr_debug("Waiting for send to finish...\n"); send_status = safe_apic_wait_icr_idle(); @@ -745,78 +745,22 @@ static void __cpuinit do_fork_idle(struc complete(&c_idle->done); } -#ifdef CONFIG_X86_64 - -/* __ref because it's safe to call free_bootmem when after_bootmem == 0. */ -static void __ref free_bootmem_pda(struct x8664_pda *oldpda) -{ - if (!after_bootmem) - free_bootmem((unsigned long)oldpda, sizeof(*oldpda)); -} - -/* - * Allocate node local memory for the AP pda. - * - * Must be called after the _cpu_pda pointer table is initialized. - */ -int __cpuinit get_local_pda(int cpu) -{ - struct x8664_pda *oldpda, *newpda; - unsigned long size = sizeof(struct x8664_pda); - int node = cpu_to_node(cpu); - - if (cpu_pda(cpu) && !cpu_pda(cpu)->in_bootmem) - return 0; - - oldpda = cpu_pda(cpu); - newpda = kmalloc_node(size, GFP_ATOMIC, node); - if (!newpda) { - printk(KERN_ERR "Could not allocate node local PDA " - "for CPU %d on node %d\n", cpu, node); - - if (oldpda) - return 0; /* have a usable pda */ - else - return -1; - } - - if (oldpda) { - memcpy(newpda, oldpda, size); - free_bootmem_pda(oldpda); - } - - newpda->in_bootmem = 0; - cpu_pda(cpu) = newpda; - return 0; -} -#endif /* CONFIG_X86_64 */ - -static int __cpuinit do_boot_cpu(int apicid, int cpu) /* * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad * (ie clustered apic addressing mode), this is a LOGICAL apic ID. - * Returns zero if CPU booted OK, else error code from wakeup_secondary_cpu. + * Returns zero if CPU booted OK, else error code from ->wakeup_cpu. */ +static int __cpuinit do_boot_cpu(int apicid, int cpu) { unsigned long boot_error = 0; - int timeout; unsigned long start_ip; - unsigned short nmi_high = 0, nmi_low = 0; + int timeout; struct create_idle c_idle = { - .cpu = cpu, - .done = COMPLETION_INITIALIZER_ONSTACK(c_idle.done), + .cpu = cpu, + .done = COMPLETION_INITIALIZER_ONSTACK(c_idle.done), }; - INIT_WORK(&c_idle.work, do_fork_idle); -#ifdef CONFIG_X86_64 - /* Allocate node local memory for AP pdas */ - if (cpu > 0) { - boot_error = get_local_pda(cpu); - if (boot_error) - goto restore_state; - /* if can't get pda memory, can't start cpu */ - } -#endif + INIT_WORK(&c_idle.work, do_fork_idle); alternatives_smp_switch(1); @@ -847,14 +791,16 @@ static int __cpuinit do_boot_cpu(int api set_idle_for_cpu(cpu, c_idle.idle); do_rest: -#ifdef CONFIG_X86_32 per_cpu(current_task, cpu) = c_idle.idle; - init_gdt(cpu); +#ifdef CONFIG_X86_32 /* Stack for startup_32 can be just as for start_secondary onwards */ irq_ctx_init(cpu); #else - cpu_pda(cpu)->pcurrent = c_idle.idle; clear_tsk_thread_flag(c_idle.idle, TIF_FORK); + initial_gs = per_cpu_offset(cpu); + per_cpu(kernel_stack, cpu) = + (unsigned long)task_stack_page(c_idle.idle) - + KERNEL_STACK_OFFSET + THREAD_SIZE; #endif early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu); initial_code = (unsigned long)start_secondary; @@ -878,8 +824,6 @@ do_rest: pr_debug("Setting warm reset code and vector.\n"); - store_NMI_vector(&nmi_high, &nmi_low); - smpboot_setup_warm_reset_vector(start_ip); /* * Be paranoid about clearing APIC errors. @@ -893,7 +837,7 @@ do_rest: /* * Starting actual IPI sequence... */ - boot_error = wakeup_secondary_cpu(apicid, start_ip); + boot_error = apic->wakeup_cpu(apicid, start_ip); if (!boot_error) { /* @@ -927,13 +871,11 @@ do_rest: else /* trampoline code not run */ printk(KERN_ERR "Not responding.\n"); - if (get_uv_system_type() != UV_NON_UNIQUE_APIC) - inquire_remote_apic(apicid); + if (apic->inquire_remote_apic) + apic->inquire_remote_apic(apicid); } } -#ifdef CONFIG_X86_64 -restore_state: -#endif + if (boot_error) { /* Try to put things back the way they were before ... */ numa_remove_cpu(cpu); /* was set by numa_add_cpu */ @@ -961,7 +903,7 @@ restore_state: int __cpuinit native_cpu_up(unsigned int cpu) { - int apicid = cpu_present_to_apicid(cpu); + int apicid = apic->cpu_present_to_apicid(cpu); unsigned long flags; int err; @@ -1054,14 +996,14 @@ static int __init smp_sanity_check(unsig { preempt_disable(); -#if defined(CONFIG_X86_PC) && defined(CONFIG_X86_32) +#if !defined(CONFIG_X86_BIGSMP) && defined(CONFIG_X86_32) if (def_to_bigsmp && nr_cpu_ids > 8) { unsigned int cpu; unsigned nr; printk(KERN_WARNING "More than 8 CPUs detected - skipping them.\n" - "Use CONFIG_X86_GENERICARCH and CONFIG_X86_BIGSMP.\n"); + "Use CONFIG_X86_BIGSMP.\n"); nr = 0; for_each_present_cpu(cpu) { @@ -1107,7 +1049,7 @@ static int __init smp_sanity_check(unsig * Should not be necessary because the MP table should list the boot * CPU too, but we do it for the sake of robustness anyway. */ - if (!check_phys_apicid_present(boot_cpu_physical_apicid)) { + if (!apic->check_phys_apicid_present(boot_cpu_physical_apicid)) { printk(KERN_NOTICE "weird, boot CPU (#%d) not listed by the BIOS.\n", boot_cpu_physical_apicid); @@ -1125,6 +1067,7 @@ static int __init smp_sanity_check(unsig printk(KERN_ERR "... forcing use of dummy APIC emulation." "(tell your hw vendor)\n"); smpboot_clear_io_apic(); + arch_disable_smp_support(); return -1; } @@ -1181,9 +1124,9 @@ void __init native_smp_prepare_cpus(unsi current_thread_info()->cpu = 0; /* needed? */ set_cpu_sibling_map(0); -#ifdef CONFIG_X86_64 enable_IR_x2apic(); - setup_apic_routing(); +#ifdef CONFIG_X86_64 + default_setup_apic_routing(); #endif if (smp_sanity_check(max_cpus) < 0) { @@ -1207,18 +1150,18 @@ void __init native_smp_prepare_cpus(unsi */ setup_local_APIC(); -#ifdef CONFIG_X86_64 /* * Enable IO APIC before setting up error vector */ if (!skip_ioapic_setup && nr_ioapics) enable_IO_APIC(); -#endif + end_local_APIC_setup(); map_cpu_to_logical_apicid(); - setup_portio_remap(); + if (apic->setup_portio_remap) + apic->setup_portio_remap(); smpboot_setup_io_apic(); /* @@ -1240,10 +1183,7 @@ out: void __init native_smp_prepare_boot_cpu(void) { int me = smp_processor_id(); -#ifdef CONFIG_X86_32 - init_gdt(me); -#endif - switch_to_new_gdt(); + switch_to_new_gdt(me); /* already set me in cpu_online_mask in boot_cpu_init() */ cpumask_set_cpu(me, cpu_callout_mask); per_cpu(cpu_state, me) = CPU_ONLINE; Index: linux-2.6-tip/arch/x86/kernel/smpcommon.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/smpcommon.c +++ /dev/null @@ -1,30 +0,0 @@ -/* - * SMP stuff which is common to all sub-architectures. - */ -#include -#include - -#ifdef CONFIG_X86_32 -DEFINE_PER_CPU(unsigned long, this_cpu_off); -EXPORT_PER_CPU_SYMBOL(this_cpu_off); - -/* - * Initialize the CPU's GDT. This is either the boot CPU doing itself - * (still using the master per-cpu area), or a CPU doing it for a - * secondary which will soon come up. - */ -__cpuinit void init_gdt(int cpu) -{ - struct desc_struct gdt; - - pack_descriptor(&gdt, __per_cpu_offset[cpu], 0xFFFFF, - 0x2 | DESCTYPE_S, 0x8); - gdt.s = 1; - - write_gdt_entry(get_cpu_gdt_table(cpu), - GDT_ENTRY_PERCPU, &gdt, DESCTYPE_S); - - per_cpu(this_cpu_off, cpu) = __per_cpu_offset[cpu]; - per_cpu(cpu_number, cpu) = cpu; -} -#endif Index: linux-2.6-tip/arch/x86/kernel/stacktrace.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/stacktrace.c +++ linux-2.6-tip/arch/x86/kernel/stacktrace.c @@ -1,7 +1,7 @@ /* * Stack trace management functions * - * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar + * Copyright (C) 2006-2009 Red Hat, Inc., Ingo Molnar */ #include #include @@ -77,6 +77,13 @@ void save_stack_trace(struct stack_trace } EXPORT_SYMBOL_GPL(save_stack_trace); +void save_stack_trace_bp(struct stack_trace *trace, unsigned long bp) +{ + dump_trace(current, NULL, NULL, bp, &save_stack_ops, trace); + if (trace->nr_entries < trace->max_entries) + trace->entries[trace->nr_entries++] = ULONG_MAX; +} + void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace) { dump_trace(tsk, NULL, NULL, 0, &save_stack_ops_nosched, trace); Index: linux-2.6-tip/arch/x86/kernel/summit_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/summit_32.c +++ /dev/null @@ -1,188 +0,0 @@ -/* - * IBM Summit-Specific Code - * - * Written By: Matthew Dobson, IBM Corporation - * - * Copyright (c) 2003 IBM Corp. - * - * All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or (at - * your option) any later version. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or - * NON INFRINGEMENT. See the GNU General Public License for more - * details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. - * - * Send feedback to - * - */ - -#include -#include -#include -#include -#include - -static struct rio_table_hdr *rio_table_hdr __initdata; -static struct scal_detail *scal_devs[MAX_NUMNODES] __initdata; -static struct rio_detail *rio_devs[MAX_NUMNODES*4] __initdata; - -#ifndef CONFIG_X86_NUMAQ -static int mp_bus_id_to_node[MAX_MP_BUSSES] __initdata; -#endif - -static int __init setup_pci_node_map_for_wpeg(int wpeg_num, int last_bus) -{ - int twister = 0, node = 0; - int i, bus, num_buses; - - for (i = 0; i < rio_table_hdr->num_rio_dev; i++) { - if (rio_devs[i]->node_id == rio_devs[wpeg_num]->owner_id) { - twister = rio_devs[i]->owner_id; - break; - } - } - if (i == rio_table_hdr->num_rio_dev) { - printk(KERN_ERR "%s: Couldn't find owner Cyclone for Winnipeg!\n", __func__); - return last_bus; - } - - for (i = 0; i < rio_table_hdr->num_scal_dev; i++) { - if (scal_devs[i]->node_id == twister) { - node = scal_devs[i]->node_id; - break; - } - } - if (i == rio_table_hdr->num_scal_dev) { - printk(KERN_ERR "%s: Couldn't find owner Twister for Cyclone!\n", __func__); - return last_bus; - } - - switch (rio_devs[wpeg_num]->type) { - case CompatWPEG: - /* - * The Compatibility Winnipeg controls the 2 legacy buses, - * the 66MHz PCI bus [2 slots] and the 2 "extra" buses in case - * a PCI-PCI bridge card is used in either slot: total 5 buses. - */ - num_buses = 5; - break; - case AltWPEG: - /* - * The Alternate Winnipeg controls the 2 133MHz buses [1 slot - * each], their 2 "extra" buses, the 100MHz bus [2 slots] and - * the "extra" buses for each of those slots: total 7 buses. - */ - num_buses = 7; - break; - case LookOutAWPEG: - case LookOutBWPEG: - /* - * A Lookout Winnipeg controls 3 100MHz buses [2 slots each] - * & the "extra" buses for each of those slots: total 9 buses. - */ - num_buses = 9; - break; - default: - printk(KERN_INFO "%s: Unsupported Winnipeg type!\n", __func__); - return last_bus; - } - - for (bus = last_bus; bus < last_bus + num_buses; bus++) - mp_bus_id_to_node[bus] = node; - return bus; -} - -static int __init build_detail_arrays(void) -{ - unsigned long ptr; - int i, scal_detail_size, rio_detail_size; - - if (rio_table_hdr->num_scal_dev > MAX_NUMNODES) { - printk(KERN_WARNING "%s: MAX_NUMNODES too low! Defined as %d, but system has %d nodes.\n", __func__, MAX_NUMNODES, rio_table_hdr->num_scal_dev); - return 0; - } - - switch (rio_table_hdr->version) { - default: - printk(KERN_WARNING "%s: Invalid Rio Grande Table Version: %d\n", __func__, rio_table_hdr->version); - return 0; - case 2: - scal_detail_size = 11; - rio_detail_size = 13; - break; - case 3: - scal_detail_size = 12; - rio_detail_size = 15; - break; - } - - ptr = (unsigned long)rio_table_hdr + 3; - for (i = 0; i < rio_table_hdr->num_scal_dev; i++, ptr += scal_detail_size) - scal_devs[i] = (struct scal_detail *)ptr; - - for (i = 0; i < rio_table_hdr->num_rio_dev; i++, ptr += rio_detail_size) - rio_devs[i] = (struct rio_detail *)ptr; - - return 1; -} - -void __init setup_summit(void) -{ - unsigned long ptr; - unsigned short offset; - int i, next_wpeg, next_bus = 0; - - /* The pointer to the EBDA is stored in the word @ phys 0x40E(40:0E) */ - ptr = get_bios_ebda(); - ptr = (unsigned long)phys_to_virt(ptr); - - rio_table_hdr = NULL; - offset = 0x180; - while (offset) { - /* The block id is stored in the 2nd word */ - if (*((unsigned short *)(ptr + offset + 2)) == 0x4752) { - /* set the pointer past the offset & block id */ - rio_table_hdr = (struct rio_table_hdr *)(ptr + offset + 4); - break; - } - /* The next offset is stored in the 1st word. 0 means no more */ - offset = *((unsigned short *)(ptr + offset)); - } - if (!rio_table_hdr) { - printk(KERN_ERR "%s: Unable to locate Rio Grande Table in EBDA - bailing!\n", __func__); - return; - } - - if (!build_detail_arrays()) - return; - - /* The first Winnipeg we're looking for has an index of 0 */ - next_wpeg = 0; - do { - for (i = 0; i < rio_table_hdr->num_rio_dev; i++) { - if (is_WPEG(rio_devs[i]) && rio_devs[i]->WP_index == next_wpeg) { - /* It's the Winnipeg we're looking for! */ - next_bus = setup_pci_node_map_for_wpeg(i, next_bus); - next_wpeg++; - break; - } - } - /* - * If we go through all Rio devices and don't find one with - * the next index, it means we've found all the Winnipegs, - * and thus all the PCI buses. - */ - if (i == rio_table_hdr->num_rio_dev) - next_wpeg = 0; - } while (next_wpeg != 0); -} Index: linux-2.6-tip/arch/x86/kernel/syscall_table_32.S =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/syscall_table_32.S +++ linux-2.6-tip/arch/x86/kernel/syscall_table_32.S @@ -1,7 +1,7 @@ ENTRY(sys_call_table) .long sys_restart_syscall /* 0 - old "setup()" system call, used for restarting */ .long sys_exit - .long sys_fork + .long ptregs_fork .long sys_read .long sys_write .long sys_open /* 5 */ @@ -10,7 +10,7 @@ ENTRY(sys_call_table) .long sys_creat .long sys_link .long sys_unlink /* 10 */ - .long sys_execve + .long ptregs_execve .long sys_chdir .long sys_time .long sys_mknod @@ -109,17 +109,17 @@ ENTRY(sys_call_table) .long sys_newlstat .long sys_newfstat .long sys_uname - .long sys_iopl /* 110 */ + .long ptregs_iopl /* 110 */ .long sys_vhangup .long sys_ni_syscall /* old "idle" system call */ - .long sys_vm86old + .long ptregs_vm86old .long sys_wait4 .long sys_swapoff /* 115 */ .long sys_sysinfo .long sys_ipc .long sys_fsync - .long sys_sigreturn - .long sys_clone /* 120 */ + .long ptregs_sigreturn + .long ptregs_clone /* 120 */ .long sys_setdomainname .long sys_newuname .long sys_modify_ldt @@ -165,14 +165,14 @@ ENTRY(sys_call_table) .long sys_mremap .long sys_setresuid16 .long sys_getresuid16 /* 165 */ - .long sys_vm86 + .long ptregs_vm86 .long sys_ni_syscall /* Old sys_query_module */ .long sys_poll .long sys_nfsservctl .long sys_setresgid16 /* 170 */ .long sys_getresgid16 .long sys_prctl - .long sys_rt_sigreturn + .long ptregs_rt_sigreturn .long sys_rt_sigaction .long sys_rt_sigprocmask /* 175 */ .long sys_rt_sigpending @@ -185,11 +185,11 @@ ENTRY(sys_call_table) .long sys_getcwd .long sys_capget .long sys_capset /* 185 */ - .long sys_sigaltstack + .long ptregs_sigaltstack .long sys_sendfile .long sys_ni_syscall /* reserved for streams1 */ .long sys_ni_syscall /* reserved for streams2 */ - .long sys_vfork /* 190 */ + .long ptregs_vfork /* 190 */ .long sys_getrlimit .long sys_mmap2 .long sys_truncate64 @@ -332,3 +332,4 @@ ENTRY(sys_call_table) .long sys_dup3 /* 330 */ .long sys_pipe2 .long sys_inotify_init1 + .long sys_perf_counter_open Index: linux-2.6-tip/arch/x86/kernel/time_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/time_32.c +++ linux-2.6-tip/arch/x86/kernel/time_32.c @@ -33,12 +33,12 @@ #include #include -#include +#include #include #include #include -#include "do_timer.h" +#include int timer_ack; @@ -118,7 +118,7 @@ void __init hpet_time_init(void) { if (!hpet_enable()) setup_pit_timer(); - time_init_hook(); + x86_quirk_time_init(); } /* @@ -131,7 +131,7 @@ void __init hpet_time_init(void) */ void __init time_init(void) { - pre_time_init_hook(); + x86_quirk_pre_time_init(); tsc_init(); late_time_init = choose_time_init(); } Index: linux-2.6-tip/arch/x86/kernel/tlb_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/tlb_32.c +++ /dev/null @@ -1,256 +0,0 @@ -#include -#include -#include - -#include - -DEFINE_PER_CPU(struct tlb_state, cpu_tlbstate) - ____cacheline_aligned = { &init_mm, 0, }; - -/* must come after the send_IPI functions above for inlining */ -#include - -/* - * Smarter SMP flushing macros. - * c/o Linus Torvalds. - * - * These mean you can really definitely utterly forget about - * writing to user space from interrupts. (Its not allowed anyway). - * - * Optimizations Manfred Spraul - */ - -static cpumask_t flush_cpumask; -static struct mm_struct *flush_mm; -static unsigned long flush_va; -static DEFINE_SPINLOCK(tlbstate_lock); - -/* - * We cannot call mmdrop() because we are in interrupt context, - * instead update mm->cpu_vm_mask. - * - * We need to reload %cr3 since the page tables may be going - * away from under us.. - */ -void leave_mm(int cpu) -{ - BUG_ON(x86_read_percpu(cpu_tlbstate.state) == TLBSTATE_OK); - cpu_clear(cpu, x86_read_percpu(cpu_tlbstate.active_mm)->cpu_vm_mask); - load_cr3(swapper_pg_dir); -} -EXPORT_SYMBOL_GPL(leave_mm); - -/* - * - * The flush IPI assumes that a thread switch happens in this order: - * [cpu0: the cpu that switches] - * 1) switch_mm() either 1a) or 1b) - * 1a) thread switch to a different mm - * 1a1) cpu_clear(cpu, old_mm->cpu_vm_mask); - * Stop ipi delivery for the old mm. This is not synchronized with - * the other cpus, but smp_invalidate_interrupt ignore flush ipis - * for the wrong mm, and in the worst case we perform a superfluous - * tlb flush. - * 1a2) set cpu_tlbstate to TLBSTATE_OK - * Now the smp_invalidate_interrupt won't call leave_mm if cpu0 - * was in lazy tlb mode. - * 1a3) update cpu_tlbstate[].active_mm - * Now cpu0 accepts tlb flushes for the new mm. - * 1a4) cpu_set(cpu, new_mm->cpu_vm_mask); - * Now the other cpus will send tlb flush ipis. - * 1a4) change cr3. - * 1b) thread switch without mm change - * cpu_tlbstate[].active_mm is correct, cpu0 already handles - * flush ipis. - * 1b1) set cpu_tlbstate to TLBSTATE_OK - * 1b2) test_and_set the cpu bit in cpu_vm_mask. - * Atomically set the bit [other cpus will start sending flush ipis], - * and test the bit. - * 1b3) if the bit was 0: leave_mm was called, flush the tlb. - * 2) switch %%esp, ie current - * - * The interrupt must handle 2 special cases: - * - cr3 is changed before %%esp, ie. it cannot use current->{active_,}mm. - * - the cpu performs speculative tlb reads, i.e. even if the cpu only - * runs in kernel space, the cpu could load tlb entries for user space - * pages. - * - * The good news is that cpu_tlbstate is local to each cpu, no - * write/read ordering problems. - */ - -/* - * TLB flush IPI: - * - * 1) Flush the tlb entries if the cpu uses the mm that's being flushed. - * 2) Leave the mm if we are in the lazy tlb mode. - */ - -void smp_invalidate_interrupt(struct pt_regs *regs) -{ - unsigned long cpu; - - cpu = get_cpu(); - - if (!cpu_isset(cpu, flush_cpumask)) - goto out; - /* - * This was a BUG() but until someone can quote me the - * line from the intel manual that guarantees an IPI to - * multiple CPUs is retried _only_ on the erroring CPUs - * its staying as a return - * - * BUG(); - */ - - if (flush_mm == x86_read_percpu(cpu_tlbstate.active_mm)) { - if (x86_read_percpu(cpu_tlbstate.state) == TLBSTATE_OK) { - if (flush_va == TLB_FLUSH_ALL) - local_flush_tlb(); - else - __flush_tlb_one(flush_va); - } else - leave_mm(cpu); - } - ack_APIC_irq(); - smp_mb__before_clear_bit(); - cpu_clear(cpu, flush_cpumask); - smp_mb__after_clear_bit(); -out: - put_cpu_no_resched(); - inc_irq_stat(irq_tlb_count); -} - -void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm, - unsigned long va) -{ - cpumask_t cpumask = *cpumaskp; - - /* - * A couple of (to be removed) sanity checks: - * - * - current CPU must not be in mask - * - mask must exist :) - */ - BUG_ON(cpus_empty(cpumask)); - BUG_ON(cpu_isset(smp_processor_id(), cpumask)); - BUG_ON(!mm); - -#ifdef CONFIG_HOTPLUG_CPU - /* If a CPU which we ran on has gone down, OK. */ - cpus_and(cpumask, cpumask, cpu_online_map); - if (unlikely(cpus_empty(cpumask))) - return; -#endif - - /* - * i'm not happy about this global shared spinlock in the - * MM hot path, but we'll see how contended it is. - * AK: x86-64 has a faster method that could be ported. - */ - spin_lock(&tlbstate_lock); - - flush_mm = mm; - flush_va = va; - cpus_or(flush_cpumask, cpumask, flush_cpumask); - - /* - * Make the above memory operations globally visible before - * sending the IPI. - */ - smp_mb(); - /* - * We have to send the IPI only to - * CPUs affected. - */ - send_IPI_mask(&cpumask, INVALIDATE_TLB_VECTOR); - - while (!cpus_empty(flush_cpumask)) - /* nothing. lockup detection does not belong here */ - cpu_relax(); - - flush_mm = NULL; - flush_va = 0; - spin_unlock(&tlbstate_lock); -} - -void flush_tlb_current_task(void) -{ - struct mm_struct *mm = current->mm; - cpumask_t cpu_mask; - - preempt_disable(); - cpu_mask = mm->cpu_vm_mask; - cpu_clear(smp_processor_id(), cpu_mask); - - local_flush_tlb(); - if (!cpus_empty(cpu_mask)) - flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL); - preempt_enable(); -} - -void flush_tlb_mm(struct mm_struct *mm) -{ - cpumask_t cpu_mask; - - preempt_disable(); - cpu_mask = mm->cpu_vm_mask; - cpu_clear(smp_processor_id(), cpu_mask); - - if (current->active_mm == mm) { - if (current->mm) - local_flush_tlb(); - else - leave_mm(smp_processor_id()); - } - if (!cpus_empty(cpu_mask)) - flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL); - - preempt_enable(); -} - -void flush_tlb_page(struct vm_area_struct *vma, unsigned long va) -{ - struct mm_struct *mm = vma->vm_mm; - cpumask_t cpu_mask; - - preempt_disable(); - cpu_mask = mm->cpu_vm_mask; - cpu_clear(smp_processor_id(), cpu_mask); - - if (current->active_mm == mm) { - if (current->mm) - __flush_tlb_one(va); - else - leave_mm(smp_processor_id()); - } - - if (!cpus_empty(cpu_mask)) - flush_tlb_others(cpu_mask, mm, va); - - preempt_enable(); -} -EXPORT_SYMBOL(flush_tlb_page); - -static void do_flush_tlb_all(void *info) -{ - unsigned long cpu = smp_processor_id(); - - __flush_tlb_all(); - if (x86_read_percpu(cpu_tlbstate.state) == TLBSTATE_LAZY) - leave_mm(cpu); -} - -void flush_tlb_all(void) -{ - on_each_cpu(do_flush_tlb_all, NULL, 1); -} - -void reset_lazy_tlbstate(void) -{ - int cpu = raw_smp_processor_id(); - - per_cpu(cpu_tlbstate, cpu).state = 0; - per_cpu(cpu_tlbstate, cpu).active_mm = &init_mm; -} - Index: linux-2.6-tip/arch/x86/kernel/tlb_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/tlb_64.c +++ /dev/null @@ -1,284 +0,0 @@ -#include - -#include -#include -#include -#include -#include -#include -#include - -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -/* - * Smarter SMP flushing macros. - * c/o Linus Torvalds. - * - * These mean you can really definitely utterly forget about - * writing to user space from interrupts. (Its not allowed anyway). - * - * Optimizations Manfred Spraul - * - * More scalable flush, from Andi Kleen - * - * To avoid global state use 8 different call vectors. - * Each CPU uses a specific vector to trigger flushes on other - * CPUs. Depending on the received vector the target CPUs look into - * the right per cpu variable for the flush data. - * - * With more than 8 CPUs they are hashed to the 8 available - * vectors. The limited global vector space forces us to this right now. - * In future when interrupts are split into per CPU domains this could be - * fixed, at the cost of triggering multiple IPIs in some cases. - */ - -union smp_flush_state { - struct { - cpumask_t flush_cpumask; - struct mm_struct *flush_mm; - unsigned long flush_va; - spinlock_t tlbstate_lock; - }; - char pad[SMP_CACHE_BYTES]; -} ____cacheline_aligned; - -/* State is put into the per CPU data section, but padded - to a full cache line because other CPUs can access it and we don't - want false sharing in the per cpu data segment. */ -static DEFINE_PER_CPU(union smp_flush_state, flush_state); - -/* - * We cannot call mmdrop() because we are in interrupt context, - * instead update mm->cpu_vm_mask. - */ -void leave_mm(int cpu) -{ - if (read_pda(mmu_state) == TLBSTATE_OK) - BUG(); - cpu_clear(cpu, read_pda(active_mm)->cpu_vm_mask); - load_cr3(swapper_pg_dir); -} -EXPORT_SYMBOL_GPL(leave_mm); - -/* - * - * The flush IPI assumes that a thread switch happens in this order: - * [cpu0: the cpu that switches] - * 1) switch_mm() either 1a) or 1b) - * 1a) thread switch to a different mm - * 1a1) cpu_clear(cpu, old_mm->cpu_vm_mask); - * Stop ipi delivery for the old mm. This is not synchronized with - * the other cpus, but smp_invalidate_interrupt ignore flush ipis - * for the wrong mm, and in the worst case we perform a superfluous - * tlb flush. - * 1a2) set cpu mmu_state to TLBSTATE_OK - * Now the smp_invalidate_interrupt won't call leave_mm if cpu0 - * was in lazy tlb mode. - * 1a3) update cpu active_mm - * Now cpu0 accepts tlb flushes for the new mm. - * 1a4) cpu_set(cpu, new_mm->cpu_vm_mask); - * Now the other cpus will send tlb flush ipis. - * 1a4) change cr3. - * 1b) thread switch without mm change - * cpu active_mm is correct, cpu0 already handles - * flush ipis. - * 1b1) set cpu mmu_state to TLBSTATE_OK - * 1b2) test_and_set the cpu bit in cpu_vm_mask. - * Atomically set the bit [other cpus will start sending flush ipis], - * and test the bit. - * 1b3) if the bit was 0: leave_mm was called, flush the tlb. - * 2) switch %%esp, ie current - * - * The interrupt must handle 2 special cases: - * - cr3 is changed before %%esp, ie. it cannot use current->{active_,}mm. - * - the cpu performs speculative tlb reads, i.e. even if the cpu only - * runs in kernel space, the cpu could load tlb entries for user space - * pages. - * - * The good news is that cpu mmu_state is local to each cpu, no - * write/read ordering problems. - */ - -/* - * TLB flush IPI: - * - * 1) Flush the tlb entries if the cpu uses the mm that's being flushed. - * 2) Leave the mm if we are in the lazy tlb mode. - * - * Interrupts are disabled. - */ - -asmlinkage void smp_invalidate_interrupt(struct pt_regs *regs) -{ - int cpu; - int sender; - union smp_flush_state *f; - - cpu = smp_processor_id(); - /* - * orig_rax contains the negated interrupt vector. - * Use that to determine where the sender put the data. - */ - sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START; - f = &per_cpu(flush_state, sender); - - if (!cpu_isset(cpu, f->flush_cpumask)) - goto out; - /* - * This was a BUG() but until someone can quote me the - * line from the intel manual that guarantees an IPI to - * multiple CPUs is retried _only_ on the erroring CPUs - * its staying as a return - * - * BUG(); - */ - - if (f->flush_mm == read_pda(active_mm)) { - if (read_pda(mmu_state) == TLBSTATE_OK) { - if (f->flush_va == TLB_FLUSH_ALL) - local_flush_tlb(); - else - __flush_tlb_one(f->flush_va); - } else - leave_mm(cpu); - } -out: - ack_APIC_irq(); - cpu_clear(cpu, f->flush_cpumask); - inc_irq_stat(irq_tlb_count); -} - -void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm, - unsigned long va) -{ - int sender; - union smp_flush_state *f; - cpumask_t cpumask = *cpumaskp; - - if (is_uv_system() && uv_flush_tlb_others(&cpumask, mm, va)) - return; - - /* Caller has disabled preemption */ - sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS; - f = &per_cpu(flush_state, sender); - - /* - * Could avoid this lock when - * num_online_cpus() <= NUM_INVALIDATE_TLB_VECTORS, but it is - * probably not worth checking this for a cache-hot lock. - */ - spin_lock(&f->tlbstate_lock); - - f->flush_mm = mm; - f->flush_va = va; - cpus_or(f->flush_cpumask, cpumask, f->flush_cpumask); - - /* - * Make the above memory operations globally visible before - * sending the IPI. - */ - smp_mb(); - /* - * We have to send the IPI only to - * CPUs affected. - */ - send_IPI_mask(&cpumask, INVALIDATE_TLB_VECTOR_START + sender); - - while (!cpus_empty(f->flush_cpumask)) - cpu_relax(); - - f->flush_mm = NULL; - f->flush_va = 0; - spin_unlock(&f->tlbstate_lock); -} - -static int __cpuinit init_smp_flush(void) -{ - int i; - - for_each_possible_cpu(i) - spin_lock_init(&per_cpu(flush_state, i).tlbstate_lock); - - return 0; -} -core_initcall(init_smp_flush); - -void flush_tlb_current_task(void) -{ - struct mm_struct *mm = current->mm; - cpumask_t cpu_mask; - - preempt_disable(); - cpu_mask = mm->cpu_vm_mask; - cpu_clear(smp_processor_id(), cpu_mask); - - local_flush_tlb(); - if (!cpus_empty(cpu_mask)) - flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL); - preempt_enable(); -} - -void flush_tlb_mm(struct mm_struct *mm) -{ - cpumask_t cpu_mask; - - preempt_disable(); - cpu_mask = mm->cpu_vm_mask; - cpu_clear(smp_processor_id(), cpu_mask); - - if (current->active_mm == mm) { - if (current->mm) - local_flush_tlb(); - else - leave_mm(smp_processor_id()); - } - if (!cpus_empty(cpu_mask)) - flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL); - - preempt_enable(); -} - -void flush_tlb_page(struct vm_area_struct *vma, unsigned long va) -{ - struct mm_struct *mm = vma->vm_mm; - cpumask_t cpu_mask; - - preempt_disable(); - cpu_mask = mm->cpu_vm_mask; - cpu_clear(smp_processor_id(), cpu_mask); - - if (current->active_mm == mm) { - if (current->mm) - __flush_tlb_one(va); - else - leave_mm(smp_processor_id()); - } - - if (!cpus_empty(cpu_mask)) - flush_tlb_others(cpu_mask, mm, va); - - preempt_enable(); -} - -static void do_flush_tlb_all(void *info) -{ - unsigned long cpu = smp_processor_id(); - - __flush_tlb_all(); - if (read_pda(mmu_state) == TLBSTATE_LAZY) - leave_mm(cpu); -} - -void flush_tlb_all(void) -{ - on_each_cpu(do_flush_tlb_all, NULL, 1); -} Index: linux-2.6-tip/arch/x86/kernel/tlb_uv.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/tlb_uv.c +++ linux-2.6-tip/arch/x86/kernel/tlb_uv.c @@ -11,16 +11,15 @@ #include #include +#include #include #include #include -#include +#include #include #include #include -#include - static struct bau_control **uv_bau_table_bases __read_mostly; static int uv_bau_retry_limit __read_mostly; @@ -210,14 +209,15 @@ static int uv_wait_completion(struct bau * * Send a broadcast and wait for a broadcast message to complete. * - * The cpumaskp mask contains the cpus the broadcast was sent to. + * The flush_mask contains the cpus the broadcast was sent to. * - * Returns 1 if all remote flushing was done. The mask is zeroed. - * Returns 0 if some remote flushing remains to be done. The mask is left - * unchanged. - */ -int uv_flush_send_and_wait(int cpu, int this_blade, struct bau_desc *bau_desc, - cpumask_t *cpumaskp) + * Returns NULL if all remote flushing was done. The mask is zeroed. + * Returns @flush_mask if some remote flushing remains to be done. The + * mask will have some bits still set. + */ +const struct cpumask *uv_flush_send_and_wait(int cpu, int this_blade, + struct bau_desc *bau_desc, + struct cpumask *flush_mask) { int completion_status = 0; int right_shift; @@ -257,66 +257,76 @@ int uv_flush_send_and_wait(int cpu, int * the cpu's, all of which are still in the mask. */ __get_cpu_var(ptcstats).ptc_i++; - return 0; + return flush_mask; } /* * Success, so clear the remote cpu's from the mask so we don't * use the IPI method of shootdown on them. */ - for_each_cpu_mask(bit, *cpumaskp) { + for_each_cpu(bit, flush_mask) { blade = uv_cpu_to_blade_id(bit); if (blade == this_blade) continue; - cpu_clear(bit, *cpumaskp); + cpumask_clear_cpu(bit, flush_mask); } - if (!cpus_empty(*cpumaskp)) - return 0; - return 1; + if (!cpumask_empty(flush_mask)) + return flush_mask; + return NULL; } /** * uv_flush_tlb_others - globally purge translation cache of a virtual * address or all TLB's - * @cpumaskp: mask of all cpu's in which the address is to be removed + * @cpumask: mask of all cpu's in which the address is to be removed * @mm: mm_struct containing virtual address range * @va: virtual address to be removed (or TLB_FLUSH_ALL for all TLB's on cpu) + * @cpu: the current cpu * * This is the entry point for initiating any UV global TLB shootdown. * * Purges the translation caches of all specified processors of the given * virtual address, or purges all TLB's on specified processors. * - * The caller has derived the cpumaskp from the mm_struct and has subtracted - * the local cpu from the mask. This function is called only if there - * are bits set in the mask. (e.g. flush_tlb_page()) + * The caller has derived the cpumask from the mm_struct. This function + * is called only if there are bits set in the mask. (e.g. flush_tlb_page()) * - * The cpumaskp is converted into a nodemask of the nodes containing + * The cpumask is converted into a nodemask of the nodes containing * the cpus. * - * Returns 1 if all remote flushing was done. - * Returns 0 if some remote flushing remains to be done. - */ -int uv_flush_tlb_others(cpumask_t *cpumaskp, struct mm_struct *mm, - unsigned long va) + * Note that this function should be called with preemption disabled. + * + * Returns NULL if all remote flushing was done. + * Returns pointer to cpumask if some remote flushing remains to be + * done. The returned pointer is valid till preemption is re-enabled. + */ +const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask, + struct mm_struct *mm, + unsigned long va, unsigned int cpu) { + static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask); + struct cpumask *flush_mask = &__get_cpu_var(flush_tlb_mask); int i; int bit; int blade; - int cpu; + int uv_cpu; int this_blade; int locals = 0; struct bau_desc *bau_desc; - cpu = uv_blade_processor_id(); + WARN_ON(!in_atomic()); + + cpumask_andnot(flush_mask, cpumask, cpumask_of(cpu)); + + uv_cpu = uv_blade_processor_id(); this_blade = uv_numa_blade_id(); bau_desc = __get_cpu_var(bau_control).descriptor_base; - bau_desc += UV_ITEMS_PER_DESCRIPTOR * cpu; + bau_desc += UV_ITEMS_PER_DESCRIPTOR * uv_cpu; bau_nodes_clear(&bau_desc->distribution, UV_DISTRIBUTION_SIZE); i = 0; - for_each_cpu_mask(bit, *cpumaskp) { + for_each_cpu(bit, flush_mask) { blade = uv_cpu_to_blade_id(bit); BUG_ON(blade > (UV_DISTRIBUTION_SIZE - 1)); if (blade == this_blade) { @@ -331,17 +341,17 @@ int uv_flush_tlb_others(cpumask_t *cpuma * no off_node flushing; return status for local node */ if (locals) - return 0; + return flush_mask; else - return 1; + return NULL; } __get_cpu_var(ptcstats).requestor++; __get_cpu_var(ptcstats).ntargeted += i; bau_desc->payload.address = va; - bau_desc->payload.sending_cpu = smp_processor_id(); + bau_desc->payload.sending_cpu = cpu; - return uv_flush_send_and_wait(cpu, this_blade, bau_desc, cpumaskp); + return uv_flush_send_and_wait(uv_cpu, this_blade, bau_desc, flush_mask); } /* Index: linux-2.6-tip/arch/x86/kernel/trampoline_32.S =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/trampoline_32.S +++ linux-2.6-tip/arch/x86/kernel/trampoline_32.S @@ -29,7 +29,7 @@ #include #include -#include +#include /* We can free up trampoline after bootup if cpu hotplug is not supported. */ #ifndef CONFIG_HOTPLUG_CPU Index: linux-2.6-tip/arch/x86/kernel/trampoline_64.S =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/trampoline_64.S +++ linux-2.6-tip/arch/x86/kernel/trampoline_64.S @@ -25,10 +25,11 @@ */ #include -#include -#include +#include +#include #include #include +#include .section .rodata, "a", @progbits @@ -37,7 +38,7 @@ ENTRY(trampoline_data) r_base = . cli # We should be safe anyway - wbinvd + wbinvd mov %cs, %ax # Code and data in the same place mov %ax, %ds mov %ax, %es @@ -73,9 +74,8 @@ r_base = . lidtl tidt - r_base # load idt with 0, 0 lgdtl tgdt - r_base # load gdt with whatever is appropriate - xor %ax, %ax - inc %ax # protected mode (PE) bit - lmsw %ax # into protected mode + mov $X86_CR0_PE, %ax # protected mode (PE) bit + lmsw %ax # into protected mode # flush prefetch and jump to startup_32 ljmpl *(startup_32_vector - r_base) @@ -86,9 +86,8 @@ startup_32: movl $__KERNEL_DS, %eax # Initialize the %ds segment register movl %eax, %ds - xorl %eax, %eax - btsl $5, %eax # Enable PAE mode - movl %eax, %cr4 + movl $X86_CR4_PAE, %eax + movl %eax, %cr4 # Enable PAE mode # Setup trampoline 4 level pagetables leal (trampoline_level4_pgt - r_base)(%esi), %eax @@ -99,9 +98,9 @@ startup_32: xorl %edx, %edx wrmsr - xorl %eax, %eax - btsl $31, %eax # Enable paging and in turn activate Long Mode - btsl $0, %eax # Enable protected mode + # Enable paging and in turn activate Long Mode + # Enable protected mode + movl $(X86_CR0_PG | X86_CR0_PE), %eax movl %eax, %cr0 /* Index: linux-2.6-tip/arch/x86/kernel/traps.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/traps.c +++ linux-2.6-tip/arch/x86/kernel/traps.c @@ -46,6 +46,7 @@ #endif #include +#include #include #include #include @@ -54,15 +55,14 @@ #include #include -#include +#include #ifdef CONFIG_X86_64 #include #include -#include #else #include -#include +#include #include #include "cpu/mcheck/mce.h" @@ -92,9 +92,10 @@ static inline void conditional_sti(struc local_irq_enable(); } -static inline void preempt_conditional_sti(struct pt_regs *regs) +static inline void preempt_conditional_sti(struct pt_regs *regs, int stack) { - inc_preempt_count(); + if (stack) + inc_preempt_count(); if (regs->flags & X86_EFLAGS_IF) local_irq_enable(); } @@ -105,11 +106,12 @@ static inline void conditional_cli(struc local_irq_disable(); } -static inline void preempt_conditional_cli(struct pt_regs *regs) +static inline void preempt_conditional_cli(struct pt_regs *regs, int stack) { if (regs->flags & X86_EFLAGS_IF) local_irq_disable(); - dec_preempt_count(); + if (stack) + dec_preempt_count(); } #ifdef CONFIG_X86_32 @@ -277,9 +279,9 @@ dotraplinkage void do_stack_segment(stru if (notify_die(DIE_TRAP, "stack segment", regs, error_code, 12, SIGBUS) == NOTIFY_STOP) return; - preempt_conditional_sti(regs); + preempt_conditional_sti(regs, STACKFAULT_STACK); do_trap(12, SIGBUS, "stack segment", regs, error_code, NULL); - preempt_conditional_cli(regs); + preempt_conditional_cli(regs, STACKFAULT_STACK); } dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code) @@ -517,9 +519,9 @@ dotraplinkage void __kprobes do_int3(str return; #endif - preempt_conditional_sti(regs); + preempt_conditional_sti(regs, DEBUG_STACK); do_trap(3, SIGTRAP, "int3", regs, error_code, NULL); - preempt_conditional_cli(regs); + preempt_conditional_cli(regs, DEBUG_STACK); } #ifdef CONFIG_X86_64 @@ -581,6 +583,10 @@ dotraplinkage void __kprobes do_debug(st get_debugreg(condition, 6); + /* Catch kmemcheck conditions first of all! */ + if (condition & DR_STEP && kmemcheck_trap(regs)) + return; + /* * The processor cleared BTF, so don't mark that we need it set. */ @@ -592,7 +598,7 @@ dotraplinkage void __kprobes do_debug(st return; /* It's safe to allow irq's after DR6 has been saved */ - preempt_conditional_sti(regs); + preempt_conditional_sti(regs, DEBUG_STACK); /* Mask out spurious debug traps due to lazy DR7 setting */ if (condition & (DR_TRAP0|DR_TRAP1|DR_TRAP2|DR_TRAP3)) { @@ -627,7 +633,7 @@ dotraplinkage void __kprobes do_debug(st */ clear_dr7: set_debugreg(0, 7); - preempt_conditional_cli(regs); + preempt_conditional_cli(regs, DEBUG_STACK); return; #ifdef CONFIG_X86_32 @@ -642,7 +648,7 @@ debug_vm86: clear_TF_reenable: set_tsk_thread_flag(tsk, TIF_SINGLESTEP); regs->flags &= ~X86_EFLAGS_TF; - preempt_conditional_cli(regs); + preempt_conditional_cli(regs, DEBUG_STACK); return; } @@ -914,19 +920,20 @@ void math_emulate(struct math_emu_info * } #endif /* CONFIG_MATH_EMULATION */ -dotraplinkage void __kprobes do_device_not_available(struct pt_regs regs) +dotraplinkage void __kprobes +do_device_not_available(struct pt_regs *regs, long error_code) { #ifdef CONFIG_X86_32 if (read_cr0() & X86_CR0_EM) { struct math_emu_info info = { }; - conditional_sti(®s); + conditional_sti(regs); - info.regs = ®s; + info.regs = regs; math_emulate(&info); } else { math_state_restore(); /* interrupts still off */ - conditional_sti(®s); + conditional_sti(regs); } #else math_state_restore(); @@ -942,7 +949,7 @@ dotraplinkage void do_iret_error(struct info.si_signo = SIGILL; info.si_errno = 0; info.si_code = ILL_BADSTK; - info.si_addr = 0; + info.si_addr = NULL; if (notify_die(DIE_TRAP, "iret exception", regs, error_code, 32, SIGILL) == NOTIFY_STOP) return; @@ -991,8 +998,13 @@ void __init trap_init(void) #endif set_intr_gate(19, &simd_coprocessor_error); + /* Reserve all the builtin and the syscall vector: */ + for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++) + set_bit(i, used_vectors); + #ifdef CONFIG_IA32_EMULATION set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall); + set_bit(IA32_SYSCALL_VECTOR, used_vectors); #endif #ifdef CONFIG_X86_32 @@ -1009,23 +1021,15 @@ void __init trap_init(void) } set_system_trap_gate(SYSCALL_VECTOR, &system_call); -#endif - - /* Reserve all the builtin and the syscall vector: */ - for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++) - set_bit(i, used_vectors); - -#ifdef CONFIG_X86_64 - set_bit(IA32_SYSCALL_VECTOR, used_vectors); -#else set_bit(SYSCALL_VECTOR, used_vectors); #endif + /* * Should be a barrier for any external CPU state: */ cpu_init(); #ifdef CONFIG_X86_32 - trap_init_hook(); + x86_quirk_trap_init(); #endif } Index: linux-2.6-tip/arch/x86/kernel/tsc.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/tsc.c +++ linux-2.6-tip/arch/x86/kernel/tsc.c @@ -773,7 +773,7 @@ __cpuinit int unsynchronized_tsc(void) if (!cpu_has_tsc || tsc_unstable) return 1; -#ifdef CONFIG_X86_SMP +#ifdef CONFIG_SMP if (apic_is_clustered_box()) return 1; #endif Index: linux-2.6-tip/arch/x86/kernel/visws_quirks.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/visws_quirks.c +++ linux-2.6-tip/arch/x86/kernel/visws_quirks.c @@ -24,18 +24,14 @@ #include #include -#include #include #include #include #include +#include #include #include -#include - -#include "mach_apic.h" - #include #include @@ -49,8 +45,6 @@ extern int no_broadcast; -#include - char visws_board_type = -1; char visws_board_rev = -1; @@ -200,7 +194,7 @@ static void __init MP_processor_info(str return; } - apic_cpus = apicid_to_cpu_present(m->apicid); + apic_cpus = apic->apicid_to_cpu_present(m->apicid); physids_or(phys_cpu_present_map, phys_cpu_present_map, apic_cpus); /* * Validate version @@ -649,11 +643,13 @@ out_unlock: static struct irqaction master_action = { .handler = piix4_master_intr, .name = "PIIX4-8259", + .flags = IRQF_NODELAY, }; static struct irqaction cascade_action = { .handler = no_action, .name = "cascade", + .flags = IRQF_NODELAY, }; Index: linux-2.6-tip/arch/x86/kernel/vm86_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/vm86_32.c +++ linux-2.6-tip/arch/x86/kernel/vm86_32.c @@ -137,6 +137,7 @@ struct pt_regs *save_v86_state(struct ke local_irq_enable(); if (!current->thread.vm86_info) { + local_irq_disable(); printk("no vm86_info: BAD\n"); do_exit(SIGSEGV); } @@ -158,7 +159,7 @@ struct pt_regs *save_v86_state(struct ke ret = KVM86->regs32; ret->fs = current->thread.saved_fs; - loadsegment(gs, current->thread.saved_gs); + set_user_gs(ret, current->thread.saved_gs); return ret; } @@ -197,9 +198,9 @@ out: static int do_vm86_irq_handling(int subfunction, int irqnumber); static void do_sys_vm86(struct kernel_vm86_struct *info, struct task_struct *tsk); -asmlinkage int sys_vm86old(struct pt_regs regs) +int sys_vm86old(struct pt_regs *regs) { - struct vm86_struct __user *v86 = (struct vm86_struct __user *)regs.bx; + struct vm86_struct __user *v86 = (struct vm86_struct __user *)regs->bx; struct kernel_vm86_struct info; /* declare this _on top_, * this avoids wasting of stack space. * This remains on the stack until we @@ -218,7 +219,7 @@ asmlinkage int sys_vm86old(struct pt_reg if (tmp) goto out; memset(&info.vm86plus, 0, (int)&info.regs32 - (int)&info.vm86plus); - info.regs32 = ®s; + info.regs32 = regs; tsk->thread.vm86_info = v86; do_sys_vm86(&info, tsk); ret = 0; /* we never return here */ @@ -227,7 +228,7 @@ out: } -asmlinkage int sys_vm86(struct pt_regs regs) +int sys_vm86(struct pt_regs *regs) { struct kernel_vm86_struct info; /* declare this _on top_, * this avoids wasting of stack space. @@ -239,12 +240,12 @@ asmlinkage int sys_vm86(struct pt_regs r struct vm86plus_struct __user *v86; tsk = current; - switch (regs.bx) { + switch (regs->bx) { case VM86_REQUEST_IRQ: case VM86_FREE_IRQ: case VM86_GET_IRQ_BITS: case VM86_GET_AND_RESET_IRQ: - ret = do_vm86_irq_handling(regs.bx, (int)regs.cx); + ret = do_vm86_irq_handling(regs->bx, (int)regs->cx); goto out; case VM86_PLUS_INSTALL_CHECK: /* @@ -261,14 +262,14 @@ asmlinkage int sys_vm86(struct pt_regs r ret = -EPERM; if (tsk->thread.saved_sp0) goto out; - v86 = (struct vm86plus_struct __user *)regs.cx; + v86 = (struct vm86plus_struct __user *)regs->cx; tmp = copy_vm86_regs_from_user(&info.regs, &v86->regs, offsetof(struct kernel_vm86_struct, regs32) - sizeof(info.regs)); ret = -EFAULT; if (tmp) goto out; - info.regs32 = ®s; + info.regs32 = regs; info.vm86plus.is_vm86pus = 1; tsk->thread.vm86_info = (struct vm86_struct __user *)v86; do_sys_vm86(&info, tsk); @@ -323,7 +324,7 @@ static void do_sys_vm86(struct kernel_vm info->regs32->ax = 0; tsk->thread.saved_sp0 = tsk->thread.sp0; tsk->thread.saved_fs = info->regs32->fs; - savesegment(gs, tsk->thread.saved_gs); + tsk->thread.saved_gs = get_user_gs(info->regs32); tss = &per_cpu(init_tss, get_cpu()); tsk->thread.sp0 = (unsigned long) &info->VM86_TSS_ESP0; Index: linux-2.6-tip/arch/x86/kernel/vmi_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/vmi_32.c +++ linux-2.6-tip/arch/x86/kernel/vmi_32.c @@ -680,10 +680,11 @@ static inline int __init activate_vmi(vo para_fill(pv_mmu_ops.write_cr2, SetCR2); para_fill(pv_mmu_ops.write_cr3, SetCR3); para_fill(pv_cpu_ops.write_cr4, SetCR4); - para_fill(pv_irq_ops.save_fl, GetInterruptMask); - para_fill(pv_irq_ops.restore_fl, SetInterruptMask); - para_fill(pv_irq_ops.irq_disable, DisableInterrupts); - para_fill(pv_irq_ops.irq_enable, EnableInterrupts); + + para_fill(pv_irq_ops.save_fl.func, GetInterruptMask); + para_fill(pv_irq_ops.restore_fl.func, SetInterruptMask); + para_fill(pv_irq_ops.irq_disable.func, DisableInterrupts); + para_fill(pv_irq_ops.irq_enable.func, EnableInterrupts); para_fill(pv_cpu_ops.wbinvd, WBINVD); para_fill(pv_cpu_ops.read_tsc, RDTSC); @@ -797,8 +798,8 @@ static inline int __init activate_vmi(vo #endif #ifdef CONFIG_X86_LOCAL_APIC - para_fill(apic_ops->read, APICRead); - para_fill(apic_ops->write, APICWrite); + para_fill(apic->read, APICRead); + para_fill(apic->write, APICWrite); #endif /* Index: linux-2.6-tip/arch/x86/kernel/vmiclock_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/vmiclock_32.c +++ linux-2.6-tip/arch/x86/kernel/vmiclock_32.c @@ -28,7 +28,6 @@ #include #include -#include #include #include #include @@ -256,7 +255,7 @@ void __devinit vmi_time_bsp_init(void) */ clockevents_notify(CLOCK_EVT_NOTIFY_SUSPEND, NULL); local_irq_disable(); -#ifdef CONFIG_X86_SMP +#ifdef CONFIG_SMP /* * XXX handle_percpu_irq only defined for SMP; we need to switch over * to using it, since this is a local interrupt, which each CPU must @@ -288,8 +287,7 @@ static struct clocksource clocksource_vm static cycle_t read_real_cycles(void) { cycle_t ret = (cycle_t)vmi_timer_ops.get_cycle_counter(VMI_CYCLES_REAL); - return ret >= clocksource_vmi.cycle_last ? - ret : clocksource_vmi.cycle_last; + return max(ret, clocksource_vmi.cycle_last); } static struct clocksource clocksource_vmi = { Index: linux-2.6-tip/arch/x86/kernel/vmlinux_32.lds.S =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/vmlinux_32.lds.S +++ linux-2.6-tip/arch/x86/kernel/vmlinux_32.lds.S @@ -12,7 +12,7 @@ #include #include -#include +#include #include #include @@ -178,14 +178,7 @@ SECTIONS __initramfs_end = .; } #endif - . = ALIGN(PAGE_SIZE); - .data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) { - __per_cpu_start = .; - *(.data.percpu.page_aligned) - *(.data.percpu) - *(.data.percpu.shared_aligned) - __per_cpu_end = .; - } + PERCPU(PAGE_SIZE) . = ALIGN(PAGE_SIZE); /* freed after init ends here */ Index: linux-2.6-tip/arch/x86/kernel/vmlinux_64.lds.S =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/vmlinux_64.lds.S +++ linux-2.6-tip/arch/x86/kernel/vmlinux_64.lds.S @@ -5,7 +5,8 @@ #define LOAD_OFFSET __START_KERNEL_map #include -#include +#include +#include #undef i386 /* in case the preprocessor is a 32bit one */ @@ -13,12 +14,15 @@ OUTPUT_FORMAT("elf64-x86-64", "elf64-x86 OUTPUT_ARCH(i386:x86-64) ENTRY(phys_startup_64) jiffies_64 = jiffies; -_proxy_pda = 1; PHDRS { text PT_LOAD FLAGS(5); /* R_E */ data PT_LOAD FLAGS(7); /* RWE */ user PT_LOAD FLAGS(7); /* RWE */ data.init PT_LOAD FLAGS(7); /* RWE */ +#ifdef CONFIG_SMP + percpu PT_LOAD FLAGS(7); /* RWE */ +#endif + data.init2 PT_LOAD FLAGS(7); /* RWE */ note PT_NOTE FLAGS(0); /* ___ */ } SECTIONS @@ -208,14 +212,28 @@ SECTIONS __initramfs_end = .; #endif +#ifdef CONFIG_SMP + /* + * percpu offsets are zero-based on SMP. PERCPU_VADDR() changes the + * output PHDR, so the next output section - __data_nosave - should + * start another section data.init2. Also, pda should be at the head of + * percpu area. Preallocate it and define the percpu offset symbol + * so that it can be accessed as a percpu variable. + */ + . = ALIGN(PAGE_SIZE); + PERCPU_VADDR(0, :percpu) +#else PERCPU(PAGE_SIZE) +#endif . = ALIGN(PAGE_SIZE); __init_end = .; . = ALIGN(PAGE_SIZE); __nosave_begin = .; - .data_nosave : AT(ADDR(.data_nosave) - LOAD_OFFSET) { *(.data.nosave) } + .data_nosave : AT(ADDR(.data_nosave) - LOAD_OFFSET) { + *(.data.nosave) + } :data.init2 /* use another section data.init2, see PERCPU_VADDR() above */ . = ALIGN(PAGE_SIZE); __nosave_end = .; @@ -239,8 +257,21 @@ SECTIONS DWARF_DEBUG } + /* + * Per-cpu symbols which need to be offset from __per_cpu_load + * for the boot processor. + */ +#define INIT_PER_CPU(x) init_per_cpu__##x = per_cpu__##x + __per_cpu_load +INIT_PER_CPU(gdt_page); +INIT_PER_CPU(irq_stack_union); + /* * Build-time check on the image size: */ ASSERT((_end - _text <= KERNEL_IMAGE_SIZE), "kernel image bigger than KERNEL_IMAGE_SIZE") + +#ifdef CONFIG_SMP +ASSERT((per_cpu__irq_stack_union == 0), + "irq_stack_union is not at start of per-cpu area"); +#endif Index: linux-2.6-tip/arch/x86/kernel/vsmp_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/vsmp_64.c +++ linux-2.6-tip/arch/x86/kernel/vsmp_64.c @@ -37,6 +37,7 @@ static unsigned long vsmp_save_fl(void) flags &= ~X86_EFLAGS_IF; return flags; } +PV_CALLEE_SAVE_REGS_THUNK(vsmp_save_fl); static void vsmp_restore_fl(unsigned long flags) { @@ -46,6 +47,7 @@ static void vsmp_restore_fl(unsigned lon flags |= X86_EFLAGS_AC; native_restore_fl(flags); } +PV_CALLEE_SAVE_REGS_THUNK(vsmp_restore_fl); static void vsmp_irq_disable(void) { @@ -53,6 +55,7 @@ static void vsmp_irq_disable(void) native_restore_fl((flags & ~X86_EFLAGS_IF) | X86_EFLAGS_AC); } +PV_CALLEE_SAVE_REGS_THUNK(vsmp_irq_disable); static void vsmp_irq_enable(void) { @@ -60,6 +63,7 @@ static void vsmp_irq_enable(void) native_restore_fl((flags | X86_EFLAGS_IF) & (~X86_EFLAGS_AC)); } +PV_CALLEE_SAVE_REGS_THUNK(vsmp_irq_enable); static unsigned __init_or_module vsmp_patch(u8 type, u16 clobbers, void *ibuf, unsigned long addr, unsigned len) @@ -90,10 +94,10 @@ static void __init set_vsmp_pv_ops(void) cap, ctl); if (cap & ctl & (1 << 4)) { /* Setup irq ops and turn on vSMP IRQ fastpath handling */ - pv_irq_ops.irq_disable = vsmp_irq_disable; - pv_irq_ops.irq_enable = vsmp_irq_enable; - pv_irq_ops.save_fl = vsmp_save_fl; - pv_irq_ops.restore_fl = vsmp_restore_fl; + pv_irq_ops.irq_disable = PV_CALLEE_SAVE(vsmp_irq_disable); + pv_irq_ops.irq_enable = PV_CALLEE_SAVE(vsmp_irq_enable); + pv_irq_ops.save_fl = PV_CALLEE_SAVE(vsmp_save_fl); + pv_irq_ops.restore_fl = PV_CALLEE_SAVE(vsmp_restore_fl); pv_init_ops.patch = vsmp_patch; ctl &= ~(1 << 4); Index: linux-2.6-tip/arch/x86/kernel/x8664_ksyms_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/x8664_ksyms_64.c +++ linux-2.6-tip/arch/x86/kernel/x8664_ksyms_64.c @@ -58,5 +58,3 @@ EXPORT_SYMBOL(__memcpy); EXPORT_SYMBOL(empty_zero_page); EXPORT_SYMBOL(init_level4_pgt); EXPORT_SYMBOL(load_gs_index); - -EXPORT_SYMBOL(_proxy_pda); Index: linux-2.6-tip/arch/x86/kvm/Kconfig =================================================================== --- linux-2.6-tip.orig/arch/x86/kvm/Kconfig +++ linux-2.6-tip/arch/x86/kvm/Kconfig @@ -55,7 +55,8 @@ config KVM_AMD config KVM_TRACE bool "KVM trace support" - depends on KVM && MARKERS && SYSFS + depends on KVM && SYSFS + select MARKERS select RELAY select DEBUG_FS default n Index: linux-2.6-tip/arch/x86/lguest/Kconfig =================================================================== --- linux-2.6-tip.orig/arch/x86/lguest/Kconfig +++ linux-2.6-tip/arch/x86/lguest/Kconfig @@ -3,7 +3,6 @@ config LGUEST_GUEST select PARAVIRT depends on X86_32 depends on !X86_PAE - depends on !X86_VOYAGER select VIRTIO select VIRTIO_RING select VIRTIO_CONSOLE Index: linux-2.6-tip/arch/x86/lguest/boot.c =================================================================== --- linux-2.6-tip.orig/arch/x86/lguest/boot.c +++ linux-2.6-tip/arch/x86/lguest/boot.c @@ -173,24 +173,29 @@ static unsigned long save_fl(void) { return lguest_data.irq_enabled; } +PV_CALLEE_SAVE_REGS_THUNK(save_fl); /* restore_flags() just sets the flags back to the value given. */ static void restore_fl(unsigned long flags) { lguest_data.irq_enabled = flags; } +PV_CALLEE_SAVE_REGS_THUNK(restore_fl); /* Interrupts go off... */ static void irq_disable(void) { lguest_data.irq_enabled = 0; } +PV_CALLEE_SAVE_REGS_THUNK(irq_disable); /* Interrupts go on... */ static void irq_enable(void) { lguest_data.irq_enabled = X86_EFLAGS_IF; } +PV_CALLEE_SAVE_REGS_THUNK(irq_enable); + /*:*/ /*M:003 Note that we don't check for outstanding interrupts when we re-enable * them (or when we unmask an interrupt). This seems to work for the moment, @@ -278,7 +283,7 @@ static void lguest_load_tls(struct threa /* There's one problem which normal hardware doesn't have: the Host * can't handle us removing entries we're currently using. So we clear * the GS register here: if it's needed it'll be reloaded anyway. */ - loadsegment(gs, 0); + lazy_load_gs(0); lazy_hcall(LHCALL_LOAD_TLS, __pa(&t->tls_array), cpu, 0); } @@ -823,13 +828,14 @@ static u32 lguest_apic_safe_wait_icr_idl return 0; } -static struct apic_ops lguest_basic_apic_ops = { - .read = lguest_apic_read, - .write = lguest_apic_write, - .icr_read = lguest_apic_icr_read, - .icr_write = lguest_apic_icr_write, - .wait_icr_idle = lguest_apic_wait_icr_idle, - .safe_wait_icr_idle = lguest_apic_safe_wait_icr_idle, +static void set_lguest_basic_apic_ops(void) +{ + apic->read = lguest_apic_read; + apic->write = lguest_apic_write; + apic->icr_read = lguest_apic_icr_read; + apic->icr_write = lguest_apic_icr_write; + apic->wait_icr_idle = lguest_apic_wait_icr_idle; + apic->safe_wait_icr_idle = lguest_apic_safe_wait_icr_idle; }; #endif @@ -984,10 +990,10 @@ __init void lguest_init(void) /* interrupt-related operations */ pv_irq_ops.init_IRQ = lguest_init_IRQ; - pv_irq_ops.save_fl = save_fl; - pv_irq_ops.restore_fl = restore_fl; - pv_irq_ops.irq_disable = irq_disable; - pv_irq_ops.irq_enable = irq_enable; + pv_irq_ops.save_fl = PV_CALLEE_SAVE(save_fl); + pv_irq_ops.restore_fl = PV_CALLEE_SAVE(restore_fl); + pv_irq_ops.irq_disable = PV_CALLEE_SAVE(irq_disable); + pv_irq_ops.irq_enable = PV_CALLEE_SAVE(irq_enable); pv_irq_ops.safe_halt = lguest_safe_halt; /* init-time operations */ @@ -1030,7 +1036,7 @@ __init void lguest_init(void) #ifdef CONFIG_X86_LOCAL_APIC /* apic read/write intercepts */ - apic_ops = &lguest_basic_apic_ops; + set_lguest_basic_apic_ops(); #endif /* time operations */ Index: linux-2.6-tip/arch/x86/lib/getuser.S =================================================================== --- linux-2.6-tip.orig/arch/x86/lib/getuser.S +++ linux-2.6-tip/arch/x86/lib/getuser.S @@ -28,7 +28,7 @@ #include #include -#include +#include #include #include #include Index: linux-2.6-tip/arch/x86/mach-default/Makefile =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-default/Makefile +++ /dev/null @@ -1,5 +0,0 @@ -# -# Makefile for the linux kernel. -# - -obj-y := setup.o Index: linux-2.6-tip/arch/x86/mach-default/setup.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-default/setup.c +++ /dev/null @@ -1,174 +0,0 @@ -/* - * Machine specific setup for generic - */ - -#include -#include -#include -#include -#include -#include -#include - -#include - -#ifdef CONFIG_HOTPLUG_CPU -#define DEFAULT_SEND_IPI (1) -#else -#define DEFAULT_SEND_IPI (0) -#endif - -int no_broadcast = DEFAULT_SEND_IPI; - -/** - * pre_intr_init_hook - initialisation prior to setting up interrupt vectors - * - * Description: - * Perform any necessary interrupt initialisation prior to setting up - * the "ordinary" interrupt call gates. For legacy reasons, the ISA - * interrupts should be initialised here if the machine emulates a PC - * in any way. - **/ -void __init pre_intr_init_hook(void) -{ - if (x86_quirks->arch_pre_intr_init) { - if (x86_quirks->arch_pre_intr_init()) - return; - } - init_ISA_irqs(); -} - -/* - * IRQ2 is cascade interrupt to second interrupt controller - */ -static struct irqaction irq2 = { - .handler = no_action, - .mask = CPU_MASK_NONE, - .name = "cascade", -}; - -/** - * intr_init_hook - post gate setup interrupt initialisation - * - * Description: - * Fill in any interrupts that may have been left out by the general - * init_IRQ() routine. interrupts having to do with the machine rather - * than the devices on the I/O bus (like APIC interrupts in intel MP - * systems) are started here. - **/ -void __init intr_init_hook(void) -{ - if (x86_quirks->arch_intr_init) { - if (x86_quirks->arch_intr_init()) - return; - } - if (!acpi_ioapic) - setup_irq(2, &irq2); - -} - -/** - * pre_setup_arch_hook - hook called prior to any setup_arch() execution - * - * Description: - * generally used to activate any machine specific identification - * routines that may be needed before setup_arch() runs. On Voyager - * this is used to get the board revision and type. - **/ -void __init pre_setup_arch_hook(void) -{ -} - -/** - * trap_init_hook - initialise system specific traps - * - * Description: - * Called as the final act of trap_init(). Used in VISWS to initialise - * the various board specific APIC traps. - **/ -void __init trap_init_hook(void) -{ - if (x86_quirks->arch_trap_init) { - if (x86_quirks->arch_trap_init()) - return; - } -} - -static struct irqaction irq0 = { - .handler = timer_interrupt, - .flags = IRQF_DISABLED | IRQF_NOBALANCING | IRQF_IRQPOLL | IRQF_TIMER, - .mask = CPU_MASK_NONE, - .name = "timer" -}; - -/** - * pre_time_init_hook - do any specific initialisations before. - * - **/ -void __init pre_time_init_hook(void) -{ - if (x86_quirks->arch_pre_time_init) - x86_quirks->arch_pre_time_init(); -} - -/** - * time_init_hook - do any specific initialisations for the system timer. - * - * Description: - * Must plug the system timer interrupt source at HZ into the IRQ listed - * in irq_vectors.h:TIMER_IRQ - **/ -void __init time_init_hook(void) -{ - if (x86_quirks->arch_time_init) { - /* - * A nonzero return code does not mean failure, it means - * that the architecture quirk does not want any - * generic (timer) setup to be performed after this: - */ - if (x86_quirks->arch_time_init()) - return; - } - - irq0.mask = cpumask_of_cpu(0); - setup_irq(0, &irq0); -} - -#ifdef CONFIG_MCA -/** - * mca_nmi_hook - hook into MCA specific NMI chain - * - * Description: - * The MCA (Microchannel Architecture) has an NMI chain for NMI sources - * along the MCA bus. Use this to hook into that chain if you will need - * it. - **/ -void mca_nmi_hook(void) -{ - /* - * If I recall correctly, there's a whole bunch of other things that - * we can do to check for NMI problems, but that's all I know about - * at the moment. - */ - pr_warning("NMI generated from unknown source!\n"); -} -#endif - -static __init int no_ipi_broadcast(char *str) -{ - get_option(&str, &no_broadcast); - pr_info("Using %s mode\n", - no_broadcast ? "No IPI Broadcast" : "IPI Broadcast"); - return 1; -} -__setup("no_ipi_broadcast=", no_ipi_broadcast); - -static int __init print_ipi_mode(void) -{ - pr_info("Using IPI %s mode\n", - no_broadcast ? "No-Shortcut" : "Shortcut"); - return 0; -} - -late_initcall(print_ipi_mode); - Index: linux-2.6-tip/arch/x86/mach-generic/Makefile =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-generic/Makefile +++ /dev/null @@ -1,11 +0,0 @@ -# -# Makefile for the generic architecture -# - -EXTRA_CFLAGS := -Iarch/x86/kernel - -obj-y := probe.o default.o -obj-$(CONFIG_X86_NUMAQ) += numaq.o -obj-$(CONFIG_X86_SUMMIT) += summit.o -obj-$(CONFIG_X86_BIGSMP) += bigsmp.o -obj-$(CONFIG_X86_ES7000) += es7000.o Index: linux-2.6-tip/arch/x86/mach-generic/bigsmp.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-generic/bigsmp.c +++ /dev/null @@ -1,60 +0,0 @@ -/* - * APIC driver for "bigsmp" XAPIC machines with more than 8 virtual CPUs. - * Drives the local APIC in "clustered mode". - */ -#define APIC_DEFINITION 1 -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -static int dmi_bigsmp; /* can be set by dmi scanners */ - -static int hp_ht_bigsmp(const struct dmi_system_id *d) -{ - printk(KERN_NOTICE "%s detected: force use of apic=bigsmp\n", d->ident); - dmi_bigsmp = 1; - return 0; -} - - -static const struct dmi_system_id bigsmp_dmi_table[] = { - { hp_ht_bigsmp, "HP ProLiant DL760 G2", - { DMI_MATCH(DMI_BIOS_VENDOR, "HP"), - DMI_MATCH(DMI_BIOS_VERSION, "P44-"),} - }, - - { hp_ht_bigsmp, "HP ProLiant DL740", - { DMI_MATCH(DMI_BIOS_VENDOR, "HP"), - DMI_MATCH(DMI_BIOS_VERSION, "P47-"),} - }, - { } -}; - -static void vector_allocation_domain(int cpu, cpumask_t *retmask) -{ - cpus_clear(*retmask); - cpu_set(cpu, *retmask); -} - -static int probe_bigsmp(void) -{ - if (def_to_bigsmp) - dmi_bigsmp = 1; - else - dmi_check_system(bigsmp_dmi_table); - return dmi_bigsmp; -} - -struct genapic apic_bigsmp = APIC_INIT("bigsmp", probe_bigsmp); Index: linux-2.6-tip/arch/x86/mach-generic/default.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-generic/default.c +++ /dev/null @@ -1,27 +0,0 @@ -/* - * Default generic APIC driver. This handles up to 8 CPUs. - */ -#define APIC_DEFINITION 1 -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -/* should be called last. */ -static int probe_default(void) -{ - return 1; -} - -struct genapic apic_default = APIC_INIT("default", probe_default); Index: linux-2.6-tip/arch/x86/mach-generic/es7000.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-generic/es7000.c +++ /dev/null @@ -1,103 +0,0 @@ -/* - * APIC driver for the Unisys ES7000 chipset. - */ -#define APIC_DEFINITION 1 -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -void __init es7000_update_genapic_to_cluster(void) -{ - genapic->target_cpus = target_cpus_cluster; - genapic->int_delivery_mode = INT_DELIVERY_MODE_CLUSTER; - genapic->int_dest_mode = INT_DEST_MODE_CLUSTER; - genapic->no_balance_irq = NO_BALANCE_IRQ_CLUSTER; - - genapic->init_apic_ldr = init_apic_ldr_cluster; - - genapic->cpu_mask_to_apicid = cpu_mask_to_apicid_cluster; -} - -static int probe_es7000(void) -{ - /* probed later in mptable/ACPI hooks */ - return 0; -} - -extern void es7000_sw_apic(void); -static void __init enable_apic_mode(void) -{ - es7000_sw_apic(); - return; -} - -static __init int -mps_oem_check(struct mpc_table *mpc, char *oem, char *productid) -{ - if (mpc->oemptr) { - struct mpc_oemtable *oem_table = - (struct mpc_oemtable *)mpc->oemptr; - if (!strncmp(oem, "UNISYS", 6)) - return parse_unisys_oem((char *)oem_table); - } - return 0; -} - -#ifdef CONFIG_ACPI -/* Hook from generic ACPI tables.c */ -static int __init acpi_madt_oem_check(char *oem_id, char *oem_table_id) -{ - unsigned long oem_addr = 0; - int check_dsdt; - int ret = 0; - - /* check dsdt at first to avoid clear fix_map for oem_addr */ - check_dsdt = es7000_check_dsdt(); - - if (!find_unisys_acpi_oem_table(&oem_addr)) { - if (check_dsdt) - ret = parse_unisys_oem((char *)oem_addr); - else { - setup_unisys(); - ret = 1; - } - /* - * we need to unmap it - */ - unmap_unisys_acpi_oem_table(oem_addr); - } - return ret; -} -#else -static int __init acpi_madt_oem_check(char *oem_id, char *oem_table_id) -{ - return 0; -} -#endif - -static void vector_allocation_domain(int cpu, cpumask_t *retmask) -{ - /* Careful. Some cpus do not strictly honor the set of cpus - * specified in the interrupt destination when using lowest - * priority interrupt delivery mode. - * - * In particular there was a hyperthreading cpu observed to - * deliver interrupts to the wrong hyperthread when only one - * hyperthread was specified in the interrupt desitination. - */ - *retmask = (cpumask_t){ { [0] = APIC_ALL_CPUS, } }; -} - -struct genapic __initdata_refok apic_es7000 = APIC_INIT("es7000", probe_es7000); Index: linux-2.6-tip/arch/x86/mach-generic/numaq.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-generic/numaq.c +++ /dev/null @@ -1,53 +0,0 @@ -/* - * APIC driver for the IBM NUMAQ chipset. - */ -#define APIC_DEFINITION 1 -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -static int mps_oem_check(struct mpc_table *mpc, char *oem, char *productid) -{ - numaq_mps_oem_check(mpc, oem, productid); - return found_numaq; -} - -static int probe_numaq(void) -{ - /* already know from get_memcfg_numaq() */ - return found_numaq; -} - -/* Hook from generic ACPI tables.c */ -static int acpi_madt_oem_check(char *oem_id, char *oem_table_id) -{ - return 0; -} - -static void vector_allocation_domain(int cpu, cpumask_t *retmask) -{ - /* Careful. Some cpus do not strictly honor the set of cpus - * specified in the interrupt destination when using lowest - * priority interrupt delivery mode. - * - * In particular there was a hyperthreading cpu observed to - * deliver interrupts to the wrong hyperthread when only one - * hyperthread was specified in the interrupt desitination. - */ - *retmask = (cpumask_t){ { [0] = APIC_ALL_CPUS, } }; -} - -struct genapic apic_numaq = APIC_INIT("NUMAQ", probe_numaq); Index: linux-2.6-tip/arch/x86/mach-generic/probe.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-generic/probe.c +++ /dev/null @@ -1,152 +0,0 @@ -/* - * Copyright 2003 Andi Kleen, SuSE Labs. - * Subject to the GNU Public License, v.2 - * - * Generic x86 APIC driver probe layer. - */ -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -extern struct genapic apic_numaq; -extern struct genapic apic_summit; -extern struct genapic apic_bigsmp; -extern struct genapic apic_es7000; -extern struct genapic apic_default; - -struct genapic *genapic = &apic_default; - -static struct genapic *apic_probe[] __initdata = { -#ifdef CONFIG_X86_NUMAQ - &apic_numaq, -#endif -#ifdef CONFIG_X86_SUMMIT - &apic_summit, -#endif -#ifdef CONFIG_X86_BIGSMP - &apic_bigsmp, -#endif -#ifdef CONFIG_X86_ES7000 - &apic_es7000, -#endif - &apic_default, /* must be last */ - NULL, -}; - -static int cmdline_apic __initdata; -static int __init parse_apic(char *arg) -{ - int i; - - if (!arg) - return -EINVAL; - - for (i = 0; apic_probe[i]; i++) { - if (!strcmp(apic_probe[i]->name, arg)) { - genapic = apic_probe[i]; - cmdline_apic = 1; - return 0; - } - } - - if (x86_quirks->update_genapic) - x86_quirks->update_genapic(); - - /* Parsed again by __setup for debug/verbose */ - return 0; -} -early_param("apic", parse_apic); - -void __init generic_bigsmp_probe(void) -{ -#ifdef CONFIG_X86_BIGSMP - /* - * This routine is used to switch to bigsmp mode when - * - There is no apic= option specified by the user - * - generic_apic_probe() has chosen apic_default as the sub_arch - * - we find more than 8 CPUs in acpi LAPIC listing with xAPIC support - */ - - if (!cmdline_apic && genapic == &apic_default) { - if (apic_bigsmp.probe()) { - genapic = &apic_bigsmp; - if (x86_quirks->update_genapic) - x86_quirks->update_genapic(); - printk(KERN_INFO "Overriding APIC driver with %s\n", - genapic->name); - } - } -#endif -} - -void __init generic_apic_probe(void) -{ - if (!cmdline_apic) { - int i; - for (i = 0; apic_probe[i]; i++) { - if (apic_probe[i]->probe()) { - genapic = apic_probe[i]; - break; - } - } - /* Not visible without early console */ - if (!apic_probe[i]) - panic("Didn't find an APIC driver"); - - if (x86_quirks->update_genapic) - x86_quirks->update_genapic(); - } - printk(KERN_INFO "Using APIC driver %s\n", genapic->name); -} - -/* These functions can switch the APIC even after the initial ->probe() */ - -int __init mps_oem_check(struct mpc_table *mpc, char *oem, char *productid) -{ - int i; - for (i = 0; apic_probe[i]; ++i) { - if (apic_probe[i]->mps_oem_check(mpc, oem, productid)) { - if (!cmdline_apic) { - genapic = apic_probe[i]; - if (x86_quirks->update_genapic) - x86_quirks->update_genapic(); - printk(KERN_INFO "Switched to APIC driver `%s'.\n", - genapic->name); - } - return 1; - } - } - return 0; -} - -int __init acpi_madt_oem_check(char *oem_id, char *oem_table_id) -{ - int i; - for (i = 0; apic_probe[i]; ++i) { - if (apic_probe[i]->acpi_madt_oem_check(oem_id, oem_table_id)) { - if (!cmdline_apic) { - genapic = apic_probe[i]; - if (x86_quirks->update_genapic) - x86_quirks->update_genapic(); - printk(KERN_INFO "Switched to APIC driver `%s'.\n", - genapic->name); - } - return 1; - } - } - return 0; -} - -int hard_smp_processor_id(void) -{ - return genapic->get_apic_id(*(unsigned long *)(APIC_BASE+APIC_ID)); -} Index: linux-2.6-tip/arch/x86/mach-generic/summit.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-generic/summit.c +++ /dev/null @@ -1,40 +0,0 @@ -/* - * APIC driver for the IBM "Summit" chipset. - */ -#define APIC_DEFINITION 1 -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -static int probe_summit(void) -{ - /* probed later in mptable/ACPI hooks */ - return 0; -} - -static void vector_allocation_domain(int cpu, cpumask_t *retmask) -{ - /* Careful. Some cpus do not strictly honor the set of cpus - * specified in the interrupt destination when using lowest - * priority interrupt delivery mode. - * - * In particular there was a hyperthreading cpu observed to - * deliver interrupts to the wrong hyperthread when only one - * hyperthread was specified in the interrupt desitination. - */ - *retmask = (cpumask_t){ { [0] = APIC_ALL_CPUS, } }; -} - -struct genapic apic_summit = APIC_INIT("summit", probe_summit); Index: linux-2.6-tip/arch/x86/mach-rdc321x/Makefile =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-rdc321x/Makefile +++ /dev/null @@ -1,5 +0,0 @@ -# -# Makefile for the RDC321x specific parts of the kernel -# -obj-$(CONFIG_X86_RDC321X) := gpio.o platform.o - Index: linux-2.6-tip/arch/x86/mach-rdc321x/gpio.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-rdc321x/gpio.c +++ /dev/null @@ -1,194 +0,0 @@ -/* - * GPIO support for RDC SoC R3210/R8610 - * - * Copyright (C) 2007, Florian Fainelli - * Copyright (C) 2008, Volker Weiss - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. - * - */ - - -#include -#include -#include -#include - -#include -#include - - -/* spin lock to protect our private copy of GPIO data register plus - the access to PCI conf registers. */ -static DEFINE_SPINLOCK(gpio_lock); - -/* copy of GPIO data registers */ -static u32 gpio_data_reg1; -static u32 gpio_data_reg2; - -static u32 gpio_request_data[2]; - - -static inline void rdc321x_conf_write(unsigned addr, u32 value) -{ - outl((1 << 31) | (7 << 11) | addr, RDC3210_CFGREG_ADDR); - outl(value, RDC3210_CFGREG_DATA); -} - -static inline void rdc321x_conf_or(unsigned addr, u32 value) -{ - outl((1 << 31) | (7 << 11) | addr, RDC3210_CFGREG_ADDR); - value |= inl(RDC3210_CFGREG_DATA); - outl(value, RDC3210_CFGREG_DATA); -} - -static inline u32 rdc321x_conf_read(unsigned addr) -{ - outl((1 << 31) | (7 << 11) | addr, RDC3210_CFGREG_ADDR); - - return inl(RDC3210_CFGREG_DATA); -} - -/* configure pin as GPIO */ -static void rdc321x_configure_gpio(unsigned gpio) -{ - unsigned long flags; - - spin_lock_irqsave(&gpio_lock, flags); - rdc321x_conf_or(gpio < 32 - ? RDC321X_GPIO_CTRL_REG1 : RDC321X_GPIO_CTRL_REG2, - 1 << (gpio & 0x1f)); - spin_unlock_irqrestore(&gpio_lock, flags); -} - -/* initially setup the 2 copies of the gpio data registers. - This function must be called by the platform setup code. */ -void __init rdc321x_gpio_setup() -{ - /* this might not be, what others (BIOS, bootloader, etc.) - wrote to these registers before, but it's a good guess. Still - better than just using 0xffffffff. */ - - gpio_data_reg1 = rdc321x_conf_read(RDC321X_GPIO_DATA_REG1); - gpio_data_reg2 = rdc321x_conf_read(RDC321X_GPIO_DATA_REG2); -} - -/* determine, if gpio number is valid */ -static inline int rdc321x_is_gpio(unsigned gpio) -{ - return gpio <= RDC321X_MAX_GPIO; -} - -/* request GPIO */ -int rdc_gpio_request(unsigned gpio, const char *label) -{ - unsigned long flags; - - if (!rdc321x_is_gpio(gpio)) - return -EINVAL; - - spin_lock_irqsave(&gpio_lock, flags); - if (gpio_request_data[(gpio & 0x20) ? 1 : 0] & (1 << (gpio & 0x1f))) - goto inuse; - gpio_request_data[(gpio & 0x20) ? 1 : 0] |= (1 << (gpio & 0x1f)); - spin_unlock_irqrestore(&gpio_lock, flags); - - return 0; -inuse: - spin_unlock_irqrestore(&gpio_lock, flags); - return -EINVAL; -} -EXPORT_SYMBOL(rdc_gpio_request); - -/* release previously-claimed GPIO */ -void rdc_gpio_free(unsigned gpio) -{ - unsigned long flags; - - if (!rdc321x_is_gpio(gpio)) - return; - - spin_lock_irqsave(&gpio_lock, flags); - gpio_request_data[(gpio & 0x20) ? 1 : 0] &= ~(1 << (gpio & 0x1f)); - spin_unlock_irqrestore(&gpio_lock, flags); -} -EXPORT_SYMBOL(rdc_gpio_free); - -/* read GPIO pin */ -int rdc_gpio_get_value(unsigned gpio) -{ - u32 reg; - unsigned long flags; - - spin_lock_irqsave(&gpio_lock, flags); - reg = rdc321x_conf_read(gpio < 32 - ? RDC321X_GPIO_DATA_REG1 : RDC321X_GPIO_DATA_REG2); - spin_unlock_irqrestore(&gpio_lock, flags); - - return (1 << (gpio & 0x1f)) & reg ? 1 : 0; -} -EXPORT_SYMBOL(rdc_gpio_get_value); - -/* set GPIO pin to value */ -void rdc_gpio_set_value(unsigned gpio, int value) -{ - unsigned long flags; - u32 reg; - - reg = 1 << (gpio & 0x1f); - if (gpio < 32) { - spin_lock_irqsave(&gpio_lock, flags); - if (value) - gpio_data_reg1 |= reg; - else - gpio_data_reg1 &= ~reg; - rdc321x_conf_write(RDC321X_GPIO_DATA_REG1, gpio_data_reg1); - spin_unlock_irqrestore(&gpio_lock, flags); - } else { - spin_lock_irqsave(&gpio_lock, flags); - if (value) - gpio_data_reg2 |= reg; - else - gpio_data_reg2 &= ~reg; - rdc321x_conf_write(RDC321X_GPIO_DATA_REG2, gpio_data_reg2); - spin_unlock_irqrestore(&gpio_lock, flags); - } -} -EXPORT_SYMBOL(rdc_gpio_set_value); - -/* configure GPIO pin as input */ -int rdc_gpio_direction_input(unsigned gpio) -{ - if (!rdc321x_is_gpio(gpio)) - return -EINVAL; - - rdc321x_configure_gpio(gpio); - - return 0; -} -EXPORT_SYMBOL(rdc_gpio_direction_input); - -/* configure GPIO pin as output and set value */ -int rdc_gpio_direction_output(unsigned gpio, int value) -{ - if (!rdc321x_is_gpio(gpio)) - return -EINVAL; - - gpio_set_value(gpio, value); - rdc321x_configure_gpio(gpio); - - return 0; -} -EXPORT_SYMBOL(rdc_gpio_direction_output); Index: linux-2.6-tip/arch/x86/mach-rdc321x/platform.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-rdc321x/platform.c +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Generic RDC321x platform devices - * - * Copyright (C) 2007 Florian Fainelli - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version 2 - * of the License, or (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the - * Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, - * Boston, MA 02110-1301, USA. - * - */ - -#include -#include -#include -#include -#include -#include - -#include - -/* LEDS */ -static struct gpio_led default_leds[] = { - { .name = "rdc:dmz", .gpio = 1, }, -}; - -static struct gpio_led_platform_data rdc321x_led_data = { - .num_leds = ARRAY_SIZE(default_leds), - .leds = default_leds, -}; - -static struct platform_device rdc321x_leds = { - .name = "leds-gpio", - .id = -1, - .dev = { - .platform_data = &rdc321x_led_data, - } -}; - -/* Watchdog */ -static struct platform_device rdc321x_wdt = { - .name = "rdc321x-wdt", - .id = -1, - .num_resources = 0, -}; - -static struct platform_device *rdc321x_devs[] = { - &rdc321x_leds, - &rdc321x_wdt -}; - -static int __init rdc_board_setup(void) -{ - rdc321x_gpio_setup(); - - return platform_add_devices(rdc321x_devs, ARRAY_SIZE(rdc321x_devs)); -} - -arch_initcall(rdc_board_setup); Index: linux-2.6-tip/arch/x86/mach-voyager/Makefile =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-voyager/Makefile +++ /dev/null @@ -1,8 +0,0 @@ -# -# Makefile for the linux kernel. -# - -EXTRA_CFLAGS := -Iarch/x86/kernel -obj-y := setup.o voyager_basic.o voyager_thread.o - -obj-$(CONFIG_SMP) += voyager_smp.o voyager_cat.o Index: linux-2.6-tip/arch/x86/mach-voyager/setup.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-voyager/setup.c +++ /dev/null @@ -1,118 +0,0 @@ -/* - * Machine specific setup for generic - */ - -#include -#include -#include -#include -#include -#include -#include - -void __init pre_intr_init_hook(void) -{ - init_ISA_irqs(); -} - -/* - * IRQ2 is cascade interrupt to second interrupt controller - */ -static struct irqaction irq2 = { - .handler = no_action, - .mask = CPU_MASK_NONE, - .name = "cascade", -}; - -void __init intr_init_hook(void) -{ -#ifdef CONFIG_SMP - voyager_smp_intr_init(); -#endif - - setup_irq(2, &irq2); -} - -static void voyager_disable_tsc(void) -{ - /* Voyagers run their CPUs from independent clocks, so disable - * the TSC code because we can't sync them */ - setup_clear_cpu_cap(X86_FEATURE_TSC); -} - -void __init pre_setup_arch_hook(void) -{ - voyager_disable_tsc(); -} - -void __init pre_time_init_hook(void) -{ - voyager_disable_tsc(); -} - -void __init trap_init_hook(void) -{ -} - -static struct irqaction irq0 = { - .handler = timer_interrupt, - .flags = IRQF_DISABLED | IRQF_NOBALANCING | IRQF_IRQPOLL | IRQF_TIMER, - .mask = CPU_MASK_NONE, - .name = "timer" -}; - -void __init time_init_hook(void) -{ - irq0.mask = cpumask_of_cpu(safe_smp_processor_id()); - setup_irq(0, &irq0); -} - -/* Hook for machine specific memory setup. */ - -char *__init machine_specific_memory_setup(void) -{ - char *who; - int new_nr; - - who = "NOT VOYAGER"; - - if (voyager_level == 5) { - __u32 addr, length; - int i; - - who = "Voyager-SUS"; - - e820.nr_map = 0; - for (i = 0; voyager_memory_detect(i, &addr, &length); i++) { - e820_add_region(addr, length, E820_RAM); - } - return who; - } else if (voyager_level == 4) { - __u32 tom; - __u16 catbase = inb(VOYAGER_SSPB_RELOCATION_PORT) << 8; - /* select the DINO config space */ - outb(VOYAGER_DINO, VOYAGER_CAT_CONFIG_PORT); - /* Read DINO top of memory register */ - tom = ((inb(catbase + 0x4) & 0xf0) << 16) - + ((inb(catbase + 0x5) & 0x7f) << 24); - - if (inb(catbase) != VOYAGER_DINO) { - printk(KERN_ERR - "Voyager: Failed to get DINO for L4, setting tom to EXT_MEM_K\n"); - tom = (boot_params.screen_info.ext_mem_k) << 10; - } - who = "Voyager-TOM"; - e820_add_region(0, 0x9f000, E820_RAM); - /* map from 1M to top of memory */ - e820_add_region(1 * 1024 * 1024, tom - 1 * 1024 * 1024, - E820_RAM); - /* FIXME: Should check the ASICs to see if I need to - * take out the 8M window. Just do it at the moment - * */ - e820_add_region(8 * 1024 * 1024, 8 * 1024 * 1024, - E820_RESERVED); - return who; - } - - return default_machine_specific_memory_setup(); -} Index: linux-2.6-tip/arch/x86/mach-voyager/voyager_basic.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-voyager/voyager_basic.c +++ /dev/null @@ -1,317 +0,0 @@ -/* Copyright (C) 1999,2001 - * - * Author: J.E.J.Bottomley@HansenPartnership.com - * - * This file contains all the voyager specific routines for getting - * initialisation of the architecture to function. For additional - * features see: - * - * voyager_cat.c - Voyager CAT bus interface - * voyager_smp.c - Voyager SMP hal (emulates linux smp.c) - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -/* - * Power off function, if any - */ -void (*pm_power_off) (void); -EXPORT_SYMBOL(pm_power_off); - -int voyager_level = 0; - -struct voyager_SUS *voyager_SUS = NULL; - -#ifdef CONFIG_SMP -static void voyager_dump(int dummy1, struct tty_struct *dummy3) -{ - /* get here via a sysrq */ - voyager_smp_dump(); -} - -static struct sysrq_key_op sysrq_voyager_dump_op = { - .handler = voyager_dump, - .help_msg = "Voyager", - .action_msg = "Dump Voyager Status", -}; -#endif - -void voyager_detect(struct voyager_bios_info *bios) -{ - if (bios->len != 0xff) { - int class = (bios->class_1 << 8) - | (bios->class_2 & 0xff); - - printk("Voyager System detected.\n" - " Class %x, Revision %d.%d\n", - class, bios->major, bios->minor); - if (class == VOYAGER_LEVEL4) - voyager_level = 4; - else if (class < VOYAGER_LEVEL5_AND_ABOVE) - voyager_level = 3; - else - voyager_level = 5; - printk(" Architecture Level %d\n", voyager_level); - if (voyager_level < 4) - printk - ("\n**WARNING**: Voyager HAL only supports Levels 4 and 5 Architectures at the moment\n\n"); - /* install the power off handler */ - pm_power_off = voyager_power_off; -#ifdef CONFIG_SMP - register_sysrq_key('v', &sysrq_voyager_dump_op); -#endif - } else { - printk("\n\n**WARNING**: No Voyager Subsystem Found\n"); - } -} - -void voyager_system_interrupt(int cpl, void *dev_id) -{ - printk("Voyager: detected system interrupt\n"); -} - -/* Routine to read information from the extended CMOS area */ -__u8 voyager_extended_cmos_read(__u16 addr) -{ - outb(addr & 0xff, 0x74); - outb((addr >> 8) & 0xff, 0x75); - return inb(0x76); -} - -/* internal definitions for the SUS Click Map of memory */ - -#define CLICK_ENTRIES 16 -#define CLICK_SIZE 4096 /* click to byte conversion for Length */ - -typedef struct ClickMap { - struct Entry { - __u32 Address; - __u32 Length; - } Entry[CLICK_ENTRIES]; -} ClickMap_t; - -/* This routine is pretty much an awful hack to read the bios clickmap by - * mapping it into page 0. There are usually three regions in the map: - * Base Memory - * Extended Memory - * zero length marker for end of map - * - * Returns are 0 for failure and 1 for success on extracting region. - */ -int __init voyager_memory_detect(int region, __u32 * start, __u32 * length) -{ - int i; - int retval = 0; - __u8 cmos[4]; - ClickMap_t *map; - unsigned long map_addr; - unsigned long old; - - if (region >= CLICK_ENTRIES) { - printk("Voyager: Illegal ClickMap region %d\n", region); - return 0; - } - - for (i = 0; i < sizeof(cmos); i++) - cmos[i] = - voyager_extended_cmos_read(VOYAGER_MEMORY_CLICKMAP + i); - - map_addr = *(unsigned long *)cmos; - - /* steal page 0 for this */ - old = pg0[0]; - pg0[0] = ((map_addr & PAGE_MASK) | _PAGE_RW | _PAGE_PRESENT); - local_flush_tlb(); - /* now clear everything out but page 0 */ - map = (ClickMap_t *) (map_addr & (~PAGE_MASK)); - - /* zero length is the end of the clickmap */ - if (map->Entry[region].Length != 0) { - *length = map->Entry[region].Length * CLICK_SIZE; - *start = map->Entry[region].Address; - retval = 1; - } - - /* replace the mapping */ - pg0[0] = old; - local_flush_tlb(); - return retval; -} - -/* voyager specific handling code for timer interrupts. Used to hand - * off the timer tick to the SMP code, since the VIC doesn't have an - * internal timer (The QIC does, but that's another story). */ -void voyager_timer_interrupt(void) -{ - if ((jiffies & 0x3ff) == 0) { - - /* There seems to be something flaky in either - * hardware or software that is resetting the timer 0 - * count to something much higher than it should be - * This seems to occur in the boot sequence, just - * before root is mounted. Therefore, every 10 - * seconds or so, we sanity check the timer zero count - * and kick it back to where it should be. - * - * FIXME: This is the most awful hack yet seen. I - * should work out exactly what is interfering with - * the timer count settings early in the boot sequence - * and swiftly introduce it to something sharp and - * pointy. */ - __u16 val; - - spin_lock(&i8253_lock); - - outb_p(0x00, 0x43); - val = inb_p(0x40); - val |= inb(0x40) << 8; - spin_unlock(&i8253_lock); - - if (val > LATCH) { - printk - ("\nVOYAGER: countdown timer value too high (%d), resetting\n\n", - val); - spin_lock(&i8253_lock); - outb(0x34, 0x43); - outb_p(LATCH & 0xff, 0x40); /* LSB */ - outb(LATCH >> 8, 0x40); /* MSB */ - spin_unlock(&i8253_lock); - } - } -#ifdef CONFIG_SMP - smp_vic_timer_interrupt(); -#endif -} - -void voyager_power_off(void) -{ - printk("VOYAGER Power Off\n"); - - if (voyager_level == 5) { - voyager_cat_power_off(); - } else if (voyager_level == 4) { - /* This doesn't apparently work on most L4 machines, - * but the specs say to do this to get automatic power - * off. Unfortunately, if it doesn't power off the - * machine, it ends up doing a cold restart, which - * isn't really intended, so comment out the code */ -#if 0 - int port; - - /* enable the voyager Configuration Space */ - outb((inb(VOYAGER_MC_SETUP) & 0xf0) | 0x8, VOYAGER_MC_SETUP); - /* the port for the power off flag is an offset from the - floating base */ - port = (inb(VOYAGER_SSPB_RELOCATION_PORT) << 8) + 0x21; - /* set the power off flag */ - outb(inb(port) | 0x1, port); -#endif - } - /* and wait for it to happen */ - local_irq_disable(); - for (;;) - halt(); -} - -/* copied from process.c */ -static inline void kb_wait(void) -{ - int i; - - for (i = 0; i < 0x10000; i++) - if ((inb_p(0x64) & 0x02) == 0) - break; -} - -void machine_shutdown(void) -{ - /* Architecture specific shutdown needed before a kexec */ -} - -void machine_restart(char *cmd) -{ - printk("Voyager Warm Restart\n"); - kb_wait(); - - if (voyager_level == 5) { - /* write magic values to the RTC to inform system that - * shutdown is beginning */ - outb(0x8f, 0x70); - outb(0x5, 0x71); - - udelay(50); - outb(0xfe, 0x64); /* pull reset low */ - } else if (voyager_level == 4) { - __u16 catbase = inb(VOYAGER_SSPB_RELOCATION_PORT) << 8; - __u8 basebd = inb(VOYAGER_MC_SETUP); - - outb(basebd | 0x08, VOYAGER_MC_SETUP); - outb(0x02, catbase + 0x21); - } - local_irq_disable(); - for (;;) - halt(); -} - -void machine_emergency_restart(void) -{ - /*for now, just hook this to a warm restart */ - machine_restart(NULL); -} - -void mca_nmi_hook(void) -{ - __u8 dumpval __maybe_unused = inb(0xf823); - __u8 swnmi __maybe_unused = inb(0xf813); - - /* FIXME: assume dump switch pressed */ - /* check to see if the dump switch was pressed */ - VDEBUG(("VOYAGER: dumpval = 0x%x, swnmi = 0x%x\n", dumpval, swnmi)); - /* clear swnmi */ - outb(0xff, 0xf813); - /* tell SUS to ignore dump */ - if (voyager_level == 5 && voyager_SUS != NULL) { - if (voyager_SUS->SUS_mbox == VOYAGER_DUMP_BUTTON_NMI) { - voyager_SUS->kernel_mbox = VOYAGER_NO_COMMAND; - voyager_SUS->kernel_flags |= VOYAGER_OS_IN_PROGRESS; - udelay(1000); - voyager_SUS->kernel_mbox = VOYAGER_IGNORE_DUMP; - voyager_SUS->kernel_flags &= ~VOYAGER_OS_IN_PROGRESS; - } - } - printk(KERN_ERR - "VOYAGER: Dump switch pressed, printing CPU%d tracebacks\n", - smp_processor_id()); - show_stack(NULL, NULL); - show_state(); -} - -void machine_halt(void) -{ - /* treat a halt like a power off */ - machine_power_off(); -} - -void machine_power_off(void) -{ - if (pm_power_off) - pm_power_off(); -} Index: linux-2.6-tip/arch/x86/mach-voyager/voyager_cat.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-voyager/voyager_cat.c +++ /dev/null @@ -1,1197 +0,0 @@ -/* -*- mode: c; c-basic-offset: 8 -*- */ - -/* Copyright (C) 1999,2001 - * - * Author: J.E.J.Bottomley@HansenPartnership.com - * - * This file contains all the logic for manipulating the CAT bus - * in a level 5 machine. - * - * The CAT bus is a serial configuration and test bus. Its primary - * uses are to probe the initial configuration of the system and to - * diagnose error conditions when a system interrupt occurs. The low - * level interface is fairly primitive, so most of this file consists - * of bit shift manipulations to send and receive packets on the - * serial bus */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#ifdef VOYAGER_CAT_DEBUG -#define CDEBUG(x) printk x -#else -#define CDEBUG(x) -#endif - -/* the CAT command port */ -#define CAT_CMD (sspb + 0xe) -/* the CAT data port */ -#define CAT_DATA (sspb + 0xd) - -/* the internal cat functions */ -static void cat_pack(__u8 * msg, __u16 start_bit, __u8 * data, __u16 num_bits); -static void cat_unpack(__u8 * msg, __u16 start_bit, __u8 * data, - __u16 num_bits); -static void cat_build_header(__u8 * header, const __u16 len, - const __u16 smallest_reg_bits, - const __u16 longest_reg_bits); -static int cat_sendinst(voyager_module_t * modp, voyager_asic_t * asicp, - __u8 reg, __u8 op); -static int cat_getdata(voyager_module_t * modp, voyager_asic_t * asicp, - __u8 reg, __u8 * value); -static int cat_shiftout(__u8 * data, __u16 data_bytes, __u16 header_bytes, - __u8 pad_bits); -static int cat_write(voyager_module_t * modp, voyager_asic_t * asicp, __u8 reg, - __u8 value); -static int cat_read(voyager_module_t * modp, voyager_asic_t * asicp, __u8 reg, - __u8 * value); -static int cat_subread(voyager_module_t * modp, voyager_asic_t * asicp, - __u16 offset, __u16 len, void *buf); -static int cat_senddata(voyager_module_t * modp, voyager_asic_t * asicp, - __u8 reg, __u8 value); -static int cat_disconnect(voyager_module_t * modp, voyager_asic_t * asicp); -static int cat_connect(voyager_module_t * modp, voyager_asic_t * asicp); - -static inline const char *cat_module_name(int module_id) -{ - switch (module_id) { - case 0x10: - return "Processor Slot 0"; - case 0x11: - return "Processor Slot 1"; - case 0x12: - return "Processor Slot 2"; - case 0x13: - return "Processor Slot 4"; - case 0x14: - return "Memory Slot 0"; - case 0x15: - return "Memory Slot 1"; - case 0x18: - return "Primary Microchannel"; - case 0x19: - return "Secondary Microchannel"; - case 0x1a: - return "Power Supply Interface"; - case 0x1c: - return "Processor Slot 5"; - case 0x1d: - return "Processor Slot 6"; - case 0x1e: - return "Processor Slot 7"; - case 0x1f: - return "Processor Slot 8"; - default: - return "Unknown Module"; - } -} - -static int sspb = 0; /* stores the super port location */ -int voyager_8slot = 0; /* set to true if a 51xx monster */ - -voyager_module_t *voyager_cat_list; - -/* the I/O port assignments for the VIC and QIC */ -static struct resource vic_res = { - .name = "Voyager Interrupt Controller", - .start = 0xFC00, - .end = 0xFC6F -}; -static struct resource qic_res = { - .name = "Quad Interrupt Controller", - .start = 0xFC70, - .end = 0xFCFF -}; - -/* This function is used to pack a data bit stream inside a message. - * It writes num_bits of the data buffer in msg starting at start_bit. - * Note: This function assumes that any unused bit in the data stream - * is set to zero so that the ors will work correctly */ -static void -cat_pack(__u8 * msg, const __u16 start_bit, __u8 * data, const __u16 num_bits) -{ - /* compute initial shift needed */ - const __u16 offset = start_bit % BITS_PER_BYTE; - __u16 len = num_bits / BITS_PER_BYTE; - __u16 byte = start_bit / BITS_PER_BYTE; - __u16 residue = (num_bits % BITS_PER_BYTE) + offset; - int i; - - /* adjust if we have more than a byte of residue */ - if (residue >= BITS_PER_BYTE) { - residue -= BITS_PER_BYTE; - len++; - } - - /* clear out the bits. We assume here that if len==0 then - * residue >= offset. This is always true for the catbus - * operations */ - msg[byte] &= 0xff << (BITS_PER_BYTE - offset); - msg[byte++] |= data[0] >> offset; - if (len == 0) - return; - for (i = 1; i < len; i++) - msg[byte++] = (data[i - 1] << (BITS_PER_BYTE - offset)) - | (data[i] >> offset); - if (residue != 0) { - __u8 mask = 0xff >> residue; - __u8 last_byte = data[i - 1] << (BITS_PER_BYTE - offset) - | (data[i] >> offset); - - last_byte &= ~mask; - msg[byte] &= mask; - msg[byte] |= last_byte; - } - return; -} - -/* unpack the data again (same arguments as cat_pack()). data buffer - * must be zero populated. - * - * Function: given a message string move to start_bit and copy num_bits into - * data (starting at bit 0 in data). - */ -static void -cat_unpack(__u8 * msg, const __u16 start_bit, __u8 * data, const __u16 num_bits) -{ - /* compute initial shift needed */ - const __u16 offset = start_bit % BITS_PER_BYTE; - __u16 len = num_bits / BITS_PER_BYTE; - const __u8 last_bits = num_bits % BITS_PER_BYTE; - __u16 byte = start_bit / BITS_PER_BYTE; - int i; - - if (last_bits != 0) - len++; - - /* special case: want < 8 bits from msg and we can get it from - * a single byte of the msg */ - if (len == 0 && BITS_PER_BYTE - offset >= num_bits) { - data[0] = msg[byte] << offset; - data[0] &= 0xff >> (BITS_PER_BYTE - num_bits); - return; - } - for (i = 0; i < len; i++) { - /* this annoying if has to be done just in case a read of - * msg one beyond the array causes a panic */ - if (offset != 0) { - data[i] = msg[byte++] << offset; - data[i] |= msg[byte] >> (BITS_PER_BYTE - offset); - } else { - data[i] = msg[byte++]; - } - } - /* do we need to truncate the final byte */ - if (last_bits != 0) { - data[i - 1] &= 0xff << (BITS_PER_BYTE - last_bits); - } - return; -} - -static void -cat_build_header(__u8 * header, const __u16 len, const __u16 smallest_reg_bits, - const __u16 longest_reg_bits) -{ - int i; - __u16 start_bit = (smallest_reg_bits - 1) % BITS_PER_BYTE; - __u8 *last_byte = &header[len - 1]; - - if (start_bit == 0) - start_bit = 1; /* must have at least one bit in the hdr */ - - for (i = 0; i < len; i++) - header[i] = 0; - - for (i = start_bit; i > 0; i--) - *last_byte = ((*last_byte) << 1) + 1; - -} - -static int -cat_sendinst(voyager_module_t * modp, voyager_asic_t * asicp, __u8 reg, __u8 op) -{ - __u8 parity, inst, inst_buf[4] = { 0 }; - __u8 iseq[VOYAGER_MAX_SCAN_PATH], hseq[VOYAGER_MAX_REG_SIZE]; - __u16 ibytes, hbytes, padbits; - int i; - - /* - * Parity is the parity of the register number + 1 (READ_REGISTER - * and WRITE_REGISTER always add '1' to the number of bits == 1) - */ - parity = (__u8) (1 + (reg & 0x01) + - ((__u8) (reg & 0x02) >> 1) + - ((__u8) (reg & 0x04) >> 2) + - ((__u8) (reg & 0x08) >> 3)) % 2; - - inst = ((parity << 7) | (reg << 2) | op); - - outb(VOYAGER_CAT_IRCYC, CAT_CMD); - if (!modp->scan_path_connected) { - if (asicp->asic_id != VOYAGER_CAT_ID) { - printk - ("**WARNING***: cat_sendinst has disconnected scan path not to CAT asic\n"); - return 1; - } - outb(VOYAGER_CAT_HEADER, CAT_DATA); - outb(inst, CAT_DATA); - if (inb(CAT_DATA) != VOYAGER_CAT_HEADER) { - CDEBUG(("VOYAGER CAT: cat_sendinst failed to get CAT_HEADER\n")); - return 1; - } - return 0; - } - ibytes = modp->inst_bits / BITS_PER_BYTE; - if ((padbits = modp->inst_bits % BITS_PER_BYTE) != 0) { - padbits = BITS_PER_BYTE - padbits; - ibytes++; - } - hbytes = modp->largest_reg / BITS_PER_BYTE; - if (modp->largest_reg % BITS_PER_BYTE) - hbytes++; - CDEBUG(("cat_sendinst: ibytes=%d, hbytes=%d\n", ibytes, hbytes)); - /* initialise the instruction sequence to 0xff */ - for (i = 0; i < ibytes + hbytes; i++) - iseq[i] = 0xff; - cat_build_header(hseq, hbytes, modp->smallest_reg, modp->largest_reg); - cat_pack(iseq, modp->inst_bits, hseq, hbytes * BITS_PER_BYTE); - inst_buf[0] = inst; - inst_buf[1] = 0xFF >> (modp->largest_reg % BITS_PER_BYTE); - cat_pack(iseq, asicp->bit_location, inst_buf, asicp->ireg_length); -#ifdef VOYAGER_CAT_DEBUG - printk("ins = 0x%x, iseq: ", inst); - for (i = 0; i < ibytes + hbytes; i++) - printk("0x%x ", iseq[i]); - printk("\n"); -#endif - if (cat_shiftout(iseq, ibytes, hbytes, padbits)) { - CDEBUG(("VOYAGER CAT: cat_sendinst: cat_shiftout failed\n")); - return 1; - } - CDEBUG(("CAT SHIFTOUT DONE\n")); - return 0; -} - -static int -cat_getdata(voyager_module_t * modp, voyager_asic_t * asicp, __u8 reg, - __u8 * value) -{ - if (!modp->scan_path_connected) { - if (asicp->asic_id != VOYAGER_CAT_ID) { - CDEBUG(("VOYAGER CAT: ERROR: cat_getdata to CAT asic with scan path connected\n")); - return 1; - } - if (reg > VOYAGER_SUBADDRHI) - outb(VOYAGER_CAT_RUN, CAT_CMD); - outb(VOYAGER_CAT_DRCYC, CAT_CMD); - outb(VOYAGER_CAT_HEADER, CAT_DATA); - *value = inb(CAT_DATA); - outb(0xAA, CAT_DATA); - if (inb(CAT_DATA) != VOYAGER_CAT_HEADER) { - CDEBUG(("cat_getdata: failed to get VOYAGER_CAT_HEADER\n")); - return 1; - } - return 0; - } else { - __u16 sbits = modp->num_asics - 1 + asicp->ireg_length; - __u16 sbytes = sbits / BITS_PER_BYTE; - __u16 tbytes; - __u8 string[VOYAGER_MAX_SCAN_PATH], - trailer[VOYAGER_MAX_REG_SIZE]; - __u8 padbits; - int i; - - outb(VOYAGER_CAT_DRCYC, CAT_CMD); - - if ((padbits = sbits % BITS_PER_BYTE) != 0) { - padbits = BITS_PER_BYTE - padbits; - sbytes++; - } - tbytes = asicp->ireg_length / BITS_PER_BYTE; - if (asicp->ireg_length % BITS_PER_BYTE) - tbytes++; - CDEBUG(("cat_getdata: tbytes = %d, sbytes = %d, padbits = %d\n", - tbytes, sbytes, padbits)); - cat_build_header(trailer, tbytes, 1, asicp->ireg_length); - - for (i = tbytes - 1; i >= 0; i--) { - outb(trailer[i], CAT_DATA); - string[sbytes + i] = inb(CAT_DATA); - } - - for (i = sbytes - 1; i >= 0; i--) { - outb(0xaa, CAT_DATA); - string[i] = inb(CAT_DATA); - } - *value = 0; - cat_unpack(string, - padbits + (tbytes * BITS_PER_BYTE) + - asicp->asic_location, value, asicp->ireg_length); -#ifdef VOYAGER_CAT_DEBUG - printk("value=0x%x, string: ", *value); - for (i = 0; i < tbytes + sbytes; i++) - printk("0x%x ", string[i]); - printk("\n"); -#endif - - /* sanity check the rest of the return */ - for (i = 0; i < tbytes; i++) { - __u8 input = 0; - - cat_unpack(string, padbits + (i * BITS_PER_BYTE), - &input, BITS_PER_BYTE); - if (trailer[i] != input) { - CDEBUG(("cat_getdata: failed to sanity check rest of ret(%d) 0x%x != 0x%x\n", i, input, trailer[i])); - return 1; - } - } - CDEBUG(("cat_getdata DONE\n")); - return 0; - } -} - -static int -cat_shiftout(__u8 * data, __u16 data_bytes, __u16 header_bytes, __u8 pad_bits) -{ - int i; - - for (i = data_bytes + header_bytes - 1; i >= header_bytes; i--) - outb(data[i], CAT_DATA); - - for (i = header_bytes - 1; i >= 0; i--) { - __u8 header = 0; - __u8 input; - - outb(data[i], CAT_DATA); - input = inb(CAT_DATA); - CDEBUG(("cat_shiftout: returned 0x%x\n", input)); - cat_unpack(data, ((data_bytes + i) * BITS_PER_BYTE) - pad_bits, - &header, BITS_PER_BYTE); - if (input != header) { - CDEBUG(("VOYAGER CAT: cat_shiftout failed to return header 0x%x != 0x%x\n", input, header)); - return 1; - } - } - return 0; -} - -static int -cat_senddata(voyager_module_t * modp, voyager_asic_t * asicp, - __u8 reg, __u8 value) -{ - outb(VOYAGER_CAT_DRCYC, CAT_CMD); - if (!modp->scan_path_connected) { - if (asicp->asic_id != VOYAGER_CAT_ID) { - CDEBUG(("VOYAGER CAT: ERROR: scan path disconnected when asic != CAT\n")); - return 1; - } - outb(VOYAGER_CAT_HEADER, CAT_DATA); - outb(value, CAT_DATA); - if (inb(CAT_DATA) != VOYAGER_CAT_HEADER) { - CDEBUG(("cat_senddata: failed to get correct header response to sent data\n")); - return 1; - } - if (reg > VOYAGER_SUBADDRHI) { - outb(VOYAGER_CAT_RUN, CAT_CMD); - outb(VOYAGER_CAT_END, CAT_CMD); - outb(VOYAGER_CAT_RUN, CAT_CMD); - } - - return 0; - } else { - __u16 hbytes = asicp->ireg_length / BITS_PER_BYTE; - __u16 dbytes = - (modp->num_asics - 1 + asicp->ireg_length) / BITS_PER_BYTE; - __u8 padbits, dseq[VOYAGER_MAX_SCAN_PATH], - hseq[VOYAGER_MAX_REG_SIZE]; - int i; - - if ((padbits = (modp->num_asics - 1 - + asicp->ireg_length) % BITS_PER_BYTE) != 0) { - padbits = BITS_PER_BYTE - padbits; - dbytes++; - } - if (asicp->ireg_length % BITS_PER_BYTE) - hbytes++; - - cat_build_header(hseq, hbytes, 1, asicp->ireg_length); - - for (i = 0; i < dbytes + hbytes; i++) - dseq[i] = 0xff; - CDEBUG(("cat_senddata: dbytes=%d, hbytes=%d, padbits=%d\n", - dbytes, hbytes, padbits)); - cat_pack(dseq, modp->num_asics - 1 + asicp->ireg_length, - hseq, hbytes * BITS_PER_BYTE); - cat_pack(dseq, asicp->asic_location, &value, - asicp->ireg_length); -#ifdef VOYAGER_CAT_DEBUG - printk("dseq "); - for (i = 0; i < hbytes + dbytes; i++) { - printk("0x%x ", dseq[i]); - } - printk("\n"); -#endif - return cat_shiftout(dseq, dbytes, hbytes, padbits); - } -} - -static int -cat_write(voyager_module_t * modp, voyager_asic_t * asicp, __u8 reg, __u8 value) -{ - if (cat_sendinst(modp, asicp, reg, VOYAGER_WRITE_CONFIG)) - return 1; - return cat_senddata(modp, asicp, reg, value); -} - -static int -cat_read(voyager_module_t * modp, voyager_asic_t * asicp, __u8 reg, - __u8 * value) -{ - if (cat_sendinst(modp, asicp, reg, VOYAGER_READ_CONFIG)) - return 1; - return cat_getdata(modp, asicp, reg, value); -} - -static int -cat_subaddrsetup(voyager_module_t * modp, voyager_asic_t * asicp, __u16 offset, - __u16 len) -{ - __u8 val; - - if (len > 1) { - /* set auto increment */ - __u8 newval; - - if (cat_read(modp, asicp, VOYAGER_AUTO_INC_REG, &val)) { - CDEBUG(("cat_subaddrsetup: read of VOYAGER_AUTO_INC_REG failed\n")); - return 1; - } - CDEBUG(("cat_subaddrsetup: VOYAGER_AUTO_INC_REG = 0x%x\n", - val)); - newval = val | VOYAGER_AUTO_INC; - if (newval != val) { - if (cat_write(modp, asicp, VOYAGER_AUTO_INC_REG, val)) { - CDEBUG(("cat_subaddrsetup: write to VOYAGER_AUTO_INC_REG failed\n")); - return 1; - } - } - } - if (cat_write(modp, asicp, VOYAGER_SUBADDRLO, (__u8) (offset & 0xff))) { - CDEBUG(("cat_subaddrsetup: write to SUBADDRLO failed\n")); - return 1; - } - if (asicp->subaddr > VOYAGER_SUBADDR_LO) { - if (cat_write - (modp, asicp, VOYAGER_SUBADDRHI, (__u8) (offset >> 8))) { - CDEBUG(("cat_subaddrsetup: write to SUBADDRHI failed\n")); - return 1; - } - cat_read(modp, asicp, VOYAGER_SUBADDRHI, &val); - CDEBUG(("cat_subaddrsetup: offset = %d, hi = %d\n", offset, - val)); - } - cat_read(modp, asicp, VOYAGER_SUBADDRLO, &val); - CDEBUG(("cat_subaddrsetup: offset = %d, lo = %d\n", offset, val)); - return 0; -} - -static int -cat_subwrite(voyager_module_t * modp, voyager_asic_t * asicp, __u16 offset, - __u16 len, void *buf) -{ - int i, retval; - - /* FIXME: need special actions for VOYAGER_CAT_ID here */ - if (asicp->asic_id == VOYAGER_CAT_ID) { - CDEBUG(("cat_subwrite: ATTEMPT TO WRITE TO CAT ASIC\n")); - /* FIXME -- This is supposed to be handled better - * There is a problem writing to the cat asic in the - * PSI. The 30us delay seems to work, though */ - udelay(30); - } - - if ((retval = cat_subaddrsetup(modp, asicp, offset, len)) != 0) { - printk("cat_subwrite: cat_subaddrsetup FAILED\n"); - return retval; - } - - if (cat_sendinst - (modp, asicp, VOYAGER_SUBADDRDATA, VOYAGER_WRITE_CONFIG)) { - printk("cat_subwrite: cat_sendinst FAILED\n"); - return 1; - } - for (i = 0; i < len; i++) { - if (cat_senddata(modp, asicp, 0xFF, ((__u8 *) buf)[i])) { - printk - ("cat_subwrite: cat_sendata element at %d FAILED\n", - i); - return 1; - } - } - return 0; -} -static int -cat_subread(voyager_module_t * modp, voyager_asic_t * asicp, __u16 offset, - __u16 len, void *buf) -{ - int i, retval; - - if ((retval = cat_subaddrsetup(modp, asicp, offset, len)) != 0) { - CDEBUG(("cat_subread: cat_subaddrsetup FAILED\n")); - return retval; - } - - if (cat_sendinst(modp, asicp, VOYAGER_SUBADDRDATA, VOYAGER_READ_CONFIG)) { - CDEBUG(("cat_subread: cat_sendinst failed\n")); - return 1; - } - for (i = 0; i < len; i++) { - if (cat_getdata(modp, asicp, 0xFF, &((__u8 *) buf)[i])) { - CDEBUG(("cat_subread: cat_getdata element %d failed\n", - i)); - return 1; - } - } - return 0; -} - -/* buffer for storing EPROM data read in during initialisation */ -static __initdata __u8 eprom_buf[0xFFFF]; -static voyager_module_t *voyager_initial_module; - -/* Initialise the cat bus components. We assume this is called by the - * boot cpu *after* all memory initialisation has been done (so we can - * use kmalloc) but before smp initialisation, so we can probe the SMP - * configuration and pick up necessary information. */ -void __init voyager_cat_init(void) -{ - voyager_module_t **modpp = &voyager_initial_module; - voyager_asic_t **asicpp; - voyager_asic_t *qabc_asic = NULL; - int i, j; - unsigned long qic_addr = 0; - __u8 qabc_data[0x20]; - __u8 num_submodules, val; - voyager_eprom_hdr_t *eprom_hdr = (voyager_eprom_hdr_t *) & eprom_buf[0]; - - __u8 cmos[4]; - unsigned long addr; - - /* initiallise the SUS mailbox */ - for (i = 0; i < sizeof(cmos); i++) - cmos[i] = voyager_extended_cmos_read(VOYAGER_DUMP_LOCATION + i); - addr = *(unsigned long *)cmos; - if ((addr & 0xff000000) != 0xff000000) { - printk(KERN_ERR - "Voyager failed to get SUS mailbox (addr = 0x%lx\n", - addr); - } else { - static struct resource res; - - res.name = "voyager SUS"; - res.start = addr; - res.end = addr + 0x3ff; - - request_resource(&iomem_resource, &res); - voyager_SUS = (struct voyager_SUS *) - ioremap(addr, 0x400); - printk(KERN_NOTICE "Voyager SUS mailbox version 0x%x\n", - voyager_SUS->SUS_version); - voyager_SUS->kernel_version = VOYAGER_MAILBOX_VERSION; - voyager_SUS->kernel_flags = VOYAGER_OS_HAS_SYSINT; - } - - /* clear the processor counts */ - voyager_extended_vic_processors = 0; - voyager_quad_processors = 0; - - printk("VOYAGER: beginning CAT bus probe\n"); - /* set up the SuperSet Port Block which tells us where the - * CAT communication port is */ - sspb = inb(VOYAGER_SSPB_RELOCATION_PORT) * 0x100; - VDEBUG(("VOYAGER DEBUG: sspb = 0x%x\n", sspb)); - - /* now find out if were 8 slot or normal */ - if ((inb(VIC_PROC_WHO_AM_I) & EIGHT_SLOT_IDENTIFIER) - == EIGHT_SLOT_IDENTIFIER) { - voyager_8slot = 1; - printk(KERN_NOTICE - "Voyager: Eight slot 51xx configuration detected\n"); - } - - for (i = VOYAGER_MIN_MODULE; i <= VOYAGER_MAX_MODULE; i++) { - __u8 input; - int asic; - __u16 eprom_size; - __u16 sp_offset; - - outb(VOYAGER_CAT_DESELECT, VOYAGER_CAT_CONFIG_PORT); - outb(i, VOYAGER_CAT_CONFIG_PORT); - - /* check the presence of the module */ - outb(VOYAGER_CAT_RUN, CAT_CMD); - outb(VOYAGER_CAT_IRCYC, CAT_CMD); - outb(VOYAGER_CAT_HEADER, CAT_DATA); - /* stream series of alternating 1's and 0's to stimulate - * response */ - outb(0xAA, CAT_DATA); - input = inb(CAT_DATA); - outb(VOYAGER_CAT_END, CAT_CMD); - if (input != VOYAGER_CAT_HEADER) { - continue; - } - CDEBUG(("VOYAGER DEBUG: found module id 0x%x, %s\n", i, - cat_module_name(i))); - *modpp = kmalloc(sizeof(voyager_module_t), GFP_KERNEL); /*&voyager_module_storage[cat_count++]; */ - if (*modpp == NULL) { - printk("**WARNING** kmalloc failure in cat_init\n"); - continue; - } - memset(*modpp, 0, sizeof(voyager_module_t)); - /* need temporary asic for cat_subread. It will be - * filled in correctly later */ - (*modpp)->asic = kmalloc(sizeof(voyager_asic_t), GFP_KERNEL); /*&voyager_asic_storage[asic_count]; */ - if ((*modpp)->asic == NULL) { - printk("**WARNING** kmalloc failure in cat_init\n"); - continue; - } - memset((*modpp)->asic, 0, sizeof(voyager_asic_t)); - (*modpp)->asic->asic_id = VOYAGER_CAT_ID; - (*modpp)->asic->subaddr = VOYAGER_SUBADDR_HI; - (*modpp)->module_addr = i; - (*modpp)->scan_path_connected = 0; - if (i == VOYAGER_PSI) { - /* Exception leg for modules with no EEPROM */ - printk("Module \"%s\"\n", cat_module_name(i)); - continue; - } - - CDEBUG(("cat_init: Reading eeprom for module 0x%x at offset %d\n", i, VOYAGER_XSUM_END_OFFSET)); - outb(VOYAGER_CAT_RUN, CAT_CMD); - cat_disconnect(*modpp, (*modpp)->asic); - if (cat_subread(*modpp, (*modpp)->asic, - VOYAGER_XSUM_END_OFFSET, sizeof(eprom_size), - &eprom_size)) { - printk - ("**WARNING**: Voyager couldn't read EPROM size for module 0x%x\n", - i); - outb(VOYAGER_CAT_END, CAT_CMD); - continue; - } - if (eprom_size > sizeof(eprom_buf)) { - printk - ("**WARNING**: Voyager insufficient size to read EPROM data, module 0x%x. Need %d\n", - i, eprom_size); - outb(VOYAGER_CAT_END, CAT_CMD); - continue; - } - outb(VOYAGER_CAT_END, CAT_CMD); - outb(VOYAGER_CAT_RUN, CAT_CMD); - CDEBUG(("cat_init: module 0x%x, eeprom_size %d\n", i, - eprom_size)); - if (cat_subread - (*modpp, (*modpp)->asic, 0, eprom_size, eprom_buf)) { - outb(VOYAGER_CAT_END, CAT_CMD); - continue; - } - outb(VOYAGER_CAT_END, CAT_CMD); - printk("Module \"%s\", version 0x%x, tracer 0x%x, asics %d\n", - cat_module_name(i), eprom_hdr->version_id, - *((__u32 *) eprom_hdr->tracer), eprom_hdr->num_asics); - (*modpp)->ee_size = eprom_hdr->ee_size; - (*modpp)->num_asics = eprom_hdr->num_asics; - asicpp = &((*modpp)->asic); - sp_offset = eprom_hdr->scan_path_offset; - /* All we really care about are the Quad cards. We - * identify them because they are in a processor slot - * and have only four asics */ - if ((i < 0x10 || (i >= 0x14 && i < 0x1c) || i > 0x1f)) { - modpp = &((*modpp)->next); - continue; - } - /* Now we know it's in a processor slot, does it have - * a quad baseboard submodule */ - outb(VOYAGER_CAT_RUN, CAT_CMD); - cat_read(*modpp, (*modpp)->asic, VOYAGER_SUBMODPRESENT, - &num_submodules); - /* lowest two bits, active low */ - num_submodules = ~(0xfc | num_submodules); - CDEBUG(("VOYAGER CAT: %d submodules present\n", - num_submodules)); - if (num_submodules == 0) { - /* fill in the dyadic extended processors */ - __u8 cpu = i & 0x07; - - printk("Module \"%s\": Dyadic Processor Card\n", - cat_module_name(i)); - voyager_extended_vic_processors |= (1 << cpu); - cpu += 4; - voyager_extended_vic_processors |= (1 << cpu); - outb(VOYAGER_CAT_END, CAT_CMD); - continue; - } - - /* now we want to read the asics on the first submodule, - * which should be the quad base board */ - - cat_read(*modpp, (*modpp)->asic, VOYAGER_SUBMODSELECT, &val); - CDEBUG(("cat_init: SUBMODSELECT value = 0x%x\n", val)); - val = (val & 0x7c) | VOYAGER_QUAD_BASEBOARD; - cat_write(*modpp, (*modpp)->asic, VOYAGER_SUBMODSELECT, val); - - outb(VOYAGER_CAT_END, CAT_CMD); - - CDEBUG(("cat_init: Reading eeprom for module 0x%x at offset %d\n", i, VOYAGER_XSUM_END_OFFSET)); - outb(VOYAGER_CAT_RUN, CAT_CMD); - cat_disconnect(*modpp, (*modpp)->asic); - if (cat_subread(*modpp, (*modpp)->asic, - VOYAGER_XSUM_END_OFFSET, sizeof(eprom_size), - &eprom_size)) { - printk - ("**WARNING**: Voyager couldn't read EPROM size for module 0x%x\n", - i); - outb(VOYAGER_CAT_END, CAT_CMD); - continue; - } - if (eprom_size > sizeof(eprom_buf)) { - printk - ("**WARNING**: Voyager insufficient size to read EPROM data, module 0x%x. Need %d\n", - i, eprom_size); - outb(VOYAGER_CAT_END, CAT_CMD); - continue; - } - outb(VOYAGER_CAT_END, CAT_CMD); - outb(VOYAGER_CAT_RUN, CAT_CMD); - CDEBUG(("cat_init: module 0x%x, eeprom_size %d\n", i, - eprom_size)); - if (cat_subread - (*modpp, (*modpp)->asic, 0, eprom_size, eprom_buf)) { - outb(VOYAGER_CAT_END, CAT_CMD); - continue; - } - outb(VOYAGER_CAT_END, CAT_CMD); - /* Now do everything for the QBB submodule 1 */ - (*modpp)->ee_size = eprom_hdr->ee_size; - (*modpp)->num_asics = eprom_hdr->num_asics; - asicpp = &((*modpp)->asic); - sp_offset = eprom_hdr->scan_path_offset; - /* get rid of the dummy CAT asic and read the real one */ - kfree((*modpp)->asic); - for (asic = 0; asic < (*modpp)->num_asics; asic++) { - int j; - voyager_asic_t *asicp = *asicpp = kzalloc(sizeof(voyager_asic_t), GFP_KERNEL); /*&voyager_asic_storage[asic_count++]; */ - voyager_sp_table_t *sp_table; - voyager_at_t *asic_table; - voyager_jtt_t *jtag_table; - - if (asicp == NULL) { - printk - ("**WARNING** kmalloc failure in cat_init\n"); - continue; - } - asicpp = &(asicp->next); - asicp->asic_location = asic; - sp_table = - (voyager_sp_table_t *) (eprom_buf + sp_offset); - asicp->asic_id = sp_table->asic_id; - asic_table = - (voyager_at_t *) (eprom_buf + - sp_table->asic_data_offset); - for (j = 0; j < 4; j++) - asicp->jtag_id[j] = asic_table->jtag_id[j]; - jtag_table = - (voyager_jtt_t *) (eprom_buf + - asic_table->jtag_offset); - asicp->ireg_length = jtag_table->ireg_len; - asicp->bit_location = (*modpp)->inst_bits; - (*modpp)->inst_bits += asicp->ireg_length; - if (asicp->ireg_length > (*modpp)->largest_reg) - (*modpp)->largest_reg = asicp->ireg_length; - if (asicp->ireg_length < (*modpp)->smallest_reg || - (*modpp)->smallest_reg == 0) - (*modpp)->smallest_reg = asicp->ireg_length; - CDEBUG(("asic 0x%x, ireg_length=%d, bit_location=%d\n", - asicp->asic_id, asicp->ireg_length, - asicp->bit_location)); - if (asicp->asic_id == VOYAGER_QUAD_QABC) { - CDEBUG(("VOYAGER CAT: QABC ASIC found\n")); - qabc_asic = asicp; - } - sp_offset += sizeof(voyager_sp_table_t); - } - CDEBUG(("Module inst_bits = %d, largest_reg = %d, smallest_reg=%d\n", (*modpp)->inst_bits, (*modpp)->largest_reg, (*modpp)->smallest_reg)); - /* OK, now we have the QUAD ASICs set up, use them. - * we need to: - * - * 1. Find the Memory area for the Quad CPIs. - * 2. Find the Extended VIC processor - * 3. Configure a second extended VIC processor (This - * cannot be done for the 51xx. - * */ - outb(VOYAGER_CAT_RUN, CAT_CMD); - cat_connect(*modpp, (*modpp)->asic); - CDEBUG(("CAT CONNECTED!!\n")); - cat_subread(*modpp, qabc_asic, 0, sizeof(qabc_data), qabc_data); - qic_addr = qabc_data[5] << 8; - qic_addr = (qic_addr | qabc_data[6]) << 8; - qic_addr = (qic_addr | qabc_data[7]) << 8; - printk - ("Module \"%s\": Quad Processor Card; CPI 0x%lx, SET=0x%x\n", - cat_module_name(i), qic_addr, qabc_data[8]); -#if 0 /* plumbing fails---FIXME */ - if ((qabc_data[8] & 0xf0) == 0) { - /* FIXME: 32 way 8 CPU slot monster cannot be - * plumbed this way---need to check for it */ - - printk("Plumbing second Extended Quad Processor\n"); - /* second VIC line hardwired to Quad CPU 1 */ - qabc_data[8] |= 0x20; - cat_subwrite(*modpp, qabc_asic, 8, 1, &qabc_data[8]); -#ifdef VOYAGER_CAT_DEBUG - /* verify plumbing */ - cat_subread(*modpp, qabc_asic, 8, 1, &qabc_data[8]); - if ((qabc_data[8] & 0xf0) == 0) { - CDEBUG(("PLUMBING FAILED: 0x%x\n", - qabc_data[8])); - } -#endif - } -#endif - - { - struct resource *res = - kzalloc(sizeof(struct resource), GFP_KERNEL); - res->name = kmalloc(128, GFP_KERNEL); - sprintf((char *)res->name, "Voyager %s Quad CPI", - cat_module_name(i)); - res->start = qic_addr; - res->end = qic_addr + 0x3ff; - request_resource(&iomem_resource, res); - } - - qic_addr = (unsigned long)ioremap_cache(qic_addr, 0x400); - - for (j = 0; j < 4; j++) { - __u8 cpu; - - if (voyager_8slot) { - /* 8 slot has a different mapping, - * each slot has only one vic line, so - * 1 cpu in each slot must be < 8 */ - cpu = (i & 0x07) + j * 8; - } else { - cpu = (i & 0x03) + j * 4; - } - if ((qabc_data[8] & (1 << j))) { - voyager_extended_vic_processors |= (1 << cpu); - } - if (qabc_data[8] & (1 << (j + 4))) { - /* Second SET register plumbed: Quad - * card has two VIC connected CPUs. - * Secondary cannot be booted as a VIC - * CPU */ - voyager_extended_vic_processors |= (1 << cpu); - voyager_allowed_boot_processors &= - (~(1 << cpu)); - } - - voyager_quad_processors |= (1 << cpu); - voyager_quad_cpi_addr[cpu] = (struct voyager_qic_cpi *) - (qic_addr + (j << 8)); - CDEBUG(("CPU%d: CPI address 0x%lx\n", cpu, - (unsigned long)voyager_quad_cpi_addr[cpu])); - } - outb(VOYAGER_CAT_END, CAT_CMD); - - *asicpp = NULL; - modpp = &((*modpp)->next); - } - *modpp = NULL; - printk - ("CAT Bus Initialisation finished: extended procs 0x%x, quad procs 0x%x, allowed vic boot = 0x%x\n", - voyager_extended_vic_processors, voyager_quad_processors, - voyager_allowed_boot_processors); - request_resource(&ioport_resource, &vic_res); - if (voyager_quad_processors) - request_resource(&ioport_resource, &qic_res); - /* set up the front power switch */ -} - -int voyager_cat_readb(__u8 module, __u8 asic, int reg) -{ - return 0; -} - -static int cat_disconnect(voyager_module_t * modp, voyager_asic_t * asicp) -{ - __u8 val; - int err = 0; - - if (!modp->scan_path_connected) - return 0; - if (asicp->asic_id != VOYAGER_CAT_ID) { - CDEBUG(("cat_disconnect: ASIC is not CAT\n")); - return 1; - } - err = cat_read(modp, asicp, VOYAGER_SCANPATH, &val); - if (err) { - CDEBUG(("cat_disconnect: failed to read SCANPATH\n")); - return err; - } - val &= VOYAGER_DISCONNECT_ASIC; - err = cat_write(modp, asicp, VOYAGER_SCANPATH, val); - if (err) { - CDEBUG(("cat_disconnect: failed to write SCANPATH\n")); - return err; - } - outb(VOYAGER_CAT_END, CAT_CMD); - outb(VOYAGER_CAT_RUN, CAT_CMD); - modp->scan_path_connected = 0; - - return 0; -} - -static int cat_connect(voyager_module_t * modp, voyager_asic_t * asicp) -{ - __u8 val; - int err = 0; - - if (modp->scan_path_connected) - return 0; - if (asicp->asic_id != VOYAGER_CAT_ID) { - CDEBUG(("cat_connect: ASIC is not CAT\n")); - return 1; - } - - err = cat_read(modp, asicp, VOYAGER_SCANPATH, &val); - if (err) { - CDEBUG(("cat_connect: failed to read SCANPATH\n")); - return err; - } - val |= VOYAGER_CONNECT_ASIC; - err = cat_write(modp, asicp, VOYAGER_SCANPATH, val); - if (err) { - CDEBUG(("cat_connect: failed to write SCANPATH\n")); - return err; - } - outb(VOYAGER_CAT_END, CAT_CMD); - outb(VOYAGER_CAT_RUN, CAT_CMD); - modp->scan_path_connected = 1; - - return 0; -} - -void voyager_cat_power_off(void) -{ - /* Power the machine off by writing to the PSI over the CAT - * bus */ - __u8 data; - voyager_module_t psi = { 0 }; - voyager_asic_t psi_asic = { 0 }; - - psi.asic = &psi_asic; - psi.asic->asic_id = VOYAGER_CAT_ID; - psi.asic->subaddr = VOYAGER_SUBADDR_HI; - psi.module_addr = VOYAGER_PSI; - psi.scan_path_connected = 0; - - outb(VOYAGER_CAT_END, CAT_CMD); - /* Connect the PSI to the CAT Bus */ - outb(VOYAGER_CAT_DESELECT, VOYAGER_CAT_CONFIG_PORT); - outb(VOYAGER_PSI, VOYAGER_CAT_CONFIG_PORT); - outb(VOYAGER_CAT_RUN, CAT_CMD); - cat_disconnect(&psi, &psi_asic); - /* Read the status */ - cat_subread(&psi, &psi_asic, VOYAGER_PSI_GENERAL_REG, 1, &data); - outb(VOYAGER_CAT_END, CAT_CMD); - CDEBUG(("PSI STATUS 0x%x\n", data)); - /* These two writes are power off prep and perform */ - data = PSI_CLEAR; - outb(VOYAGER_CAT_RUN, CAT_CMD); - cat_subwrite(&psi, &psi_asic, VOYAGER_PSI_GENERAL_REG, 1, &data); - outb(VOYAGER_CAT_END, CAT_CMD); - data = PSI_POWER_DOWN; - outb(VOYAGER_CAT_RUN, CAT_CMD); - cat_subwrite(&psi, &psi_asic, VOYAGER_PSI_GENERAL_REG, 1, &data); - outb(VOYAGER_CAT_END, CAT_CMD); -} - -struct voyager_status voyager_status = { 0 }; - -void voyager_cat_psi(__u8 cmd, __u16 reg, __u8 * data) -{ - voyager_module_t psi = { 0 }; - voyager_asic_t psi_asic = { 0 }; - - psi.asic = &psi_asic; - psi.asic->asic_id = VOYAGER_CAT_ID; - psi.asic->subaddr = VOYAGER_SUBADDR_HI; - psi.module_addr = VOYAGER_PSI; - psi.scan_path_connected = 0; - - outb(VOYAGER_CAT_END, CAT_CMD); - /* Connect the PSI to the CAT Bus */ - outb(VOYAGER_CAT_DESELECT, VOYAGER_CAT_CONFIG_PORT); - outb(VOYAGER_PSI, VOYAGER_CAT_CONFIG_PORT); - outb(VOYAGER_CAT_RUN, CAT_CMD); - cat_disconnect(&psi, &psi_asic); - switch (cmd) { - case VOYAGER_PSI_READ: - cat_read(&psi, &psi_asic, reg, data); - break; - case VOYAGER_PSI_WRITE: - cat_write(&psi, &psi_asic, reg, *data); - break; - case VOYAGER_PSI_SUBREAD: - cat_subread(&psi, &psi_asic, reg, 1, data); - break; - case VOYAGER_PSI_SUBWRITE: - cat_subwrite(&psi, &psi_asic, reg, 1, data); - break; - default: - printk(KERN_ERR "Voyager PSI, unrecognised command %d\n", cmd); - break; - } - outb(VOYAGER_CAT_END, CAT_CMD); -} - -void voyager_cat_do_common_interrupt(void) -{ - /* This is caused either by a memory parity error or something - * in the PSI */ - __u8 data; - voyager_module_t psi = { 0 }; - voyager_asic_t psi_asic = { 0 }; - struct voyager_psi psi_reg; - int i; - re_read: - psi.asic = &psi_asic; - psi.asic->asic_id = VOYAGER_CAT_ID; - psi.asic->subaddr = VOYAGER_SUBADDR_HI; - psi.module_addr = VOYAGER_PSI; - psi.scan_path_connected = 0; - - outb(VOYAGER_CAT_END, CAT_CMD); - /* Connect the PSI to the CAT Bus */ - outb(VOYAGER_CAT_DESELECT, VOYAGER_CAT_CONFIG_PORT); - outb(VOYAGER_PSI, VOYAGER_CAT_CONFIG_PORT); - outb(VOYAGER_CAT_RUN, CAT_CMD); - cat_disconnect(&psi, &psi_asic); - /* Read the status. NOTE: Need to read *all* the PSI regs here - * otherwise the cmn int will be reasserted */ - for (i = 0; i < sizeof(psi_reg.regs); i++) { - cat_read(&psi, &psi_asic, i, &((__u8 *) & psi_reg.regs)[i]); - } - outb(VOYAGER_CAT_END, CAT_CMD); - if ((psi_reg.regs.checkbit & 0x02) == 0) { - psi_reg.regs.checkbit |= 0x02; - cat_write(&psi, &psi_asic, 5, psi_reg.regs.checkbit); - printk("VOYAGER RE-READ PSI\n"); - goto re_read; - } - outb(VOYAGER_CAT_RUN, CAT_CMD); - for (i = 0; i < sizeof(psi_reg.subregs); i++) { - /* This looks strange, but the PSI doesn't do auto increment - * correctly */ - cat_subread(&psi, &psi_asic, VOYAGER_PSI_SUPPLY_REG + i, - 1, &((__u8 *) & psi_reg.subregs)[i]); - } - outb(VOYAGER_CAT_END, CAT_CMD); -#ifdef VOYAGER_CAT_DEBUG - printk("VOYAGER PSI: "); - for (i = 0; i < sizeof(psi_reg.regs); i++) - printk("%02x ", ((__u8 *) & psi_reg.regs)[i]); - printk("\n "); - for (i = 0; i < sizeof(psi_reg.subregs); i++) - printk("%02x ", ((__u8 *) & psi_reg.subregs)[i]); - printk("\n"); -#endif - if (psi_reg.regs.intstatus & PSI_MON) { - /* switch off or power fail */ - - if (psi_reg.subregs.supply & PSI_SWITCH_OFF) { - if (voyager_status.switch_off) { - printk(KERN_ERR - "Voyager front panel switch turned off again---Immediate power off!\n"); - voyager_cat_power_off(); - /* not reached */ - } else { - printk(KERN_ERR - "Voyager front panel switch turned off\n"); - voyager_status.switch_off = 1; - voyager_status.request_from_kernel = 1; - wake_up_process(voyager_thread); - } - /* Tell the hardware we're taking care of the - * shutdown, otherwise it will power the box off - * within 3 seconds of the switch being pressed and, - * which is much more important to us, continue to - * assert the common interrupt */ - data = PSI_CLR_SWITCH_OFF; - outb(VOYAGER_CAT_RUN, CAT_CMD); - cat_subwrite(&psi, &psi_asic, VOYAGER_PSI_SUPPLY_REG, - 1, &data); - outb(VOYAGER_CAT_END, CAT_CMD); - } else { - - VDEBUG(("Voyager ac fail reg 0x%x\n", - psi_reg.subregs.ACfail)); - if ((psi_reg.subregs.ACfail & AC_FAIL_STAT_CHANGE) == 0) { - /* No further update */ - return; - } -#if 0 - /* Don't bother trying to find out who failed. - * FIXME: This probably makes the code incorrect on - * anything other than a 345x */ - for (i = 0; i < 5; i++) { - if (psi_reg.subregs.ACfail & (1 << i)) { - break; - } - } - printk(KERN_NOTICE "AC FAIL IN SUPPLY %d\n", i); -#endif - /* DON'T do this: it shuts down the AC PSI - outb(VOYAGER_CAT_RUN, CAT_CMD); - data = PSI_MASK_MASK | i; - cat_subwrite(&psi, &psi_asic, VOYAGER_PSI_MASK, - 1, &data); - outb(VOYAGER_CAT_END, CAT_CMD); - */ - printk(KERN_ERR "Voyager AC power failure\n"); - outb(VOYAGER_CAT_RUN, CAT_CMD); - data = PSI_COLD_START; - cat_subwrite(&psi, &psi_asic, VOYAGER_PSI_GENERAL_REG, - 1, &data); - outb(VOYAGER_CAT_END, CAT_CMD); - voyager_status.power_fail = 1; - voyager_status.request_from_kernel = 1; - wake_up_process(voyager_thread); - } - - } else if (psi_reg.regs.intstatus & PSI_FAULT) { - /* Major fault! */ - printk(KERN_ERR - "Voyager PSI Detected major fault, immediate power off!\n"); - voyager_cat_power_off(); - /* not reached */ - } else if (psi_reg.regs.intstatus & (PSI_DC_FAIL | PSI_ALARM - | PSI_CURRENT | PSI_DVM - | PSI_PSCFAULT | PSI_STAT_CHG)) { - /* other psi fault */ - - printk(KERN_WARNING "Voyager PSI status 0x%x\n", data); - /* clear the PSI fault */ - outb(VOYAGER_CAT_RUN, CAT_CMD); - cat_write(&psi, &psi_asic, VOYAGER_PSI_STATUS_REG, 0); - outb(VOYAGER_CAT_END, CAT_CMD); - } -} Index: linux-2.6-tip/arch/x86/mach-voyager/voyager_smp.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-voyager/voyager_smp.c +++ /dev/null @@ -1,1807 +0,0 @@ -/* -*- mode: c; c-basic-offset: 8 -*- */ - -/* Copyright (C) 1999,2001 - * - * Author: J.E.J.Bottomley@HansenPartnership.com - * - * This file provides all the same external entries as smp.c but uses - * the voyager hal to provide the functionality - */ -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -/* TLB state -- visible externally, indexed physically */ -DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = { &init_mm, 0 }; - -/* CPU IRQ affinity -- set to all ones initially */ -static unsigned long cpu_irq_affinity[NR_CPUS] __cacheline_aligned = - {[0 ... NR_CPUS-1] = ~0UL }; - -/* per CPU data structure (for /proc/cpuinfo et al), visible externally - * indexed physically */ -DEFINE_PER_CPU_SHARED_ALIGNED(struct cpuinfo_x86, cpu_info); -EXPORT_PER_CPU_SYMBOL(cpu_info); - -/* physical ID of the CPU used to boot the system */ -unsigned char boot_cpu_id; - -/* The memory line addresses for the Quad CPIs */ -struct voyager_qic_cpi *voyager_quad_cpi_addr[NR_CPUS] __cacheline_aligned; - -/* The masks for the Extended VIC processors, filled in by cat_init */ -__u32 voyager_extended_vic_processors = 0; - -/* Masks for the extended Quad processors which cannot be VIC booted */ -__u32 voyager_allowed_boot_processors = 0; - -/* The mask for the Quad Processors (both extended and non-extended) */ -__u32 voyager_quad_processors = 0; - -/* Total count of live CPUs, used in process.c to display - * the CPU information and in irq.c for the per CPU irq - * activity count. Finally exported by i386_ksyms.c */ -static int voyager_extended_cpus = 1; - -/* Used for the invalidate map that's also checked in the spinlock */ -static volatile unsigned long smp_invalidate_needed; - -/* Bitmask of CPUs present in the system - exported by i386_syms.c, used - * by scheduler but indexed physically */ -cpumask_t phys_cpu_present_map = CPU_MASK_NONE; - -/* The internal functions */ -static void send_CPI(__u32 cpuset, __u8 cpi); -static void ack_CPI(__u8 cpi); -static int ack_QIC_CPI(__u8 cpi); -static void ack_special_QIC_CPI(__u8 cpi); -static void ack_VIC_CPI(__u8 cpi); -static void send_CPI_allbutself(__u8 cpi); -static void mask_vic_irq(unsigned int irq); -static void unmask_vic_irq(unsigned int irq); -static unsigned int startup_vic_irq(unsigned int irq); -static void enable_local_vic_irq(unsigned int irq); -static void disable_local_vic_irq(unsigned int irq); -static void before_handle_vic_irq(unsigned int irq); -static void after_handle_vic_irq(unsigned int irq); -static void set_vic_irq_affinity(unsigned int irq, const struct cpumask *mask); -static void ack_vic_irq(unsigned int irq); -static void vic_enable_cpi(void); -static void do_boot_cpu(__u8 cpuid); -static void do_quad_bootstrap(void); -static void initialize_secondary(void); - -int hard_smp_processor_id(void); -int safe_smp_processor_id(void); - -/* Inline functions */ -static inline void send_one_QIC_CPI(__u8 cpu, __u8 cpi) -{ - voyager_quad_cpi_addr[cpu]->qic_cpi[cpi].cpi = - (smp_processor_id() << 16) + cpi; -} - -static inline void send_QIC_CPI(__u32 cpuset, __u8 cpi) -{ - int cpu; - - for_each_online_cpu(cpu) { - if (cpuset & (1 << cpu)) { -#ifdef VOYAGER_DEBUG - if (!cpu_online(cpu)) - VDEBUG(("CPU%d sending cpi %d to CPU%d not in " - "cpu_online_map\n", - hard_smp_processor_id(), cpi, cpu)); -#endif - send_one_QIC_CPI(cpu, cpi - QIC_CPI_OFFSET); - } - } -} - -static inline void wrapper_smp_local_timer_interrupt(void) -{ - irq_enter(); - smp_local_timer_interrupt(); - irq_exit(); -} - -static inline void send_one_CPI(__u8 cpu, __u8 cpi) -{ - if (voyager_quad_processors & (1 << cpu)) - send_one_QIC_CPI(cpu, cpi - QIC_CPI_OFFSET); - else - send_CPI(1 << cpu, cpi); -} - -static inline void send_CPI_allbutself(__u8 cpi) -{ - __u8 cpu = smp_processor_id(); - __u32 mask = cpus_addr(cpu_online_map)[0] & ~(1 << cpu); - send_CPI(mask, cpi); -} - -static inline int is_cpu_quad(void) -{ - __u8 cpumask = inb(VIC_PROC_WHO_AM_I); - return ((cpumask & QUAD_IDENTIFIER) == QUAD_IDENTIFIER); -} - -static inline int is_cpu_extended(void) -{ - __u8 cpu = hard_smp_processor_id(); - - return (voyager_extended_vic_processors & (1 << cpu)); -} - -static inline int is_cpu_vic_boot(void) -{ - __u8 cpu = hard_smp_processor_id(); - - return (voyager_extended_vic_processors - & voyager_allowed_boot_processors & (1 << cpu)); -} - -static inline void ack_CPI(__u8 cpi) -{ - switch (cpi) { - case VIC_CPU_BOOT_CPI: - if (is_cpu_quad() && !is_cpu_vic_boot()) - ack_QIC_CPI(cpi); - else - ack_VIC_CPI(cpi); - break; - case VIC_SYS_INT: - case VIC_CMN_INT: - /* These are slightly strange. Even on the Quad card, - * They are vectored as VIC CPIs */ - if (is_cpu_quad()) - ack_special_QIC_CPI(cpi); - else - ack_VIC_CPI(cpi); - break; - default: - printk("VOYAGER ERROR: CPI%d is in common CPI code\n", cpi); - break; - } -} - -/* local variables */ - -/* The VIC IRQ descriptors -- these look almost identical to the - * 8259 IRQs except that masks and things must be kept per processor - */ -static struct irq_chip vic_chip = { - .name = "VIC", - .startup = startup_vic_irq, - .mask = mask_vic_irq, - .unmask = unmask_vic_irq, - .set_affinity = set_vic_irq_affinity, -}; - -/* used to count up as CPUs are brought on line (starts at 0) */ -static int cpucount = 0; - -/* The per cpu profile stuff - used in smp_local_timer_interrupt */ -static DEFINE_PER_CPU(int, prof_multiplier) = 1; -static DEFINE_PER_CPU(int, prof_old_multiplier) = 1; -static DEFINE_PER_CPU(int, prof_counter) = 1; - -/* the map used to check if a CPU has booted */ -static __u32 cpu_booted_map; - -/* the synchronize flag used to hold all secondary CPUs spinning in - * a tight loop until the boot sequence is ready for them */ -static cpumask_t smp_commenced_mask = CPU_MASK_NONE; - -/* This is for the new dynamic CPU boot code */ - -/* The per processor IRQ masks (these are usually kept in sync) */ -static __u16 vic_irq_mask[NR_CPUS] __cacheline_aligned; - -/* the list of IRQs to be enabled by the VIC_ENABLE_IRQ_CPI */ -static __u16 vic_irq_enable_mask[NR_CPUS] __cacheline_aligned = { 0 }; - -/* Lock for enable/disable of VIC interrupts */ -static __cacheline_aligned DEFINE_SPINLOCK(vic_irq_lock); - -/* The boot processor is correctly set up in PC mode when it - * comes up, but the secondaries need their master/slave 8259 - * pairs initializing correctly */ - -/* Interrupt counters (per cpu) and total - used to try to - * even up the interrupt handling routines */ -static long vic_intr_total = 0; -static long vic_intr_count[NR_CPUS] __cacheline_aligned = { 0 }; -static unsigned long vic_tick[NR_CPUS] __cacheline_aligned = { 0 }; - -/* Since we can only use CPI0, we fake all the other CPIs */ -static unsigned long vic_cpi_mailbox[NR_CPUS] __cacheline_aligned; - -/* debugging routine to read the isr of the cpu's pic */ -static inline __u16 vic_read_isr(void) -{ - __u16 isr; - - outb(0x0b, 0xa0); - isr = inb(0xa0) << 8; - outb(0x0b, 0x20); - isr |= inb(0x20); - - return isr; -} - -static __init void qic_setup(void) -{ - if (!is_cpu_quad()) { - /* not a quad, no setup */ - return; - } - outb(QIC_DEFAULT_MASK0, QIC_MASK_REGISTER0); - outb(QIC_CPI_ENABLE, QIC_MASK_REGISTER1); - - if (is_cpu_extended()) { - /* the QIC duplicate of the VIC base register */ - outb(VIC_DEFAULT_CPI_BASE, QIC_VIC_CPI_BASE_REGISTER); - outb(QIC_DEFAULT_CPI_BASE, QIC_CPI_BASE_REGISTER); - - /* FIXME: should set up the QIC timer and memory parity - * error vectors here */ - } -} - -static __init void vic_setup_pic(void) -{ - outb(1, VIC_REDIRECT_REGISTER_1); - /* clear the claim registers for dynamic routing */ - outb(0, VIC_CLAIM_REGISTER_0); - outb(0, VIC_CLAIM_REGISTER_1); - - outb(0, VIC_PRIORITY_REGISTER); - /* Set the Primary and Secondary Microchannel vector - * bases to be the same as the ordinary interrupts - * - * FIXME: This would be more efficient using separate - * vectors. */ - outb(FIRST_EXTERNAL_VECTOR, VIC_PRIMARY_MC_BASE); - outb(FIRST_EXTERNAL_VECTOR, VIC_SECONDARY_MC_BASE); - /* Now initiallise the master PIC belonging to this CPU by - * sending the four ICWs */ - - /* ICW1: level triggered, ICW4 needed */ - outb(0x19, 0x20); - - /* ICW2: vector base */ - outb(FIRST_EXTERNAL_VECTOR, 0x21); - - /* ICW3: slave at line 2 */ - outb(0x04, 0x21); - - /* ICW4: 8086 mode */ - outb(0x01, 0x21); - - /* now the same for the slave PIC */ - - /* ICW1: level trigger, ICW4 needed */ - outb(0x19, 0xA0); - - /* ICW2: slave vector base */ - outb(FIRST_EXTERNAL_VECTOR + 8, 0xA1); - - /* ICW3: slave ID */ - outb(0x02, 0xA1); - - /* ICW4: 8086 mode */ - outb(0x01, 0xA1); -} - -static void do_quad_bootstrap(void) -{ - if (is_cpu_quad() && is_cpu_vic_boot()) { - int i; - unsigned long flags; - __u8 cpuid = hard_smp_processor_id(); - - local_irq_save(flags); - - for (i = 0; i < 4; i++) { - /* FIXME: this would be >>3 &0x7 on the 32 way */ - if (((cpuid >> 2) & 0x03) == i) - /* don't lower our own mask! */ - continue; - - /* masquerade as local Quad CPU */ - outb(QIC_CPUID_ENABLE | i, QIC_PROCESSOR_ID); - /* enable the startup CPI */ - outb(QIC_BOOT_CPI_MASK, QIC_MASK_REGISTER1); - /* restore cpu id */ - outb(0, QIC_PROCESSOR_ID); - } - local_irq_restore(flags); - } -} - -void prefill_possible_map(void) -{ - /* This is empty on voyager because we need a much - * earlier detection which is done in find_smp_config */ -} - -/* Set up all the basic stuff: read the SMP config and make all the - * SMP information reflect only the boot cpu. All others will be - * brought on-line later. */ -void __init find_smp_config(void) -{ - int i; - - boot_cpu_id = hard_smp_processor_id(); - - printk("VOYAGER SMP: Boot cpu is %d\n", boot_cpu_id); - - /* initialize the CPU structures (moved from smp_boot_cpus) */ - for (i = 0; i < nr_cpu_ids; i++) - cpu_irq_affinity[i] = ~0; - cpu_online_map = cpumask_of_cpu(boot_cpu_id); - - /* The boot CPU must be extended */ - voyager_extended_vic_processors = 1 << boot_cpu_id; - /* initially, all of the first 8 CPUs can boot */ - voyager_allowed_boot_processors = 0xff; - /* set up everything for just this CPU, we can alter - * this as we start the other CPUs later */ - /* now get the CPU disposition from the extended CMOS */ - cpus_addr(phys_cpu_present_map)[0] = - voyager_extended_cmos_read(VOYAGER_PROCESSOR_PRESENT_MASK); - cpus_addr(phys_cpu_present_map)[0] |= - voyager_extended_cmos_read(VOYAGER_PROCESSOR_PRESENT_MASK + 1) << 8; - cpus_addr(phys_cpu_present_map)[0] |= - voyager_extended_cmos_read(VOYAGER_PROCESSOR_PRESENT_MASK + - 2) << 16; - cpus_addr(phys_cpu_present_map)[0] |= - voyager_extended_cmos_read(VOYAGER_PROCESSOR_PRESENT_MASK + - 3) << 24; - init_cpu_possible(&phys_cpu_present_map); - printk("VOYAGER SMP: phys_cpu_present_map = 0x%lx\n", - cpus_addr(phys_cpu_present_map)[0]); - /* Here we set up the VIC to enable SMP */ - /* enable the CPIs by writing the base vector to their register */ - outb(VIC_DEFAULT_CPI_BASE, VIC_CPI_BASE_REGISTER); - outb(1, VIC_REDIRECT_REGISTER_1); - /* set the claim registers for static routing --- Boot CPU gets - * all interrupts untill all other CPUs started */ - outb(0xff, VIC_CLAIM_REGISTER_0); - outb(0xff, VIC_CLAIM_REGISTER_1); - /* Set the Primary and Secondary Microchannel vector - * bases to be the same as the ordinary interrupts - * - * FIXME: This would be more efficient using separate - * vectors. */ - outb(FIRST_EXTERNAL_VECTOR, VIC_PRIMARY_MC_BASE); - outb(FIRST_EXTERNAL_VECTOR, VIC_SECONDARY_MC_BASE); - - /* Finally tell the firmware that we're driving */ - outb(inb(VOYAGER_SUS_IN_CONTROL_PORT) | VOYAGER_IN_CONTROL_FLAG, - VOYAGER_SUS_IN_CONTROL_PORT); - - current_thread_info()->cpu = boot_cpu_id; - x86_write_percpu(cpu_number, boot_cpu_id); -} - -/* - * The bootstrap kernel entry code has set these up. Save them - * for a given CPU, id is physical */ -void __init smp_store_cpu_info(int id) -{ - struct cpuinfo_x86 *c = &cpu_data(id); - - *c = boot_cpu_data; - c->cpu_index = id; - - identify_secondary_cpu(c); -} - -/* Routine initially called when a non-boot CPU is brought online */ -static void __init start_secondary(void *unused) -{ - __u8 cpuid = hard_smp_processor_id(); - - cpu_init(); - - /* OK, we're in the routine */ - ack_CPI(VIC_CPU_BOOT_CPI); - - /* setup the 8259 master slave pair belonging to this CPU --- - * we won't actually receive any until the boot CPU - * relinquishes it's static routing mask */ - vic_setup_pic(); - - qic_setup(); - - if (is_cpu_quad() && !is_cpu_vic_boot()) { - /* clear the boot CPI */ - __u8 dummy; - - dummy = - voyager_quad_cpi_addr[cpuid]->qic_cpi[VIC_CPU_BOOT_CPI].cpi; - printk("read dummy %d\n", dummy); - } - - /* lower the mask to receive CPIs */ - vic_enable_cpi(); - - VDEBUG(("VOYAGER SMP: CPU%d, stack at about %p\n", cpuid, &cpuid)); - - notify_cpu_starting(cpuid); - - /* enable interrupts */ - local_irq_enable(); - - /* get our bogomips */ - calibrate_delay(); - - /* save our processor parameters */ - smp_store_cpu_info(cpuid); - - /* if we're a quad, we may need to bootstrap other CPUs */ - do_quad_bootstrap(); - - /* FIXME: this is rather a poor hack to prevent the CPU - * activating softirqs while it's supposed to be waiting for - * permission to proceed. Without this, the new per CPU stuff - * in the softirqs will fail */ - local_irq_disable(); - cpu_set(cpuid, cpu_callin_map); - - /* signal that we're done */ - cpu_booted_map = 1; - - while (!cpu_isset(cpuid, smp_commenced_mask)) - rep_nop(); - local_irq_enable(); - - local_flush_tlb(); - - cpu_set(cpuid, cpu_online_map); - wmb(); - cpu_idle(); -} - -/* Routine to kick start the given CPU and wait for it to report ready - * (or timeout in startup). When this routine returns, the requested - * CPU is either fully running and configured or known to be dead. - * - * We call this routine sequentially 1 CPU at a time, so no need for - * locking */ - -static void __init do_boot_cpu(__u8 cpu) -{ - struct task_struct *idle; - int timeout; - unsigned long flags; - int quad_boot = (1 << cpu) & voyager_quad_processors - & ~(voyager_extended_vic_processors - & voyager_allowed_boot_processors); - - /* This is the format of the CPI IDT gate (in real mode) which - * we're hijacking to boot the CPU */ - union IDTFormat { - struct seg { - __u16 Offset; - __u16 Segment; - } idt; - __u32 val; - } hijack_source; - - __u32 *hijack_vector; - __u32 start_phys_address = setup_trampoline(); - - /* There's a clever trick to this: The linux trampoline is - * compiled to begin at absolute location zero, so make the - * address zero but have the data segment selector compensate - * for the actual address */ - hijack_source.idt.Offset = start_phys_address & 0x000F; - hijack_source.idt.Segment = (start_phys_address >> 4) & 0xFFFF; - - cpucount++; - alternatives_smp_switch(1); - - idle = fork_idle(cpu); - if (IS_ERR(idle)) - panic("failed fork for CPU%d", cpu); - idle->thread.ip = (unsigned long)start_secondary; - /* init_tasks (in sched.c) is indexed logically */ - stack_start.sp = (void *)idle->thread.sp; - - init_gdt(cpu); - per_cpu(current_task, cpu) = idle; - early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu); - irq_ctx_init(cpu); - - /* Note: Don't modify initial ss override */ - VDEBUG(("VOYAGER SMP: Booting CPU%d at 0x%lx[%x:%x], stack %p\n", cpu, - (unsigned long)hijack_source.val, hijack_source.idt.Segment, - hijack_source.idt.Offset, stack_start.sp)); - - /* init lowmem identity mapping */ - clone_pgd_range(swapper_pg_dir, swapper_pg_dir + KERNEL_PGD_BOUNDARY, - min_t(unsigned long, KERNEL_PGD_PTRS, KERNEL_PGD_BOUNDARY)); - flush_tlb_all(); - - if (quad_boot) { - printk("CPU %d: non extended Quad boot\n", cpu); - hijack_vector = - (__u32 *) - phys_to_virt((VIC_CPU_BOOT_CPI + QIC_DEFAULT_CPI_BASE) * 4); - *hijack_vector = hijack_source.val; - } else { - printk("CPU%d: extended VIC boot\n", cpu); - hijack_vector = - (__u32 *) - phys_to_virt((VIC_CPU_BOOT_CPI + VIC_DEFAULT_CPI_BASE) * 4); - *hijack_vector = hijack_source.val; - /* VIC errata, may also receive interrupt at this address */ - hijack_vector = - (__u32 *) - phys_to_virt((VIC_CPU_BOOT_ERRATA_CPI + - VIC_DEFAULT_CPI_BASE) * 4); - *hijack_vector = hijack_source.val; - } - /* All non-boot CPUs start with interrupts fully masked. Need - * to lower the mask of the CPI we're about to send. We do - * this in the VIC by masquerading as the processor we're - * about to boot and lowering its interrupt mask */ - local_irq_save(flags); - if (quad_boot) { - send_one_QIC_CPI(cpu, VIC_CPU_BOOT_CPI); - } else { - outb(VIC_CPU_MASQUERADE_ENABLE | cpu, VIC_PROCESSOR_ID); - /* here we're altering registers belonging to `cpu' */ - - outb(VIC_BOOT_INTERRUPT_MASK, 0x21); - /* now go back to our original identity */ - outb(boot_cpu_id, VIC_PROCESSOR_ID); - - /* and boot the CPU */ - - send_CPI((1 << cpu), VIC_CPU_BOOT_CPI); - } - cpu_booted_map = 0; - local_irq_restore(flags); - - /* now wait for it to become ready (or timeout) */ - for (timeout = 0; timeout < 50000; timeout++) { - if (cpu_booted_map) - break; - udelay(100); - } - /* reset the page table */ - zap_low_mappings(); - - if (cpu_booted_map) { - VDEBUG(("CPU%d: Booted successfully, back in CPU %d\n", - cpu, smp_processor_id())); - - printk("CPU%d: ", cpu); - print_cpu_info(&cpu_data(cpu)); - wmb(); - cpu_set(cpu, cpu_callout_map); - cpu_set(cpu, cpu_present_map); - } else { - printk("CPU%d FAILED TO BOOT: ", cpu); - if (* - ((volatile unsigned char *)phys_to_virt(start_phys_address)) - == 0xA5) - printk("Stuck.\n"); - else - printk("Not responding.\n"); - - cpucount--; - } -} - -void __init smp_boot_cpus(void) -{ - int i; - - /* CAT BUS initialisation must be done after the memory */ - /* FIXME: The L4 has a catbus too, it just needs to be - * accessed in a totally different way */ - if (voyager_level == 5) { - voyager_cat_init(); - - /* now that the cat has probed the Voyager System Bus, sanity - * check the cpu map */ - if (((voyager_quad_processors | voyager_extended_vic_processors) - & cpus_addr(phys_cpu_present_map)[0]) != - cpus_addr(phys_cpu_present_map)[0]) { - /* should panic */ - printk("\n\n***WARNING*** " - "Sanity check of CPU present map FAILED\n"); - } - } else if (voyager_level == 4) - voyager_extended_vic_processors = - cpus_addr(phys_cpu_present_map)[0]; - - /* this sets up the idle task to run on the current cpu */ - voyager_extended_cpus = 1; - /* Remove the global_irq_holder setting, it triggers a BUG() on - * schedule at the moment */ - //global_irq_holder = boot_cpu_id; - - /* FIXME: Need to do something about this but currently only works - * on CPUs with a tsc which none of mine have. - smp_tune_scheduling(); - */ - smp_store_cpu_info(boot_cpu_id); - /* setup the jump vector */ - initial_code = (unsigned long)initialize_secondary; - printk("CPU%d: ", boot_cpu_id); - print_cpu_info(&cpu_data(boot_cpu_id)); - - if (is_cpu_quad()) { - /* booting on a Quad CPU */ - printk("VOYAGER SMP: Boot CPU is Quad\n"); - qic_setup(); - do_quad_bootstrap(); - } - - /* enable our own CPIs */ - vic_enable_cpi(); - - cpu_set(boot_cpu_id, cpu_online_map); - cpu_set(boot_cpu_id, cpu_callout_map); - - /* loop over all the extended VIC CPUs and boot them. The - * Quad CPUs must be bootstrapped by their extended VIC cpu */ - for (i = 0; i < nr_cpu_ids; i++) { - if (i == boot_cpu_id || !cpu_isset(i, phys_cpu_present_map)) - continue; - do_boot_cpu(i); - /* This udelay seems to be needed for the Quad boots - * don't remove unless you know what you're doing */ - udelay(1000); - } - /* we could compute the total bogomips here, but why bother?, - * Code added from smpboot.c */ - { - unsigned long bogosum = 0; - - for_each_online_cpu(i) - bogosum += cpu_data(i).loops_per_jiffy; - printk(KERN_INFO "Total of %d processors activated " - "(%lu.%02lu BogoMIPS).\n", - cpucount + 1, bogosum / (500000 / HZ), - (bogosum / (5000 / HZ)) % 100); - } - voyager_extended_cpus = hweight32(voyager_extended_vic_processors); - printk("VOYAGER: Extended (interrupt handling CPUs): " - "%d, non-extended: %d\n", voyager_extended_cpus, - num_booting_cpus() - voyager_extended_cpus); - /* that's it, switch to symmetric mode */ - outb(0, VIC_PRIORITY_REGISTER); - outb(0, VIC_CLAIM_REGISTER_0); - outb(0, VIC_CLAIM_REGISTER_1); - - VDEBUG(("VOYAGER SMP: Booted with %d CPUs\n", num_booting_cpus())); -} - -/* Reload the secondary CPUs task structure (this function does not - * return ) */ -static void __init initialize_secondary(void) -{ -#if 0 - // AC kernels only - set_current(hard_get_current()); -#endif - - /* - * We don't actually need to load the full TSS, - * basically just the stack pointer and the eip. - */ - - asm volatile ("movl %0,%%esp\n\t" - "jmp *%1"::"r" (current->thread.sp), - "r"(current->thread.ip)); -} - -/* handle a Voyager SYS_INT -- If we don't, the base board will - * panic the system. - * - * System interrupts occur because some problem was detected on the - * various busses. To find out what you have to probe all the - * hardware via the CAT bus. FIXME: At the moment we do nothing. */ -void smp_vic_sys_interrupt(struct pt_regs *regs) -{ - ack_CPI(VIC_SYS_INT); - printk("Voyager SYSTEM INTERRUPT\n"); -} - -/* Handle a voyager CMN_INT; These interrupts occur either because of - * a system status change or because a single bit memory error - * occurred. FIXME: At the moment, ignore all this. */ -void smp_vic_cmn_interrupt(struct pt_regs *regs) -{ - static __u8 in_cmn_int = 0; - static DEFINE_SPINLOCK(cmn_int_lock); - - /* common ints are broadcast, so make sure we only do this once */ - _raw_spin_lock(&cmn_int_lock); - if (in_cmn_int) - goto unlock_end; - - in_cmn_int++; - _raw_spin_unlock(&cmn_int_lock); - - VDEBUG(("Voyager COMMON INTERRUPT\n")); - - if (voyager_level == 5) - voyager_cat_do_common_interrupt(); - - _raw_spin_lock(&cmn_int_lock); - in_cmn_int = 0; - unlock_end: - _raw_spin_unlock(&cmn_int_lock); - ack_CPI(VIC_CMN_INT); -} - -/* - * Reschedule call back. Nothing to do, all the work is done - * automatically when we return from the interrupt. */ -static void smp_reschedule_interrupt(void) -{ - /* do nothing */ -} - -static struct mm_struct *flush_mm; -static unsigned long flush_va; -static DEFINE_SPINLOCK(tlbstate_lock); - -/* - * We cannot call mmdrop() because we are in interrupt context, - * instead update mm->cpu_vm_mask. - * - * We need to reload %cr3 since the page tables may be going - * away from under us.. - */ -static inline void voyager_leave_mm(unsigned long cpu) -{ - if (per_cpu(cpu_tlbstate, cpu).state == TLBSTATE_OK) - BUG(); - cpu_clear(cpu, per_cpu(cpu_tlbstate, cpu).active_mm->cpu_vm_mask); - load_cr3(swapper_pg_dir); -} - -/* - * Invalidate call-back - */ -static void smp_invalidate_interrupt(void) -{ - __u8 cpu = smp_processor_id(); - - if (!test_bit(cpu, &smp_invalidate_needed)) - return; - /* This will flood messages. Don't uncomment unless you see - * Problems with cross cpu invalidation - VDEBUG(("VOYAGER SMP: CPU%d received INVALIDATE_CPI\n", - smp_processor_id())); - */ - - if (flush_mm == per_cpu(cpu_tlbstate, cpu).active_mm) { - if (per_cpu(cpu_tlbstate, cpu).state == TLBSTATE_OK) { - if (flush_va == TLB_FLUSH_ALL) - local_flush_tlb(); - else - __flush_tlb_one(flush_va); - } else - voyager_leave_mm(cpu); - } - smp_mb__before_clear_bit(); - clear_bit(cpu, &smp_invalidate_needed); - smp_mb__after_clear_bit(); -} - -/* All the new flush operations for 2.4 */ - -/* This routine is called with a physical cpu mask */ -static void -voyager_flush_tlb_others(unsigned long cpumask, struct mm_struct *mm, - unsigned long va) -{ - int stuck = 50000; - - if (!cpumask) - BUG(); - if ((cpumask & cpus_addr(cpu_online_map)[0]) != cpumask) - BUG(); - if (cpumask & (1 << smp_processor_id())) - BUG(); - if (!mm) - BUG(); - - spin_lock(&tlbstate_lock); - - flush_mm = mm; - flush_va = va; - atomic_set_mask(cpumask, &smp_invalidate_needed); - /* - * We have to send the CPI only to - * CPUs affected. - */ - send_CPI(cpumask, VIC_INVALIDATE_CPI); - - while (smp_invalidate_needed) { - mb(); - if (--stuck == 0) { - printk("***WARNING*** Stuck doing invalidate CPI " - "(CPU%d)\n", smp_processor_id()); - break; - } - } - - /* Uncomment only to debug invalidation problems - VDEBUG(("VOYAGER SMP: Completed invalidate CPI (CPU%d)\n", cpu)); - */ - - flush_mm = NULL; - flush_va = 0; - spin_unlock(&tlbstate_lock); -} - -void flush_tlb_current_task(void) -{ - struct mm_struct *mm = current->mm; - unsigned long cpu_mask; - - preempt_disable(); - - cpu_mask = cpus_addr(mm->cpu_vm_mask)[0] & ~(1 << smp_processor_id()); - local_flush_tlb(); - if (cpu_mask) - voyager_flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL); - - preempt_enable(); -} - -void flush_tlb_mm(struct mm_struct *mm) -{ - unsigned long cpu_mask; - - preempt_disable(); - - cpu_mask = cpus_addr(mm->cpu_vm_mask)[0] & ~(1 << smp_processor_id()); - - if (current->active_mm == mm) { - if (current->mm) - local_flush_tlb(); - else - voyager_leave_mm(smp_processor_id()); - } - if (cpu_mask) - voyager_flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL); - - preempt_enable(); -} - -void flush_tlb_page(struct vm_area_struct *vma, unsigned long va) -{ - struct mm_struct *mm = vma->vm_mm; - unsigned long cpu_mask; - - preempt_disable(); - - cpu_mask = cpus_addr(mm->cpu_vm_mask)[0] & ~(1 << smp_processor_id()); - if (current->active_mm == mm) { - if (current->mm) - __flush_tlb_one(va); - else - voyager_leave_mm(smp_processor_id()); - } - - if (cpu_mask) - voyager_flush_tlb_others(cpu_mask, mm, va); - - preempt_enable(); -} - -EXPORT_SYMBOL(flush_tlb_page); - -/* enable the requested IRQs */ -static void smp_enable_irq_interrupt(void) -{ - __u8 irq; - __u8 cpu = get_cpu(); - - VDEBUG(("VOYAGER SMP: CPU%d enabling irq mask 0x%x\n", cpu, - vic_irq_enable_mask[cpu])); - - spin_lock(&vic_irq_lock); - for (irq = 0; irq < 16; irq++) { - if (vic_irq_enable_mask[cpu] & (1 << irq)) - enable_local_vic_irq(irq); - } - vic_irq_enable_mask[cpu] = 0; - spin_unlock(&vic_irq_lock); - - put_cpu_no_resched(); -} - -/* - * CPU halt call-back - */ -static void smp_stop_cpu_function(void *dummy) -{ - VDEBUG(("VOYAGER SMP: CPU%d is STOPPING\n", smp_processor_id())); - cpu_clear(smp_processor_id(), cpu_online_map); - local_irq_disable(); - for (;;) - halt(); -} - -/* execute a thread on a new CPU. The function to be called must be - * previously set up. This is used to schedule a function for - * execution on all CPUs - set up the function then broadcast a - * function_interrupt CPI to come here on each CPU */ -static void smp_call_function_interrupt(void) -{ - irq_enter(); - generic_smp_call_function_interrupt(); - __get_cpu_var(irq_stat).irq_call_count++; - irq_exit(); -} - -static void smp_call_function_single_interrupt(void) -{ - irq_enter(); - generic_smp_call_function_single_interrupt(); - __get_cpu_var(irq_stat).irq_call_count++; - irq_exit(); -} - -/* Sorry about the name. In an APIC based system, the APICs - * themselves are programmed to send a timer interrupt. This is used - * by linux to reschedule the processor. Voyager doesn't have this, - * so we use the system clock to interrupt one processor, which in - * turn, broadcasts a timer CPI to all the others --- we receive that - * CPI here. We don't use this actually for counting so losing - * ticks doesn't matter - * - * FIXME: For those CPUs which actually have a local APIC, we could - * try to use it to trigger this interrupt instead of having to - * broadcast the timer tick. Unfortunately, all my pentium DYADs have - * no local APIC, so I can't do this - * - * This function is currently a placeholder and is unused in the code */ -void smp_apic_timer_interrupt(struct pt_regs *regs) -{ - struct pt_regs *old_regs = set_irq_regs(regs); - wrapper_smp_local_timer_interrupt(); - set_irq_regs(old_regs); -} - -/* All of the QUAD interrupt GATES */ -void smp_qic_timer_interrupt(struct pt_regs *regs) -{ - struct pt_regs *old_regs = set_irq_regs(regs); - ack_QIC_CPI(QIC_TIMER_CPI); - wrapper_smp_local_timer_interrupt(); - set_irq_regs(old_regs); -} - -void smp_qic_invalidate_interrupt(struct pt_regs *regs) -{ - ack_QIC_CPI(QIC_INVALIDATE_CPI); - smp_invalidate_interrupt(); -} - -void smp_qic_reschedule_interrupt(struct pt_regs *regs) -{ - ack_QIC_CPI(QIC_RESCHEDULE_CPI); - smp_reschedule_interrupt(); -} - -void smp_qic_enable_irq_interrupt(struct pt_regs *regs) -{ - ack_QIC_CPI(QIC_ENABLE_IRQ_CPI); - smp_enable_irq_interrupt(); -} - -void smp_qic_call_function_interrupt(struct pt_regs *regs) -{ - ack_QIC_CPI(QIC_CALL_FUNCTION_CPI); - smp_call_function_interrupt(); -} - -void smp_qic_call_function_single_interrupt(struct pt_regs *regs) -{ - ack_QIC_CPI(QIC_CALL_FUNCTION_SINGLE_CPI); - smp_call_function_single_interrupt(); -} - -void smp_vic_cpi_interrupt(struct pt_regs *regs) -{ - struct pt_regs *old_regs = set_irq_regs(regs); - __u8 cpu = smp_processor_id(); - - if (is_cpu_quad()) - ack_QIC_CPI(VIC_CPI_LEVEL0); - else - ack_VIC_CPI(VIC_CPI_LEVEL0); - - if (test_and_clear_bit(VIC_TIMER_CPI, &vic_cpi_mailbox[cpu])) - wrapper_smp_local_timer_interrupt(); - if (test_and_clear_bit(VIC_INVALIDATE_CPI, &vic_cpi_mailbox[cpu])) - smp_invalidate_interrupt(); - if (test_and_clear_bit(VIC_RESCHEDULE_CPI, &vic_cpi_mailbox[cpu])) - smp_reschedule_interrupt(); - if (test_and_clear_bit(VIC_ENABLE_IRQ_CPI, &vic_cpi_mailbox[cpu])) - smp_enable_irq_interrupt(); - if (test_and_clear_bit(VIC_CALL_FUNCTION_CPI, &vic_cpi_mailbox[cpu])) - smp_call_function_interrupt(); - if (test_and_clear_bit(VIC_CALL_FUNCTION_SINGLE_CPI, &vic_cpi_mailbox[cpu])) - smp_call_function_single_interrupt(); - set_irq_regs(old_regs); -} - -static void do_flush_tlb_all(void *info) -{ - unsigned long cpu = smp_processor_id(); - - __flush_tlb_all(); - if (per_cpu(cpu_tlbstate, cpu).state == TLBSTATE_LAZY) - voyager_leave_mm(cpu); -} - -/* flush the TLB of every active CPU in the system */ -void flush_tlb_all(void) -{ - on_each_cpu(do_flush_tlb_all, 0, 1); -} - -/* send a reschedule CPI to one CPU by physical CPU number*/ -static void voyager_smp_send_reschedule(int cpu) -{ - send_one_CPI(cpu, VIC_RESCHEDULE_CPI); -} - -int hard_smp_processor_id(void) -{ - __u8 i; - __u8 cpumask = inb(VIC_PROC_WHO_AM_I); - if ((cpumask & QUAD_IDENTIFIER) == QUAD_IDENTIFIER) - return cpumask & 0x1F; - - for (i = 0; i < 8; i++) { - if (cpumask & (1 << i)) - return i; - } - printk("** WARNING ** Illegal cpuid returned by VIC: %d", cpumask); - return 0; -} - -int safe_smp_processor_id(void) -{ - return hard_smp_processor_id(); -} - -/* broadcast a halt to all other CPUs */ -static void voyager_smp_send_stop(void) -{ - smp_call_function(smp_stop_cpu_function, NULL, 1); -} - -/* this function is triggered in time.c when a clock tick fires - * we need to re-broadcast the tick to all CPUs */ -void smp_vic_timer_interrupt(void) -{ - send_CPI_allbutself(VIC_TIMER_CPI); - smp_local_timer_interrupt(); -} - -/* local (per CPU) timer interrupt. It does both profiling and - * process statistics/rescheduling. - * - * We do profiling in every local tick, statistics/rescheduling - * happen only every 'profiling multiplier' ticks. The default - * multiplier is 1 and it can be changed by writing the new multiplier - * value into /proc/profile. - */ -void smp_local_timer_interrupt(void) -{ - int cpu = smp_processor_id(); - long weight; - - profile_tick(CPU_PROFILING); - if (--per_cpu(prof_counter, cpu) <= 0) { - /* - * The multiplier may have changed since the last time we got - * to this point as a result of the user writing to - * /proc/profile. In this case we need to adjust the APIC - * timer accordingly. - * - * Interrupts are already masked off at this point. - */ - per_cpu(prof_counter, cpu) = per_cpu(prof_multiplier, cpu); - if (per_cpu(prof_counter, cpu) != - per_cpu(prof_old_multiplier, cpu)) { - /* FIXME: need to update the vic timer tick here */ - per_cpu(prof_old_multiplier, cpu) = - per_cpu(prof_counter, cpu); - } - - update_process_times(user_mode_vm(get_irq_regs())); - } - - if (((1 << cpu) & voyager_extended_vic_processors) == 0) - /* only extended VIC processors participate in - * interrupt distribution */ - return; - - /* - * We take the 'long' return path, and there every subsystem - * grabs the appropriate locks (kernel lock/ irq lock). - * - * we might want to decouple profiling from the 'long path', - * and do the profiling totally in assembly. - * - * Currently this isn't too much of an issue (performance wise), - * we can take more than 100K local irqs per second on a 100 MHz P5. - */ - - if ((++vic_tick[cpu] & 0x7) != 0) - return; - /* get here every 16 ticks (about every 1/6 of a second) */ - - /* Change our priority to give someone else a chance at getting - * the IRQ. The algorithm goes like this: - * - * In the VIC, the dynamically routed interrupt is always - * handled by the lowest priority eligible (i.e. receiving - * interrupts) CPU. If >1 eligible CPUs are equal lowest, the - * lowest processor number gets it. - * - * The priority of a CPU is controlled by a special per-CPU - * VIC priority register which is 3 bits wide 0 being lowest - * and 7 highest priority.. - * - * Therefore we subtract the average number of interrupts from - * the number we've fielded. If this number is negative, we - * lower the activity count and if it is positive, we raise - * it. - * - * I'm afraid this still leads to odd looking interrupt counts: - * the totals are all roughly equal, but the individual ones - * look rather skewed. - * - * FIXME: This algorithm is total crap when mixed with SMP - * affinity code since we now try to even up the interrupt - * counts when an affinity binding is keeping them on a - * particular CPU*/ - weight = (vic_intr_count[cpu] * voyager_extended_cpus - - vic_intr_total) >> 4; - weight += 4; - if (weight > 7) - weight = 7; - if (weight < 0) - weight = 0; - - outb((__u8) weight, VIC_PRIORITY_REGISTER); - -#ifdef VOYAGER_DEBUG - if ((vic_tick[cpu] & 0xFFF) == 0) { - /* print this message roughly every 25 secs */ - printk("VOYAGER SMP: vic_tick[%d] = %lu, weight = %ld\n", - cpu, vic_tick[cpu], weight); - } -#endif -} - -/* setup the profiling timer */ -int setup_profiling_timer(unsigned int multiplier) -{ - int i; - - if ((!multiplier)) - return -EINVAL; - - /* - * Set the new multiplier for each CPU. CPUs don't start using the - * new values until the next timer interrupt in which they do process - * accounting. - */ - for (i = 0; i < nr_cpu_ids; ++i) - per_cpu(prof_multiplier, i) = multiplier; - - return 0; -} - -/* This is a bit of a mess, but forced on us by the genirq changes - * there's no genirq handler that really does what voyager wants - * so hack it up with the simple IRQ handler */ -static void handle_vic_irq(unsigned int irq, struct irq_desc *desc) -{ - before_handle_vic_irq(irq); - handle_simple_irq(irq, desc); - after_handle_vic_irq(irq); -} - -/* The CPIs are handled in the per cpu 8259s, so they must be - * enabled to be received: FIX: enabling the CPIs in the early - * boot sequence interferes with bug checking; enable them later - * on in smp_init */ -#define VIC_SET_GATE(cpi, vector) \ - set_intr_gate((cpi) + VIC_DEFAULT_CPI_BASE, (vector)) -#define QIC_SET_GATE(cpi, vector) \ - set_intr_gate((cpi) + QIC_DEFAULT_CPI_BASE, (vector)) - -void __init voyager_smp_intr_init(void) -{ - int i; - - /* initialize the per cpu irq mask to all disabled */ - for (i = 0; i < nr_cpu_ids; i++) - vic_irq_mask[i] = 0xFFFF; - - VIC_SET_GATE(VIC_CPI_LEVEL0, vic_cpi_interrupt); - - VIC_SET_GATE(VIC_SYS_INT, vic_sys_interrupt); - VIC_SET_GATE(VIC_CMN_INT, vic_cmn_interrupt); - - QIC_SET_GATE(QIC_TIMER_CPI, qic_timer_interrupt); - QIC_SET_GATE(QIC_INVALIDATE_CPI, qic_invalidate_interrupt); - QIC_SET_GATE(QIC_RESCHEDULE_CPI, qic_reschedule_interrupt); - QIC_SET_GATE(QIC_ENABLE_IRQ_CPI, qic_enable_irq_interrupt); - QIC_SET_GATE(QIC_CALL_FUNCTION_CPI, qic_call_function_interrupt); - - /* now put the VIC descriptor into the first 48 IRQs - * - * This is for later: first 16 correspond to PC IRQs; next 16 - * are Primary MC IRQs and final 16 are Secondary MC IRQs */ - for (i = 0; i < 48; i++) - set_irq_chip_and_handler(i, &vic_chip, handle_vic_irq); -} - -/* send a CPI at level cpi to a set of cpus in cpuset (set 1 bit per - * processor to receive CPI */ -static void send_CPI(__u32 cpuset, __u8 cpi) -{ - int cpu; - __u32 quad_cpuset = (cpuset & voyager_quad_processors); - - if (cpi < VIC_START_FAKE_CPI) { - /* fake CPI are only used for booting, so send to the - * extended quads as well---Quads must be VIC booted */ - outb((__u8) (cpuset), VIC_CPI_Registers[cpi]); - return; - } - if (quad_cpuset) - send_QIC_CPI(quad_cpuset, cpi); - cpuset &= ~quad_cpuset; - cpuset &= 0xff; /* only first 8 CPUs vaild for VIC CPI */ - if (cpuset == 0) - return; - for_each_online_cpu(cpu) { - if (cpuset & (1 << cpu)) - set_bit(cpi, &vic_cpi_mailbox[cpu]); - } - if (cpuset) - outb((__u8) cpuset, VIC_CPI_Registers[VIC_CPI_LEVEL0]); -} - -/* Acknowledge receipt of CPI in the QIC, clear in QIC hardware and - * set the cache line to shared by reading it. - * - * DON'T make this inline otherwise the cache line read will be - * optimised away - * */ -static int ack_QIC_CPI(__u8 cpi) -{ - __u8 cpu = hard_smp_processor_id(); - - cpi &= 7; - - outb(1 << cpi, QIC_INTERRUPT_CLEAR1); - return voyager_quad_cpi_addr[cpu]->qic_cpi[cpi].cpi; -} - -static void ack_special_QIC_CPI(__u8 cpi) -{ - switch (cpi) { - case VIC_CMN_INT: - outb(QIC_CMN_INT, QIC_INTERRUPT_CLEAR0); - break; - case VIC_SYS_INT: - outb(QIC_SYS_INT, QIC_INTERRUPT_CLEAR0); - break; - } - /* also clear at the VIC, just in case (nop for non-extended proc) */ - ack_VIC_CPI(cpi); -} - -/* Acknowledge receipt of CPI in the VIC (essentially an EOI) */ -static void ack_VIC_CPI(__u8 cpi) -{ -#ifdef VOYAGER_DEBUG - unsigned long flags; - __u16 isr; - __u8 cpu = smp_processor_id(); - - local_irq_save(flags); - isr = vic_read_isr(); - if ((isr & (1 << (cpi & 7))) == 0) { - printk("VOYAGER SMP: CPU%d lost CPI%d\n", cpu, cpi); - } -#endif - /* send specific EOI; the two system interrupts have - * bit 4 set for a separate vector but behave as the - * corresponding 3 bit intr */ - outb_p(0x60 | (cpi & 7), 0x20); - -#ifdef VOYAGER_DEBUG - if ((vic_read_isr() & (1 << (cpi & 7))) != 0) { - printk("VOYAGER SMP: CPU%d still asserting CPI%d\n", cpu, cpi); - } - local_irq_restore(flags); -#endif -} - -/* cribbed with thanks from irq.c */ -#define __byte(x,y) (((unsigned char *)&(y))[x]) -#define cached_21(cpu) (__byte(0,vic_irq_mask[cpu])) -#define cached_A1(cpu) (__byte(1,vic_irq_mask[cpu])) - -static unsigned int startup_vic_irq(unsigned int irq) -{ - unmask_vic_irq(irq); - - return 0; -} - -/* The enable and disable routines. This is where we run into - * conflicting architectural philosophy. Fundamentally, the voyager - * architecture does not expect to have to disable interrupts globally - * (the IRQ controllers belong to each CPU). The processor masquerade - * which is used to start the system shouldn't be used in a running OS - * since it will cause great confusion if two separate CPUs drive to - * the same IRQ controller (I know, I've tried it). - * - * The solution is a variant on the NCR lazy SPL design: - * - * 1) To disable an interrupt, do nothing (other than set the - * IRQ_DISABLED flag). This dares the interrupt actually to arrive. - * - * 2) If the interrupt dares to come in, raise the local mask against - * it (this will result in all the CPU masks being raised - * eventually). - * - * 3) To enable the interrupt, lower the mask on the local CPU and - * broadcast an Interrupt enable CPI which causes all other CPUs to - * adjust their masks accordingly. */ - -static void unmask_vic_irq(unsigned int irq) -{ - /* linux doesn't to processor-irq affinity, so enable on - * all CPUs we know about */ - int cpu = smp_processor_id(), real_cpu; - __u16 mask = (1 << irq); - __u32 processorList = 0; - unsigned long flags; - - VDEBUG(("VOYAGER: unmask_vic_irq(%d) CPU%d affinity 0x%lx\n", - irq, cpu, cpu_irq_affinity[cpu])); - spin_lock_irqsave(&vic_irq_lock, flags); - for_each_online_cpu(real_cpu) { - if (!(voyager_extended_vic_processors & (1 << real_cpu))) - continue; - if (!(cpu_irq_affinity[real_cpu] & mask)) { - /* irq has no affinity for this CPU, ignore */ - continue; - } - if (real_cpu == cpu) { - enable_local_vic_irq(irq); - } else if (vic_irq_mask[real_cpu] & mask) { - vic_irq_enable_mask[real_cpu] |= mask; - processorList |= (1 << real_cpu); - } - } - spin_unlock_irqrestore(&vic_irq_lock, flags); - if (processorList) - send_CPI(processorList, VIC_ENABLE_IRQ_CPI); -} - -static void mask_vic_irq(unsigned int irq) -{ - /* lazy disable, do nothing */ -} - -static void enable_local_vic_irq(unsigned int irq) -{ - __u8 cpu = smp_processor_id(); - __u16 mask = ~(1 << irq); - __u16 old_mask = vic_irq_mask[cpu]; - - vic_irq_mask[cpu] &= mask; - if (vic_irq_mask[cpu] == old_mask) - return; - - VDEBUG(("VOYAGER DEBUG: Enabling irq %d in hardware on CPU %d\n", - irq, cpu)); - - if (irq & 8) { - outb_p(cached_A1(cpu), 0xA1); - (void)inb_p(0xA1); - } else { - outb_p(cached_21(cpu), 0x21); - (void)inb_p(0x21); - } -} - -static void disable_local_vic_irq(unsigned int irq) -{ - __u8 cpu = smp_processor_id(); - __u16 mask = (1 << irq); - __u16 old_mask = vic_irq_mask[cpu]; - - if (irq == 7) - return; - - vic_irq_mask[cpu] |= mask; - if (old_mask == vic_irq_mask[cpu]) - return; - - VDEBUG(("VOYAGER DEBUG: Disabling irq %d in hardware on CPU %d\n", - irq, cpu)); - - if (irq & 8) { - outb_p(cached_A1(cpu), 0xA1); - (void)inb_p(0xA1); - } else { - outb_p(cached_21(cpu), 0x21); - (void)inb_p(0x21); - } -} - -/* The VIC is level triggered, so the ack can only be issued after the - * interrupt completes. However, we do Voyager lazy interrupt - * handling here: It is an extremely expensive operation to mask an - * interrupt in the vic, so we merely set a flag (IRQ_DISABLED). If - * this interrupt actually comes in, then we mask and ack here to push - * the interrupt off to another CPU */ -static void before_handle_vic_irq(unsigned int irq) -{ - irq_desc_t *desc = irq_to_desc(irq); - __u8 cpu = smp_processor_id(); - - _raw_spin_lock(&vic_irq_lock); - vic_intr_total++; - vic_intr_count[cpu]++; - - if (!(cpu_irq_affinity[cpu] & (1 << irq))) { - /* The irq is not in our affinity mask, push it off - * onto another CPU */ - VDEBUG(("VOYAGER DEBUG: affinity triggered disable of irq %d " - "on cpu %d\n", irq, cpu)); - disable_local_vic_irq(irq); - /* set IRQ_INPROGRESS to prevent the handler in irq.c from - * actually calling the interrupt routine */ - desc->status |= IRQ_REPLAY | IRQ_INPROGRESS; - } else if (desc->status & IRQ_DISABLED) { - /* Damn, the interrupt actually arrived, do the lazy - * disable thing. The interrupt routine in irq.c will - * not handle a IRQ_DISABLED interrupt, so nothing more - * need be done here */ - VDEBUG(("VOYAGER DEBUG: lazy disable of irq %d on CPU %d\n", - irq, cpu)); - disable_local_vic_irq(irq); - desc->status |= IRQ_REPLAY; - } else { - desc->status &= ~IRQ_REPLAY; - } - - _raw_spin_unlock(&vic_irq_lock); -} - -/* Finish the VIC interrupt: basically mask */ -static void after_handle_vic_irq(unsigned int irq) -{ - irq_desc_t *desc = irq_to_desc(irq); - - _raw_spin_lock(&vic_irq_lock); - { - unsigned int status = desc->status & ~IRQ_INPROGRESS; -#ifdef VOYAGER_DEBUG - __u16 isr; -#endif - - desc->status = status; - if ((status & IRQ_DISABLED)) - disable_local_vic_irq(irq); -#ifdef VOYAGER_DEBUG - /* DEBUG: before we ack, check what's in progress */ - isr = vic_read_isr(); - if ((isr & (1 << irq) && !(status & IRQ_REPLAY)) == 0) { - int i; - __u8 cpu = smp_processor_id(); - __u8 real_cpu; - int mask; /* Um... initialize me??? --RR */ - - printk("VOYAGER SMP: CPU%d lost interrupt %d\n", - cpu, irq); - for_each_possible_cpu(real_cpu, mask) { - - outb(VIC_CPU_MASQUERADE_ENABLE | real_cpu, - VIC_PROCESSOR_ID); - isr = vic_read_isr(); - if (isr & (1 << irq)) { - printk - ("VOYAGER SMP: CPU%d ack irq %d\n", - real_cpu, irq); - ack_vic_irq(irq); - } - outb(cpu, VIC_PROCESSOR_ID); - } - } -#endif /* VOYAGER_DEBUG */ - /* as soon as we ack, the interrupt is eligible for - * receipt by another CPU so everything must be in - * order here */ - ack_vic_irq(irq); - if (status & IRQ_REPLAY) { - /* replay is set if we disable the interrupt - * in the before_handle_vic_irq() routine, so - * clear the in progress bit here to allow the - * next CPU to handle this correctly */ - desc->status &= ~(IRQ_REPLAY | IRQ_INPROGRESS); - } -#ifdef VOYAGER_DEBUG - isr = vic_read_isr(); - if ((isr & (1 << irq)) != 0) - printk("VOYAGER SMP: after_handle_vic_irq() after " - "ack irq=%d, isr=0x%x\n", irq, isr); -#endif /* VOYAGER_DEBUG */ - } - _raw_spin_unlock(&vic_irq_lock); - - /* All code after this point is out of the main path - the IRQ - * may be intercepted by another CPU if reasserted */ -} - -/* Linux processor - interrupt affinity manipulations. - * - * For each processor, we maintain a 32 bit irq affinity mask. - * Initially it is set to all 1's so every processor accepts every - * interrupt. In this call, we change the processor's affinity mask: - * - * Change from enable to disable: - * - * If the interrupt ever comes in to the processor, we will disable it - * and ack it to push it off to another CPU, so just accept the mask here. - * - * Change from disable to enable: - * - * change the mask and then do an interrupt enable CPI to re-enable on - * the selected processors */ - -void set_vic_irq_affinity(unsigned int irq, const struct cpumask *mask) -{ - /* Only extended processors handle interrupts */ - unsigned long real_mask; - unsigned long irq_mask = 1 << irq; - int cpu; - - real_mask = cpus_addr(*mask)[0] & voyager_extended_vic_processors; - - if (cpus_addr(*mask)[0] == 0) - /* can't have no CPUs to accept the interrupt -- extremely - * bad things will happen */ - return; - - if (irq == 0) - /* can't change the affinity of the timer IRQ. This - * is due to the constraint in the voyager - * architecture that the CPI also comes in on and IRQ - * line and we have chosen IRQ0 for this. If you - * raise the mask on this interrupt, the processor - * will no-longer be able to accept VIC CPIs */ - return; - - if (irq >= 32) - /* You can only have 32 interrupts in a voyager system - * (and 32 only if you have a secondary microchannel - * bus) */ - return; - - for_each_online_cpu(cpu) { - unsigned long cpu_mask = 1 << cpu; - - if (cpu_mask & real_mask) { - /* enable the interrupt for this cpu */ - cpu_irq_affinity[cpu] |= irq_mask; - } else { - /* disable the interrupt for this cpu */ - cpu_irq_affinity[cpu] &= ~irq_mask; - } - } - /* this is magic, we now have the correct affinity maps, so - * enable the interrupt. This will send an enable CPI to - * those CPUs who need to enable it in their local masks, - * causing them to correct for the new affinity . If the - * interrupt is currently globally disabled, it will simply be - * disabled again as it comes in (voyager lazy disable). If - * the affinity map is tightened to disable the interrupt on a - * cpu, it will be pushed off when it comes in */ - unmask_vic_irq(irq); -} - -static void ack_vic_irq(unsigned int irq) -{ - if (irq & 8) { - outb(0x62, 0x20); /* Specific EOI to cascade */ - outb(0x60 | (irq & 7), 0xA0); - } else { - outb(0x60 | (irq & 7), 0x20); - } -} - -/* enable the CPIs. In the VIC, the CPIs are delivered by the 8259 - * but are not vectored by it. This means that the 8259 mask must be - * lowered to receive them */ -static __init void vic_enable_cpi(void) -{ - __u8 cpu = smp_processor_id(); - - /* just take a copy of the current mask (nop for boot cpu) */ - vic_irq_mask[cpu] = vic_irq_mask[boot_cpu_id]; - - enable_local_vic_irq(VIC_CPI_LEVEL0); - enable_local_vic_irq(VIC_CPI_LEVEL1); - /* for sys int and cmn int */ - enable_local_vic_irq(7); - - if (is_cpu_quad()) { - outb(QIC_DEFAULT_MASK0, QIC_MASK_REGISTER0); - outb(QIC_CPI_ENABLE, QIC_MASK_REGISTER1); - VDEBUG(("VOYAGER SMP: QIC ENABLE CPI: CPU%d: MASK 0x%x\n", - cpu, QIC_CPI_ENABLE)); - } - - VDEBUG(("VOYAGER SMP: ENABLE CPI: CPU%d: MASK 0x%x\n", - cpu, vic_irq_mask[cpu])); -} - -void voyager_smp_dump() -{ - int old_cpu = smp_processor_id(), cpu; - - /* dump the interrupt masks of each processor */ - for_each_online_cpu(cpu) { - __u16 imr, isr, irr; - unsigned long flags; - - local_irq_save(flags); - outb(VIC_CPU_MASQUERADE_ENABLE | cpu, VIC_PROCESSOR_ID); - imr = (inb(0xa1) << 8) | inb(0x21); - outb(0x0a, 0xa0); - irr = inb(0xa0) << 8; - outb(0x0a, 0x20); - irr |= inb(0x20); - outb(0x0b, 0xa0); - isr = inb(0xa0) << 8; - outb(0x0b, 0x20); - isr |= inb(0x20); - outb(old_cpu, VIC_PROCESSOR_ID); - local_irq_restore(flags); - printk("\tCPU%d: mask=0x%x, IMR=0x%x, IRR=0x%x, ISR=0x%x\n", - cpu, vic_irq_mask[cpu], imr, irr, isr); -#if 0 - /* These lines are put in to try to unstick an un ack'd irq */ - if (isr != 0) { - int irq; - for (irq = 0; irq < 16; irq++) { - if (isr & (1 << irq)) { - printk("\tCPU%d: ack irq %d\n", - cpu, irq); - local_irq_save(flags); - outb(VIC_CPU_MASQUERADE_ENABLE | cpu, - VIC_PROCESSOR_ID); - ack_vic_irq(irq); - outb(old_cpu, VIC_PROCESSOR_ID); - local_irq_restore(flags); - } - } - } -#endif - } -} - -void smp_voyager_power_off(void *dummy) -{ - if (smp_processor_id() == boot_cpu_id) - voyager_power_off(); - else - smp_stop_cpu_function(NULL); -} - -static void __init voyager_smp_prepare_cpus(unsigned int max_cpus) -{ - /* FIXME: ignore max_cpus for now */ - smp_boot_cpus(); -} - -static void __cpuinit voyager_smp_prepare_boot_cpu(void) -{ - init_gdt(smp_processor_id()); - switch_to_new_gdt(); - - cpu_online_map = cpumask_of_cpu(smp_processor_id()); - cpu_callout_map = cpumask_of_cpu(smp_processor_id()); - cpu_callin_map = CPU_MASK_NONE; - cpu_present_map = cpumask_of_cpu(smp_processor_id()); - -} - -static int __cpuinit voyager_cpu_up(unsigned int cpu) -{ - /* This only works at boot for x86. See "rewrite" above. */ - if (cpu_isset(cpu, smp_commenced_mask)) - return -ENOSYS; - - /* In case one didn't come up */ - if (!cpu_isset(cpu, cpu_callin_map)) - return -EIO; - /* Unleash the CPU! */ - cpu_set(cpu, smp_commenced_mask); - while (!cpu_online(cpu)) - mb(); - return 0; -} - -static void __init voyager_smp_cpus_done(unsigned int max_cpus) -{ - zap_low_mappings(); -} - -void __init smp_setup_processor_id(void) -{ - current_thread_info()->cpu = hard_smp_processor_id(); - x86_write_percpu(cpu_number, hard_smp_processor_id()); -} - -static void voyager_send_call_func(const struct cpumask *callmask) -{ - __u32 mask = cpus_addr(*callmask)[0] & ~(1 << smp_processor_id()); - send_CPI(mask, VIC_CALL_FUNCTION_CPI); -} - -static void voyager_send_call_func_single(int cpu) -{ - send_CPI(1 << cpu, VIC_CALL_FUNCTION_SINGLE_CPI); -} - -struct smp_ops smp_ops = { - .smp_prepare_boot_cpu = voyager_smp_prepare_boot_cpu, - .smp_prepare_cpus = voyager_smp_prepare_cpus, - .cpu_up = voyager_cpu_up, - .smp_cpus_done = voyager_smp_cpus_done, - - .smp_send_stop = voyager_smp_send_stop, - .smp_send_reschedule = voyager_smp_send_reschedule, - - .send_call_func_ipi = voyager_send_call_func, - .send_call_func_single_ipi = voyager_send_call_func_single, -}; Index: linux-2.6-tip/arch/x86/mach-voyager/voyager_thread.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mach-voyager/voyager_thread.c +++ /dev/null @@ -1,128 +0,0 @@ -/* -*- mode: c; c-basic-offset: 8 -*- */ - -/* Copyright (C) 2001 - * - * Author: J.E.J.Bottomley@HansenPartnership.com - * - * This module provides the machine status monitor thread for the - * voyager architecture. This allows us to monitor the machine - * environment (temp, voltage, fan function) and the front panel and - * internal UPS. If a fault is detected, this thread takes corrective - * action (usually just informing init) - * */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -struct task_struct *voyager_thread; -static __u8 set_timeout; - -static int execute(const char *string) -{ - int ret; - - char *envp[] = { - "HOME=/", - "TERM=linux", - "PATH=/sbin:/usr/sbin:/bin:/usr/bin", - NULL, - }; - char *argv[] = { - "/bin/bash", - "-c", - (char *)string, - NULL, - }; - - if ((ret = - call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC)) != 0) { - printk(KERN_ERR "Voyager failed to run \"%s\": %i\n", string, - ret); - } - return ret; -} - -static void check_from_kernel(void) -{ - if (voyager_status.switch_off) { - - /* FIXME: This should be configurable via proc */ - execute("umask 600; echo 0 > /etc/initrunlvl; kill -HUP 1"); - } else if (voyager_status.power_fail) { - VDEBUG(("Voyager daemon detected AC power failure\n")); - - /* FIXME: This should be configureable via proc */ - execute("umask 600; echo F > /etc/powerstatus; kill -PWR 1"); - set_timeout = 1; - } -} - -static void check_continuing_condition(void) -{ - if (voyager_status.power_fail) { - __u8 data; - voyager_cat_psi(VOYAGER_PSI_SUBREAD, - VOYAGER_PSI_AC_FAIL_REG, &data); - if ((data & 0x1f) == 0) { - /* all power restored */ - printk(KERN_NOTICE - "VOYAGER AC power restored, cancelling shutdown\n"); - /* FIXME: should be user configureable */ - execute - ("umask 600; echo O > /etc/powerstatus; kill -PWR 1"); - set_timeout = 0; - } - } -} - -static int thread(void *unused) -{ - printk(KERN_NOTICE "Voyager starting monitor thread\n"); - - for (;;) { - set_current_state(TASK_INTERRUPTIBLE); - schedule_timeout(set_timeout ? HZ : MAX_SCHEDULE_TIMEOUT); - - VDEBUG(("Voyager Daemon awoken\n")); - if (voyager_status.request_from_kernel == 0) { - /* probably awoken from timeout */ - check_continuing_condition(); - } else { - check_from_kernel(); - voyager_status.request_from_kernel = 0; - } - } -} - -static int __init voyager_thread_start(void) -{ - voyager_thread = kthread_run(thread, NULL, "kvoyagerd"); - if (IS_ERR(voyager_thread)) { - printk(KERN_ERR - "Voyager: Failed to create system monitor thread.\n"); - return PTR_ERR(voyager_thread); - } - return 0; -} - -static void __exit voyager_thread_stop(void) -{ - kthread_stop(voyager_thread); -} - -module_init(voyager_thread_start); -module_exit(voyager_thread_stop); Index: linux-2.6-tip/arch/x86/math-emu/get_address.c =================================================================== --- linux-2.6-tip.orig/arch/x86/math-emu/get_address.c +++ linux-2.6-tip/arch/x86/math-emu/get_address.c @@ -150,11 +150,9 @@ static long pm_address(u_char FPU_modrm, #endif /* PARANOID */ switch (segment) { - /* gs isn't used by the kernel, so it still has its - user-space value. */ case PREFIX_GS_ - 1: - /* N.B. - movl %seg, mem is a 2 byte write regardless of prefix */ - savesegment(gs, addr->selector); + /* user gs handling can be lazy, use special accessors */ + addr->selector = get_user_gs(FPU_info->regs); break; default: addr->selector = PM_REG_(segment); Index: linux-2.6-tip/arch/x86/mm/Makefile =================================================================== --- linux-2.6-tip.orig/arch/x86/mm/Makefile +++ linux-2.6-tip/arch/x86/mm/Makefile @@ -1,6 +1,8 @@ obj-y := init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \ pat.o pgtable.o gup.o +obj-$(CONFIG_SMP) += tlb.o + obj-$(CONFIG_X86_32) += pgtable_32.o iomap_32.o obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o @@ -8,6 +10,8 @@ obj-$(CONFIG_X86_PTDUMP) += dump_pagetab obj-$(CONFIG_HIGHMEM) += highmem_32.o +obj-$(CONFIG_KMEMCHECK) += kmemcheck/ + obj-$(CONFIG_MMIOTRACE) += mmiotrace.o mmiotrace-y := kmmio.o pf_in.o mmio-mod.o obj-$(CONFIG_MMIOTRACE_TEST) += testmmiotrace.o Index: linux-2.6-tip/arch/x86/mm/extable.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mm/extable.c +++ linux-2.6-tip/arch/x86/mm/extable.c @@ -23,6 +23,12 @@ int fixup_exception(struct pt_regs *regs fixup = search_exception_tables(regs->ip); if (fixup) { + /* If fixup is less than 16, it means uaccess error */ + if (fixup->fixup < 16) { + current_thread_info()->uaccess_err = -EFAULT; + regs->ip += fixup->fixup; + return 1; + } regs->ip = fixup->fixup; return 1; } Index: linux-2.6-tip/arch/x86/mm/fault.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mm/fault.c +++ linux-2.6-tip/arch/x86/mm/fault.c @@ -1,73 +1,80 @@ /* * Copyright (C) 1995 Linus Torvalds - * Copyright (C) 2001,2002 Andi Kleen, SuSE Labs. + * Copyright (C) 2001, 2002 Andi Kleen, SuSE Labs. + * Copyright (C) 2008-2009, Red Hat Inc., Ingo Molnar */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include #include -#include -#include -#include /* For unblank_screen() */ +#include +#include #include #include -#include /* for max_low_pfn */ -#include -#include #include #include +#include +#include +#include +#include +#include +#include +#include #include +#include +#include +#include +#include +#include +#include +#include +#include +#include -#include -#include -#include -#include -#include +#include + +#include #include +#include +#include +#include #include -#include #include +#include /* - * Page fault error code bits - * bit 0 == 0 means no page found, 1 means protection fault - * bit 1 == 0 means read, 1 means write - * bit 2 == 0 means kernel, 1 means user-mode - * bit 3 == 1 means use of reserved bit detected - * bit 4 == 1 means fault was an instruction fetch - */ -#define PF_PROT (1<<0) -#define PF_WRITE (1<<1) -#define PF_USER (1<<2) -#define PF_RSVD (1<<3) -#define PF_INSTR (1<<4) + * Page fault error code bits: + * + * bit 0 == 0: no page found 1: protection fault + * bit 1 == 0: read access 1: write access + * bit 2 == 0: kernel-mode access 1: user-mode access + * bit 3 == 1: use of reserved bit detected + * bit 4 == 1: fault was an instruction fetch + */ +enum x86_pf_error_code { + + PF_PROT = 1 << 0, + PF_WRITE = 1 << 1, + PF_USER = 1 << 2, + PF_RSVD = 1 << 3, + PF_INSTR = 1 << 4, +}; +/* + * Returns 0 if mmiotrace is disabled, or if the fault is not + * handled by mmiotrace: + */ static inline int kmmio_fault(struct pt_regs *regs, unsigned long addr) { -#ifdef CONFIG_MMIOTRACE if (unlikely(is_kmmio_active())) if (kmmio_handler(regs, addr) == 1) return -1; -#endif return 0; } static inline int notify_page_fault(struct pt_regs *regs) { -#ifdef CONFIG_KPROBES int ret = 0; /* kprobe_running() needs smp_processor_id() */ - if (!user_mode_vm(regs)) { + if (kprobes_built_in() && !user_mode_vm(regs)) { preempt_disable(); if (kprobe_running() && kprobe_fault_handler(regs, 14)) ret = 1; @@ -75,29 +82,76 @@ static inline int notify_page_fault(stru } return ret; -#else - return 0; -#endif } /* - * X86_32 - * Sometimes AMD Athlon/Opteron CPUs report invalid exceptions on prefetch. - * Check that here and ignore it. - * - * X86_64 - * Sometimes the CPU reports invalid exceptions on prefetch. - * Check that here and ignore it. + * Prefetch quirks: + * + * 32-bit mode: + * + * Sometimes AMD Athlon/Opteron CPUs report invalid exceptions on prefetch. + * Check that here and ignore it. + * + * 64-bit mode: + * + * Sometimes the CPU reports invalid exceptions on prefetch. + * Check that here and ignore it. * - * Opcode checker based on code by Richard Brunner + * Opcode checker based on code by Richard Brunner. */ -static int is_prefetch(struct pt_regs *regs, unsigned long addr, - unsigned long error_code) +static inline int +check_prefetch_opcode(struct pt_regs *regs, unsigned char *instr, + unsigned char opcode, int *prefetch) { + unsigned char instr_hi = opcode & 0xf0; + unsigned char instr_lo = opcode & 0x0f; + + switch (instr_hi) { + case 0x20: + case 0x30: + /* + * Values 0x26,0x2E,0x36,0x3E are valid x86 prefixes. + * In X86_64 long mode, the CPU will signal invalid + * opcode if some of these prefixes are present so + * X86_64 will never get here anyway + */ + return ((instr_lo & 7) == 0x6); +#ifdef CONFIG_X86_64 + case 0x40: + /* + * In AMD64 long mode 0x40..0x4F are valid REX prefixes + * Need to figure out under what instruction mode the + * instruction was issued. Could check the LDT for lm, + * but for now it's good enough to assume that long + * mode only uses well known segments or kernel. + */ + return (!user_mode(regs)) || (regs->cs == __USER_CS); +#endif + case 0x60: + /* 0x64 thru 0x67 are valid prefixes in all modes. */ + return (instr_lo & 0xC) == 0x4; + case 0xF0: + /* 0xF0, 0xF2, 0xF3 are valid prefixes in all modes. */ + return !instr_lo || (instr_lo>>1) == 1; + case 0x00: + /* Prefetch instruction is 0x0F0D or 0x0F18 */ + if (probe_kernel_address(instr, opcode)) + return 0; + + *prefetch = (instr_lo == 0xF) && + (opcode == 0x0D || opcode == 0x18); + return 0; + default: + return 0; + } +} + +static int +is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr) +{ + unsigned char *max_instr; unsigned char *instr; - int scan_more = 1; int prefetch = 0; - unsigned char *max_instr; /* * If it was a exec (instruction fetch) fault on NX page, then @@ -106,106 +160,170 @@ static int is_prefetch(struct pt_regs *r if (error_code & PF_INSTR) return 0; - instr = (unsigned char *)convert_ip_to_linear(current, regs); + instr = (void *)convert_ip_to_linear(current, regs); max_instr = instr + 15; if (user_mode(regs) && instr >= (unsigned char *)TASK_SIZE) return 0; - while (scan_more && instr < max_instr) { + while (instr < max_instr) { unsigned char opcode; - unsigned char instr_hi; - unsigned char instr_lo; if (probe_kernel_address(instr, opcode)) break; - instr_hi = opcode & 0xf0; - instr_lo = opcode & 0x0f; instr++; - switch (instr_hi) { - case 0x20: - case 0x30: - /* - * Values 0x26,0x2E,0x36,0x3E are valid x86 prefixes. - * In X86_64 long mode, the CPU will signal invalid - * opcode if some of these prefixes are present so - * X86_64 will never get here anyway - */ - scan_more = ((instr_lo & 7) == 0x6); - break; -#ifdef CONFIG_X86_64 - case 0x40: - /* - * In AMD64 long mode 0x40..0x4F are valid REX prefixes - * Need to figure out under what instruction mode the - * instruction was issued. Could check the LDT for lm, - * but for now it's good enough to assume that long - * mode only uses well known segments or kernel. - */ - scan_more = (!user_mode(regs)) || (regs->cs == __USER_CS); + if (!check_prefetch_opcode(regs, instr, opcode, &prefetch)) break; -#endif - case 0x60: - /* 0x64 thru 0x67 are valid prefixes in all modes. */ - scan_more = (instr_lo & 0xC) == 0x4; - break; - case 0xF0: - /* 0xF0, 0xF2, 0xF3 are valid prefixes in all modes. */ - scan_more = !instr_lo || (instr_lo>>1) == 1; - break; - case 0x00: - /* Prefetch instruction is 0x0F0D or 0x0F18 */ - scan_more = 0; - - if (probe_kernel_address(instr, opcode)) - break; - prefetch = (instr_lo == 0xF) && - (opcode == 0x0D || opcode == 0x18); - break; - default: - scan_more = 0; - break; - } } return prefetch; } -static void force_sig_info_fault(int si_signo, int si_code, - unsigned long address, struct task_struct *tsk) +static void +force_sig_info_fault(int si_signo, int si_code, unsigned long address, + struct task_struct *tsk) { siginfo_t info; - info.si_signo = si_signo; - info.si_errno = 0; - info.si_code = si_code; - info.si_addr = (void __user *)address; + info.si_signo = si_signo; + info.si_errno = 0; + info.si_code = si_code; + info.si_addr = (void __user *)address; + force_sig_info(si_signo, &info, tsk); } -#ifdef CONFIG_X86_64 -static int bad_address(void *p) +DEFINE_SPINLOCK(pgd_lock); +LIST_HEAD(pgd_list); + +#ifdef CONFIG_X86_32 +static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address) { - unsigned long dummy; - return probe_kernel_address((unsigned long *)p, dummy); + unsigned index = pgd_index(address); + pgd_t *pgd_k; + pud_t *pud, *pud_k; + pmd_t *pmd, *pmd_k; + + pgd += index; + pgd_k = init_mm.pgd + index; + + if (!pgd_present(*pgd_k)) + return NULL; + + /* + * set_pgd(pgd, *pgd_k); here would be useless on PAE + * and redundant with the set_pmd() on non-PAE. As would + * set_pud. + */ + pud = pud_offset(pgd, address); + pud_k = pud_offset(pgd_k, address); + if (!pud_present(*pud_k)) + return NULL; + + pmd = pmd_offset(pud, address); + pmd_k = pmd_offset(pud_k, address); + if (!pmd_present(*pmd_k)) + return NULL; + + if (!pmd_present(*pmd)) { + set_pmd(pmd, *pmd_k); + arch_flush_lazy_mmu_mode(); + } else { + BUG_ON(pmd_page(*pmd) != pmd_page(*pmd_k)); + } + + return pmd_k; +} + +void vmalloc_sync_all(void) +{ + unsigned long address; + + if (SHARED_KERNEL_PMD) + return; + + for (address = VMALLOC_START & PMD_MASK; + address >= TASK_SIZE && address < FIXADDR_TOP; + address += PMD_SIZE) { + + unsigned long flags; + struct page *page; + + spin_lock_irqsave(&pgd_lock, flags); + list_for_each_entry(page, &pgd_list, lru) { + if (!vmalloc_sync_one(page_address(page), address)) + break; + } + spin_unlock_irqrestore(&pgd_lock, flags); + } +} + +/* + * 32-bit: + * + * Handle a fault on the vmalloc or module mapping area + */ +static noinline int vmalloc_fault(unsigned long address) +{ + unsigned long pgd_paddr; + pmd_t *pmd_k; + pte_t *pte_k; + + /* Make sure we are in vmalloc area: */ + if (!(address >= VMALLOC_START && address < VMALLOC_END)) + return -1; + + /* + * Synchronize this task's top level page-table + * with the 'reference' page table. + * + * Do _not_ use "current" here. We might be inside + * an interrupt in the middle of a task switch.. + */ + pgd_paddr = read_cr3(); + pmd_k = vmalloc_sync_one(__va(pgd_paddr), address); + if (!pmd_k) + return -1; + + pte_k = pte_offset_kernel(pmd_k, address); + if (!pte_present(*pte_k)) + return -1; + + return 0; +} + +/* + * Did it hit the DOS screen memory VA from vm86 mode? + */ +static inline void +check_v8086_mode(struct pt_regs *regs, unsigned long address, + struct task_struct *tsk) +{ + unsigned long bit; + + if (!v8086_mode(regs)) + return; + + bit = (address - 0xA0000) >> PAGE_SHIFT; + if (bit < 32) + tsk->thread.screen_bitmap |= 1 << bit; } -#endif static void dump_pagetable(unsigned long address) { -#ifdef CONFIG_X86_32 __typeof__(pte_val(__pte(0))) page; page = read_cr3(); page = ((__typeof__(page) *) __va(page))[address >> PGDIR_SHIFT]; + #ifdef CONFIG_X86_PAE printk("*pdpt = %016Lx ", page); if ((page >> PAGE_SHIFT) < max_low_pfn && page & _PAGE_PRESENT) { page &= PAGE_MASK; page = ((__typeof__(page) *) __va(page))[(address >> PMD_SHIFT) - & (PTRS_PER_PMD - 1)]; + & (PTRS_PER_PMD - 1)]; printk(KERN_CONT "*pde = %016Lx ", page); page &= ~_PAGE_NX; } @@ -217,19 +335,145 @@ static void dump_pagetable(unsigned long * We must not directly access the pte in the highpte * case if the page table is located in highmem. * And let's rather not kmap-atomic the pte, just in case - * it's allocated already. + * it's allocated already: */ if ((page >> PAGE_SHIFT) < max_low_pfn && (page & _PAGE_PRESENT) && !(page & _PAGE_PSE)) { + page &= PAGE_MASK; page = ((__typeof__(page) *) __va(page))[(address >> PAGE_SHIFT) - & (PTRS_PER_PTE - 1)]; + & (PTRS_PER_PTE - 1)]; printk("*pte = %0*Lx ", sizeof(page)*2, (u64)page); } printk("\n"); -#else /* CONFIG_X86_64 */ +} + +#else /* CONFIG_X86_64: */ + +void vmalloc_sync_all(void) +{ + unsigned long address; + + for (address = VMALLOC_START & PGDIR_MASK; address <= VMALLOC_END; + address += PGDIR_SIZE) { + + const pgd_t *pgd_ref = pgd_offset_k(address); + unsigned long flags; + struct page *page; + + if (pgd_none(*pgd_ref)) + continue; + + spin_lock_irqsave(&pgd_lock, flags); + list_for_each_entry(page, &pgd_list, lru) { + pgd_t *pgd; + pgd = (pgd_t *)page_address(page) + pgd_index(address); + if (pgd_none(*pgd)) + set_pgd(pgd, *pgd_ref); + else + BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref)); + } + spin_unlock_irqrestore(&pgd_lock, flags); + } +} + +/* + * 64-bit: + * + * Handle a fault on the vmalloc area + * + * This assumes no large pages in there. + */ +static noinline int vmalloc_fault(unsigned long address) +{ + pgd_t *pgd, *pgd_ref; + pud_t *pud, *pud_ref; + pmd_t *pmd, *pmd_ref; + pte_t *pte, *pte_ref; + + /* Make sure we are in vmalloc area: */ + if (!(address >= VMALLOC_START && address < VMALLOC_END)) + return -1; + + /* + * Copy kernel mappings over when needed. This can also + * happen within a race in page table update. In the later + * case just flush: + */ + pgd = pgd_offset(current->active_mm, address); + pgd_ref = pgd_offset_k(address); + if (pgd_none(*pgd_ref)) + return -1; + + if (pgd_none(*pgd)) + set_pgd(pgd, *pgd_ref); + else + BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref)); + + /* + * Below here mismatches are bugs because these lower tables + * are shared: + */ + + pud = pud_offset(pgd, address); + pud_ref = pud_offset(pgd_ref, address); + if (pud_none(*pud_ref)) + return -1; + + if (pud_none(*pud) || pud_page_vaddr(*pud) != pud_page_vaddr(*pud_ref)) + BUG(); + + pmd = pmd_offset(pud, address); + pmd_ref = pmd_offset(pud_ref, address); + if (pmd_none(*pmd_ref)) + return -1; + + if (pmd_none(*pmd) || pmd_page(*pmd) != pmd_page(*pmd_ref)) + BUG(); + + pte_ref = pte_offset_kernel(pmd_ref, address); + if (!pte_present(*pte_ref)) + return -1; + + pte = pte_offset_kernel(pmd, address); + + /* + * Don't use pte_page here, because the mappings can point + * outside mem_map, and the NUMA hash lookup cannot handle + * that: + */ + if (!pte_present(*pte) || pte_pfn(*pte) != pte_pfn(*pte_ref)) + BUG(); + + return 0; +} + +static const char errata93_warning[] = +KERN_ERR "******* Your BIOS seems to not contain a fix for K8 errata #93\n" +KERN_ERR "******* Working around it, but it may cause SEGVs or burn power.\n" +KERN_ERR "******* Please consider a BIOS update.\n" +KERN_ERR "******* Disabling USB legacy in the BIOS may also help.\n"; + +/* + * No vm86 mode in 64-bit mode: + */ +static inline void +check_v8086_mode(struct pt_regs *regs, unsigned long address, + struct task_struct *tsk) +{ +} + +static int bad_address(void *p) +{ + unsigned long dummy; + + return probe_kernel_address((unsigned long *)p, dummy); +} + +static void dump_pagetable(unsigned long address) +{ pgd_t *pgd; pud_t *pud; pmd_t *pmd; @@ -238,102 +482,77 @@ static void dump_pagetable(unsigned long pgd = (pgd_t *)read_cr3(); pgd = __va((unsigned long)pgd & PHYSICAL_PAGE_MASK); + pgd += pgd_index(address); - if (bad_address(pgd)) goto bad; + if (bad_address(pgd)) + goto bad; + printk("PGD %lx ", pgd_val(*pgd)); - if (!pgd_present(*pgd)) goto ret; + + if (!pgd_present(*pgd)) + goto out; pud = pud_offset(pgd, address); - if (bad_address(pud)) goto bad; + if (bad_address(pud)) + goto bad; + printk("PUD %lx ", pud_val(*pud)); if (!pud_present(*pud) || pud_large(*pud)) - goto ret; + goto out; pmd = pmd_offset(pud, address); - if (bad_address(pmd)) goto bad; + if (bad_address(pmd)) + goto bad; + printk("PMD %lx ", pmd_val(*pmd)); - if (!pmd_present(*pmd) || pmd_large(*pmd)) goto ret; + if (!pmd_present(*pmd) || pmd_large(*pmd)) + goto out; pte = pte_offset_kernel(pmd, address); - if (bad_address(pte)) goto bad; + if (bad_address(pte)) + goto bad; + printk("PTE %lx", pte_val(*pte)); -ret: +out: printk("\n"); return; bad: printk("BAD\n"); -#endif } -#ifdef CONFIG_X86_32 -static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address) -{ - unsigned index = pgd_index(address); - pgd_t *pgd_k; - pud_t *pud, *pud_k; - pmd_t *pmd, *pmd_k; - - pgd += index; - pgd_k = init_mm.pgd + index; +#endif /* CONFIG_X86_64 */ - if (!pgd_present(*pgd_k)) - return NULL; - - /* - * set_pgd(pgd, *pgd_k); here would be useless on PAE - * and redundant with the set_pmd() on non-PAE. As would - * set_pud. - */ - - pud = pud_offset(pgd, address); - pud_k = pud_offset(pgd_k, address); - if (!pud_present(*pud_k)) - return NULL; - - pmd = pmd_offset(pud, address); - pmd_k = pmd_offset(pud_k, address); - if (!pmd_present(*pmd_k)) - return NULL; - if (!pmd_present(*pmd)) { - set_pmd(pmd, *pmd_k); - arch_flush_lazy_mmu_mode(); - } else - BUG_ON(pmd_page(*pmd) != pmd_page(*pmd_k)); - return pmd_k; -} -#endif - -#ifdef CONFIG_X86_64 -static const char errata93_warning[] = -KERN_ERR "******* Your BIOS seems to not contain a fix for K8 errata #93\n" -KERN_ERR "******* Working around it, but it may cause SEGVs or burn power.\n" -KERN_ERR "******* Please consider a BIOS update.\n" -KERN_ERR "******* Disabling USB legacy in the BIOS may also help.\n"; -#endif - -/* Workaround for K8 erratum #93 & buggy BIOS. - BIOS SMM functions are required to use a specific workaround - to avoid corruption of the 64bit RIP register on C stepping K8. - A lot of BIOS that didn't get tested properly miss this. - The OS sees this as a page fault with the upper 32bits of RIP cleared. - Try to work around it here. - Note we only handle faults in kernel here. - Does nothing for X86_32 +/* + * Workaround for K8 erratum #93 & buggy BIOS. + * + * BIOS SMM functions are required to use a specific workaround + * to avoid corruption of the 64bit RIP register on C stepping K8. + * + * A lot of BIOS that didn't get tested properly miss this. + * + * The OS sees this as a page fault with the upper 32bits of RIP cleared. + * Try to work around it here. + * + * Note we only handle faults in kernel here. + * Does nothing on 32-bit. */ static int is_errata93(struct pt_regs *regs, unsigned long address) { #ifdef CONFIG_X86_64 - static int warned; + static int once; + if (address != regs->ip) return 0; + if ((address >> 32) != 0) return 0; + address |= 0xffffffffUL << 32; if ((address >= (u64)_stext && address <= (u64)_etext) || (address >= MODULES_VADDR && address <= MODULES_END)) { - if (!warned) { + if (!once) { printk(errata93_warning); - warned = 1; + once = 1; } regs->ip = address; return 1; @@ -343,16 +562,17 @@ static int is_errata93(struct pt_regs *r } /* - * Work around K8 erratum #100 K8 in compat mode occasionally jumps to illegal - * addresses >4GB. We catch this in the page fault handler because these - * addresses are not reachable. Just detect this case and return. Any code + * Work around K8 erratum #100 K8 in compat mode occasionally jumps + * to illegal addresses >4GB. + * + * We catch this in the page fault handler because these addresses + * are not reachable. Just detect this case and return. Any code * segment in LDT is compatibility mode. */ static int is_errata100(struct pt_regs *regs, unsigned long address) { #ifdef CONFIG_X86_64 - if ((regs->cs == __USER32_CS || (regs->cs & (1<<2))) && - (address >> 32)) + if ((regs->cs == __USER32_CS || (regs->cs & (1<<2))) && (address >> 32)) return 1; #endif return 0; @@ -362,13 +582,15 @@ static int is_f00f_bug(struct pt_regs *r { #ifdef CONFIG_X86_F00F_BUG unsigned long nr; + /* - * Pentium F0 0F C7 C8 bug workaround. + * Pentium F0 0F C7 C8 bug workaround: */ if (boot_cpu_data.f00f_bug) { nr = (address - idt_descr.address) >> 3; if (nr == 6) { + zap_rt_locks(); do_invalid_op(regs, 0); return 1; } @@ -377,62 +599,277 @@ static int is_f00f_bug(struct pt_regs *r return 0; } -static void show_fault_oops(struct pt_regs *regs, unsigned long error_code, - unsigned long address) +static const char nx_warning[] = KERN_CRIT +"kernel tried to execute NX-protected page - exploit attempt? (uid: %d)\n"; + +static void +show_fault_oops(struct pt_regs *regs, unsigned long error_code, + unsigned long address) { -#ifdef CONFIG_X86_32 if (!oops_may_print()) return; -#endif -#ifdef CONFIG_X86_PAE if (error_code & PF_INSTR) { unsigned int level; + pte_t *pte = lookup_address(address, &level); if (pte && pte_present(*pte) && !pte_exec(*pte)) - printk(KERN_CRIT "kernel tried to execute " - "NX-protected page - exploit attempt? " - "(uid: %d)\n", current_uid()); + printk(nx_warning, current_uid()); } -#endif printk(KERN_ALERT "BUG: unable to handle kernel "); if (address < PAGE_SIZE) printk(KERN_CONT "NULL pointer dereference"); else printk(KERN_CONT "paging request"); + printk(KERN_CONT " at %p\n", (void *) address); printk(KERN_ALERT "IP:"); printk_address(regs->ip, 1); + dump_pagetable(address); } -#ifdef CONFIG_X86_64 -static noinline void pgtable_bad(unsigned long address, struct pt_regs *regs, - unsigned long error_code) +static noinline void +pgtable_bad(struct pt_regs *regs, unsigned long error_code, + unsigned long address) { - unsigned long flags = oops_begin(); - int sig = SIGKILL; struct task_struct *tsk; + unsigned long flags; + int sig; + + flags = oops_begin(); + tsk = current; + sig = SIGKILL; printk(KERN_ALERT "%s: Corrupted page table at address %lx\n", - current->comm, address); + tsk->comm, address); dump_pagetable(address); - tsk = current; - tsk->thread.cr2 = address; - tsk->thread.trap_no = 14; - tsk->thread.error_code = error_code; + + tsk->thread.cr2 = address; + tsk->thread.trap_no = 14; + tsk->thread.error_code = error_code; + if (__die("Bad pagetable", regs, error_code)) sig = 0; + oops_end(flags, regs, sig); } -#endif + +static noinline void +no_context(struct pt_regs *regs, unsigned long error_code, + unsigned long address) +{ + struct task_struct *tsk = current; + unsigned long *stackend; + unsigned long flags; + int sig; + + /* Are we prepared to handle this kernel fault? */ + if (fixup_exception(regs)) + return; + + /* + * 32-bit: + * + * Valid to do another page fault here, because if this fault + * had been triggered by is_prefetch fixup_exception would have + * handled it. + * + * 64-bit: + * + * Hall of shame of CPU/BIOS bugs. + */ + if (is_prefetch(regs, error_code, address)) + return; + + if (is_errata93(regs, address)) + return; + + /* + * Oops. The kernel tried to access some bad page. We'll have to + * terminate things with extreme prejudice: + */ + flags = oops_begin(); + + show_fault_oops(regs, error_code, address); + + stackend = end_of_stack(tsk); + if (*stackend != STACK_END_MAGIC) + printk(KERN_ALERT "Thread overran stack, or stack corrupted\n"); + + tsk->thread.cr2 = address; + tsk->thread.trap_no = 14; + tsk->thread.error_code = error_code; + + sig = SIGKILL; + if (__die("Oops", regs, error_code)) + sig = 0; + + /* Executive summary in case the body of the oops scrolled away */ + printk(KERN_EMERG "CR2: %016lx\n", address); + + oops_end(flags, regs, sig); +} + +/* + * Print out info about fatal segfaults, if the show_unhandled_signals + * sysctl is set: + */ +static inline void +show_signal_msg(struct pt_regs *regs, unsigned long error_code, + unsigned long address, struct task_struct *tsk) +{ + if (!unhandled_signal(tsk, SIGSEGV)) + return; + + if (!printk_ratelimit()) + return; + + printk(KERN_CONT "%s%s[%d]: segfault at %lx ip %p sp %p error %lx", + task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG, + tsk->comm, task_pid_nr(tsk), address, + (void *)regs->ip, (void *)regs->sp, error_code); + + print_vma_addr(KERN_CONT " in ", regs->ip); + + printk(KERN_CONT "\n"); +} + +static void +__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code, + unsigned long address, int si_code) +{ + struct task_struct *tsk = current; + + /* User mode accesses just cause a SIGSEGV */ + if (error_code & PF_USER) { + /* + * It's possible to have interrupts off here: + */ + local_irq_enable(); + + /* + * Valid to do another page fault here because this one came + * from user space: + */ + if (is_prefetch(regs, error_code, address)) + return; + + if (is_errata100(regs, address)) + return; + + if (unlikely(show_unhandled_signals)) + show_signal_msg(regs, error_code, address, tsk); + + /* Kernel addresses are always protection faults: */ + tsk->thread.cr2 = address; + tsk->thread.error_code = error_code | (address >= TASK_SIZE); + tsk->thread.trap_no = 14; + + force_sig_info_fault(SIGSEGV, si_code, address, tsk); + + return; + } + + if (is_f00f_bug(regs, address)) + return; + + no_context(regs, error_code, address); +} + +static noinline void +bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code, + unsigned long address) +{ + __bad_area_nosemaphore(regs, error_code, address, SEGV_MAPERR); +} + +static void +__bad_area(struct pt_regs *regs, unsigned long error_code, + unsigned long address, int si_code) +{ + struct mm_struct *mm = current->mm; + + /* + * Something tried to access memory that isn't in our memory map.. + * Fix it, but check if it's kernel or user first.. + */ + up_read(&mm->mmap_sem); + + __bad_area_nosemaphore(regs, error_code, address, si_code); +} + +static noinline void +bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address) +{ + __bad_area(regs, error_code, address, SEGV_MAPERR); +} + +static noinline void +bad_area_access_error(struct pt_regs *regs, unsigned long error_code, + unsigned long address) +{ + __bad_area(regs, error_code, address, SEGV_ACCERR); +} + +/* TODO: fixup for "mm-invoke-oom-killer-from-page-fault.patch" */ +static void +out_of_memory(struct pt_regs *regs, unsigned long error_code, + unsigned long address) +{ + /* + * We ran out of memory, call the OOM killer, and return the userspace + * (which will retry the fault, or kill us if we got oom-killed): + */ + up_read(¤t->mm->mmap_sem); + + pagefault_out_of_memory(); +} + +static void +do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address) +{ + struct task_struct *tsk = current; + struct mm_struct *mm = tsk->mm; + + up_read(&mm->mmap_sem); + + /* Kernel mode? Handle exceptions or die: */ + if (!(error_code & PF_USER)) + no_context(regs, error_code, address); + + /* User-space => ok to do another page fault: */ + if (is_prefetch(regs, error_code, address)) + return; + + tsk->thread.cr2 = address; + tsk->thread.error_code = error_code; + tsk->thread.trap_no = 14; + + force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk); +} + +static noinline void +mm_fault_error(struct pt_regs *regs, unsigned long error_code, + unsigned long address, unsigned int fault) +{ + if (fault & VM_FAULT_OOM) { + out_of_memory(regs, error_code, address); + } else { + if (fault & VM_FAULT_SIGBUS) + do_sigbus(regs, error_code, address); + else + BUG(); + } +} static int spurious_fault_check(unsigned long error_code, pte_t *pte) { if ((error_code & PF_WRITE) && !pte_write(*pte)) return 0; + if ((error_code & PF_INSTR) && !pte_exec(*pte)) return 0; @@ -440,21 +877,25 @@ static int spurious_fault_check(unsigned } /* - * Handle a spurious fault caused by a stale TLB entry. This allows - * us to lazily refresh the TLB when increasing the permissions of a - * kernel page (RO -> RW or NX -> X). Doing it eagerly is very - * expensive since that implies doing a full cross-processor TLB - * flush, even if no stale TLB entries exist on other processors. + * Handle a spurious fault caused by a stale TLB entry. + * + * This allows us to lazily refresh the TLB when increasing the + * permissions of a kernel page (RO -> RW or NX -> X). Doing it + * eagerly is very expensive since that implies doing a full + * cross-processor TLB flush, even if no stale TLB entries exist + * on other processors. + * * There are no security implications to leaving a stale TLB when * increasing the permissions on a page. */ -static int spurious_fault(unsigned long address, - unsigned long error_code) +static noinline int +spurious_fault(unsigned long error_code, unsigned long address) { pgd_t *pgd; pud_t *pud; pmd_t *pmd; pte_t *pte; + int ret; /* Reserved-bit violation or user access to kernel space? */ if (error_code & (PF_USER | PF_RSVD)) @@ -482,126 +923,77 @@ static int spurious_fault(unsigned long if (!pte_present(*pte)) return 0; - return spurious_fault_check(error_code, pte); -} - -/* - * X86_32 - * Handle a fault on the vmalloc or module mapping area - * - * X86_64 - * Handle a fault on the vmalloc area - * - * This assumes no large pages in there. - */ -static int vmalloc_fault(unsigned long address) -{ -#ifdef CONFIG_X86_32 - unsigned long pgd_paddr; - pmd_t *pmd_k; - pte_t *pte_k; - - /* Make sure we are in vmalloc area */ - if (!(address >= VMALLOC_START && address < VMALLOC_END)) - return -1; + ret = spurious_fault_check(error_code, pte); + if (!ret) + return 0; /* - * Synchronize this task's top level page-table - * with the 'reference' page table. - * - * Do _not_ use "current" here. We might be inside - * an interrupt in the middle of a task switch.. + * Make sure we have permissions in PMD. + * If not, then there's a bug in the page tables: */ - pgd_paddr = read_cr3(); - pmd_k = vmalloc_sync_one(__va(pgd_paddr), address); - if (!pmd_k) - return -1; - pte_k = pte_offset_kernel(pmd_k, address); - if (!pte_present(*pte_k)) - return -1; - return 0; -#else - pgd_t *pgd, *pgd_ref; - pud_t *pud, *pud_ref; - pmd_t *pmd, *pmd_ref; - pte_t *pte, *pte_ref; + ret = spurious_fault_check(error_code, (pte_t *) pmd); + WARN_ONCE(!ret, "PMD has incorrect permission bits\n"); - /* Make sure we are in vmalloc area */ - if (!(address >= VMALLOC_START && address < VMALLOC_END)) - return -1; + return ret; +} - /* Copy kernel mappings over when needed. This can also - happen within a race in page table update. In the later - case just flush. */ +int show_unhandled_signals = 1; - pgd = pgd_offset(current->active_mm, address); - pgd_ref = pgd_offset_k(address); - if (pgd_none(*pgd_ref)) - return -1; - if (pgd_none(*pgd)) - set_pgd(pgd, *pgd_ref); - else - BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref)); +static inline int +access_error(unsigned long error_code, int write, struct vm_area_struct *vma) +{ + if (write) { + /* write, present and write, not present: */ + if (unlikely(!(vma->vm_flags & VM_WRITE))) + return 1; + return 0; + } - /* Below here mismatches are bugs because these lower tables - are shared */ + /* read, present: */ + if (unlikely(error_code & PF_PROT)) + return 1; + + /* read, not present: */ + if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))) + return 1; - pud = pud_offset(pgd, address); - pud_ref = pud_offset(pgd_ref, address); - if (pud_none(*pud_ref)) - return -1; - if (pud_none(*pud) || pud_page_vaddr(*pud) != pud_page_vaddr(*pud_ref)) - BUG(); - pmd = pmd_offset(pud, address); - pmd_ref = pmd_offset(pud_ref, address); - if (pmd_none(*pmd_ref)) - return -1; - if (pmd_none(*pmd) || pmd_page(*pmd) != pmd_page(*pmd_ref)) - BUG(); - pte_ref = pte_offset_kernel(pmd_ref, address); - if (!pte_present(*pte_ref)) - return -1; - pte = pte_offset_kernel(pmd, address); - /* Don't use pte_page here, because the mappings can point - outside mem_map, and the NUMA hash lookup cannot handle - that. */ - if (!pte_present(*pte) || pte_pfn(*pte) != pte_pfn(*pte_ref)) - BUG(); return 0; -#endif } -int show_unhandled_signals = 1; +static int fault_in_kernel_space(unsigned long address) +{ + return address >= TASK_SIZE_MAX; +} /* * This routine handles page faults. It determines the address, * and the problem, and then passes it off to one of the appropriate * routines. */ -#ifdef CONFIG_X86_64 -asmlinkage -#endif -void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code) +dotraplinkage void __kprobes +do_page_fault(struct pt_regs *regs, unsigned long error_code) { - struct task_struct *tsk; - struct mm_struct *mm; struct vm_area_struct *vma; + struct task_struct *tsk; unsigned long address; - int write, si_code; + struct mm_struct *mm; + int write; int fault; -#ifdef CONFIG_X86_64 - unsigned long flags; - int sig; -#endif tsk = current; mm = tsk->mm; + prefetchw(&mm->mmap_sem); - /* get the address */ + /* Get the faulting address: */ address = read_cr2(); - si_code = SEGV_MAPERR; + /* + * Detect and handle instructions that would cause a page fault for + * both a tracked kernel page and a userspace page. + */ + if(kmemcheck_active(regs)) + kmemcheck_hide(regs); if (unlikely(kmmio_fault(regs, address))) return; @@ -619,319 +1011,151 @@ void __kprobes do_page_fault(struct pt_r * (error_code & 4) == 0, and that the fault was not a * protection error (error_code & 9) == 0. */ -#ifdef CONFIG_X86_32 - if (unlikely(address >= TASK_SIZE)) { -#else - if (unlikely(address >= TASK_SIZE64)) { -#endif - if (!(error_code & (PF_RSVD|PF_USER|PF_PROT)) && - vmalloc_fault(address) >= 0) - return; + if (unlikely(fault_in_kernel_space(address))) { + if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) { + if (vmalloc_fault(address) >= 0) + return; + + if (kmemcheck_fault(regs, address, error_code)) + return; + } - /* Can handle a stale RO->RW TLB */ - if (spurious_fault(address, error_code)) + /* Can handle a stale RO->RW TLB: */ + if (spurious_fault(error_code, address)) return; - /* kprobes don't want to hook the spurious faults. */ + /* kprobes don't want to hook the spurious faults: */ if (notify_page_fault(regs)) return; /* * Don't take the mm semaphore here. If we fixup a prefetch - * fault we could otherwise deadlock. + * fault we could otherwise deadlock: */ - goto bad_area_nosemaphore; - } + bad_area_nosemaphore(regs, error_code, address); - /* kprobes don't want to hook the spurious faults. */ - if (notify_page_fault(regs)) return; + } + /* kprobes don't want to hook the spurious faults: */ + if (unlikely(notify_page_fault(regs))) + return; /* * It's safe to allow irq's after cr2 has been saved and the * vmalloc fault has been handled. * * User-mode registers count as a user access even for any - * potential system fault or CPU buglet. + * potential system fault or CPU buglet: */ if (user_mode_vm(regs)) { local_irq_enable(); error_code |= PF_USER; - } else if (regs->flags & X86_EFLAGS_IF) - local_irq_enable(); + } else { + if (regs->flags & X86_EFLAGS_IF) + local_irq_enable(); + } -#ifdef CONFIG_X86_64 if (unlikely(error_code & PF_RSVD)) - pgtable_bad(address, regs, error_code); -#endif + pgtable_bad(regs, error_code, address); /* - * If we're in an interrupt, have no user context or are running in an - * atomic region then we must not take the fault. + * If we're in an interrupt, have no user context or are running + * in an atomic region then we must not take the fault: */ - if (unlikely(in_atomic() || !mm)) - goto bad_area_nosemaphore; + if (unlikely(in_atomic() || !mm || current->pagefault_disabled)) { + bad_area_nosemaphore(regs, error_code, address); + return; + } /* * When running in the kernel we expect faults to occur only to - * addresses in user space. All other faults represent errors in the - * kernel and should generate an OOPS. Unfortunately, in the case of an - * erroneous fault occurring in a code path which already holds mmap_sem - * we will deadlock attempting to validate the fault against the - * address space. Luckily the kernel only validly references user - * space from well defined areas of code, which are listed in the - * exceptions table. + * addresses in user space. All other faults represent errors in + * the kernel and should generate an OOPS. Unfortunately, in the + * case of an erroneous fault occurring in a code path which already + * holds mmap_sem we will deadlock attempting to validate the fault + * against the address space. Luckily the kernel only validly + * references user space from well defined areas of code, which are + * listed in the exceptions table. * * As the vast majority of faults will be valid we will only perform - * the source reference check when there is a possibility of a deadlock. - * Attempt to lock the address space, if we cannot we then validate the - * source. If this is invalid we can skip the address space check, - * thus avoiding the deadlock. + * the source reference check when there is a possibility of a + * deadlock. Attempt to lock the address space, if we cannot we then + * validate the source. If this is invalid we can skip the address + * space check, thus avoiding the deadlock: */ - if (!down_read_trylock(&mm->mmap_sem)) { + if (unlikely(!down_read_trylock(&mm->mmap_sem))) { if ((error_code & PF_USER) == 0 && - !search_exception_tables(regs->ip)) - goto bad_area_nosemaphore; + !search_exception_tables(regs->ip)) { + bad_area_nosemaphore(regs, error_code, address); + return; + } down_read(&mm->mmap_sem); + } else { + /* + * The above down_read_trylock() might have succeeded in + * which case we'll have missed the might_sleep() from + * down_read(): + */ + might_sleep(); } vma = find_vma(mm, address); - if (!vma) - goto bad_area; - if (vma->vm_start <= address) + if (unlikely(!vma)) { + bad_area(regs, error_code, address); + return; + } + if (likely(vma->vm_start <= address)) goto good_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; + if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) { + bad_area(regs, error_code, address); + return; + } if (error_code & PF_USER) { /* * Accessing the stack below %sp is always a bug. * The large cushion allows instructions like enter - * and pusha to work. ("enter $65535,$31" pushes + * and pusha to work. ("enter $65535, $31" pushes * 32 pointers and then decrements %sp by 65535.) */ - if (address + 65536 + 32 * sizeof(unsigned long) < regs->sp) - goto bad_area; + if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) { + bad_area(regs, error_code, address); + return; + } } - if (expand_stack(vma, address)) - goto bad_area; -/* - * Ok, we have a good vm_area for this memory access, so - * we can handle it.. - */ + if (unlikely(expand_stack(vma, address))) { + bad_area(regs, error_code, address); + return; + } + + /* + * Ok, we have a good vm_area for this memory access, so + * we can handle it.. + */ good_area: - si_code = SEGV_ACCERR; - write = 0; - switch (error_code & (PF_PROT|PF_WRITE)) { - default: /* 3: write, present */ - /* fall through */ - case PF_WRITE: /* write, not present */ - if (!(vma->vm_flags & VM_WRITE)) - goto bad_area; - write++; - break; - case PF_PROT: /* read, present */ - goto bad_area; - case 0: /* read, not present */ - if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) - goto bad_area; + write = error_code & PF_WRITE; + + if (unlikely(access_error(error_code, write, vma))) { + bad_area_access_error(regs, error_code, address); + return; } /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo - * the fault. + * the fault: */ fault = handle_mm_fault(mm, vma, address, write); + if (unlikely(fault & VM_FAULT_ERROR)) { - if (fault & VM_FAULT_OOM) - goto out_of_memory; - else if (fault & VM_FAULT_SIGBUS) - goto do_sigbus; - BUG(); + mm_fault_error(regs, error_code, address, fault); + return; } + if (fault & VM_FAULT_MAJOR) tsk->maj_flt++; else tsk->min_flt++; -#ifdef CONFIG_X86_32 - /* - * Did it hit the DOS screen memory VA from vm86 mode? - */ - if (v8086_mode(regs)) { - unsigned long bit = (address - 0xA0000) >> PAGE_SHIFT; - if (bit < 32) - tsk->thread.screen_bitmap |= 1 << bit; - } -#endif - up_read(&mm->mmap_sem); - return; - -/* - * Something tried to access memory that isn't in our memory map.. - * Fix it, but check if it's kernel or user first.. - */ -bad_area: - up_read(&mm->mmap_sem); - -bad_area_nosemaphore: - /* User mode accesses just cause a SIGSEGV */ - if (error_code & PF_USER) { - /* - * It's possible to have interrupts off here. - */ - local_irq_enable(); - - /* - * Valid to do another page fault here because this one came - * from user space. - */ - if (is_prefetch(regs, address, error_code)) - return; - - if (is_errata100(regs, address)) - return; - - if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) && - printk_ratelimit()) { - printk( - "%s%s[%d]: segfault at %lx ip %p sp %p error %lx", - task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG, - tsk->comm, task_pid_nr(tsk), address, - (void *) regs->ip, (void *) regs->sp, error_code); - print_vma_addr(" in ", regs->ip); - printk("\n"); - } - - tsk->thread.cr2 = address; - /* Kernel addresses are always protection faults */ - tsk->thread.error_code = error_code | (address >= TASK_SIZE); - tsk->thread.trap_no = 14; - force_sig_info_fault(SIGSEGV, si_code, address, tsk); - return; - } - - if (is_f00f_bug(regs, address)) - return; - -no_context: - /* Are we prepared to handle this kernel fault? */ - if (fixup_exception(regs)) - return; - - /* - * X86_32 - * Valid to do another page fault here, because if this fault - * had been triggered by is_prefetch fixup_exception would have - * handled it. - * - * X86_64 - * Hall of shame of CPU/BIOS bugs. - */ - if (is_prefetch(regs, address, error_code)) - return; - - if (is_errata93(regs, address)) - return; - -/* - * Oops. The kernel tried to access some bad page. We'll have to - * terminate things with extreme prejudice. - */ -#ifdef CONFIG_X86_32 - bust_spinlocks(1); -#else - flags = oops_begin(); -#endif - - show_fault_oops(regs, error_code, address); - - tsk->thread.cr2 = address; - tsk->thread.trap_no = 14; - tsk->thread.error_code = error_code; - -#ifdef CONFIG_X86_32 - die("Oops", regs, error_code); - bust_spinlocks(0); - do_exit(SIGKILL); -#else - sig = SIGKILL; - if (__die("Oops", regs, error_code)) - sig = 0; - /* Executive summary in case the body of the oops scrolled away */ - printk(KERN_EMERG "CR2: %016lx\n", address); - oops_end(flags, regs, sig); -#endif - -out_of_memory: - /* - * We ran out of memory, call the OOM killer, and return the userspace - * (which will retry the fault, or kill us if we got oom-killed). - */ - up_read(&mm->mmap_sem); - pagefault_out_of_memory(); - return; + check_v8086_mode(regs, address, tsk); -do_sigbus: up_read(&mm->mmap_sem); - - /* Kernel mode? Handle exceptions or die */ - if (!(error_code & PF_USER)) - goto no_context; -#ifdef CONFIG_X86_32 - /* User space => ok to do another page fault */ - if (is_prefetch(regs, address, error_code)) - return; -#endif - tsk->thread.cr2 = address; - tsk->thread.error_code = error_code; - tsk->thread.trap_no = 14; - force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk); -} - -DEFINE_SPINLOCK(pgd_lock); -LIST_HEAD(pgd_list); - -void vmalloc_sync_all(void) -{ - unsigned long address; - -#ifdef CONFIG_X86_32 - if (SHARED_KERNEL_PMD) - return; - - for (address = VMALLOC_START & PMD_MASK; - address >= TASK_SIZE && address < FIXADDR_TOP; - address += PMD_SIZE) { - unsigned long flags; - struct page *page; - - spin_lock_irqsave(&pgd_lock, flags); - list_for_each_entry(page, &pgd_list, lru) { - if (!vmalloc_sync_one(page_address(page), - address)) - break; - } - spin_unlock_irqrestore(&pgd_lock, flags); - } -#else /* CONFIG_X86_64 */ - for (address = VMALLOC_START & PGDIR_MASK; address <= VMALLOC_END; - address += PGDIR_SIZE) { - const pgd_t *pgd_ref = pgd_offset_k(address); - unsigned long flags; - struct page *page; - - if (pgd_none(*pgd_ref)) - continue; - spin_lock_irqsave(&pgd_lock, flags); - list_for_each_entry(page, &pgd_list, lru) { - pgd_t *pgd; - pgd = (pgd_t *)page_address(page) + pgd_index(address); - if (pgd_none(*pgd)) - set_pgd(pgd, *pgd_ref); - else - BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref)); - } - spin_unlock_irqrestore(&pgd_lock, flags); - } -#endif } Index: linux-2.6-tip/arch/x86/mm/init_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mm/init_32.c +++ linux-2.6-tip/arch/x86/mm/init_32.c @@ -49,14 +49,13 @@ #include #include #include -#include unsigned int __VMALLOC_RESERVE = 128 << 20; unsigned long max_low_pfn_mapped; unsigned long max_pfn_mapped; -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); unsigned long highstart_pfn, highend_pfn; static noinline int do_test_wp_bit(void); @@ -121,7 +120,7 @@ static pte_t * __init one_page_table_ini pte_t *page_table = NULL; if (after_init_bootmem) { -#ifdef CONFIG_DEBUG_PAGEALLOC +#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK) page_table = (pte_t *) alloc_bootmem_pages(PAGE_SIZE); #endif if (!page_table) @@ -675,75 +674,97 @@ static int __init parse_highmem(char *ar } early_param("highmem", parse_highmem); +#define MSG_HIGHMEM_TOO_BIG \ + "highmem size (%luMB) is bigger than pages available (%luMB)!\n" + +#define MSG_LOWMEM_TOO_SMALL \ + "highmem size (%luMB) results in <64MB lowmem, ignoring it!\n" /* - * Determine low and high memory ranges: + * All of RAM fits into lowmem - but if user wants highmem + * artificially via the highmem=x boot parameter then create + * it: */ -void __init find_low_pfn_range(void) +void __init lowmem_pfn_init(void) { - /* it could update max_pfn */ - /* max_low_pfn is 0, we already have early_res support */ - max_low_pfn = max_pfn; - if (max_low_pfn > MAXMEM_PFN) { - if (highmem_pages == -1) - highmem_pages = max_pfn - MAXMEM_PFN; - if (highmem_pages + MAXMEM_PFN < max_pfn) - max_pfn = MAXMEM_PFN + highmem_pages; - if (highmem_pages + MAXMEM_PFN > max_pfn) { - printk(KERN_WARNING "only %luMB highmem pages " - "available, ignoring highmem size of %uMB.\n", - pages_to_mb(max_pfn - MAXMEM_PFN), + + if (highmem_pages == -1) + highmem_pages = 0; +#ifdef CONFIG_HIGHMEM + if (highmem_pages >= max_pfn) { + printk(KERN_ERR MSG_HIGHMEM_TOO_BIG, + pages_to_mb(highmem_pages), pages_to_mb(max_pfn)); + highmem_pages = 0; + } + if (highmem_pages) { + if (max_low_pfn - highmem_pages < 64*1024*1024/PAGE_SIZE) { + printk(KERN_ERR MSG_LOWMEM_TOO_SMALL, pages_to_mb(highmem_pages)); highmem_pages = 0; } - max_low_pfn = MAXMEM_PFN; + max_low_pfn -= highmem_pages; + } +#else + if (highmem_pages) + printk(KERN_ERR "ignoring highmem size on non-highmem kernel!\n"); +#endif +} + +#define MSG_HIGHMEM_TOO_SMALL \ + "only %luMB highmem pages available, ignoring highmem size of %luMB!\n" + +#define MSG_HIGHMEM_TRIMMED \ + "Warning: only 4GB will be used. Use a HIGHMEM64G enabled kernel!\n" +/* + * We have more RAM than fits into lowmem - we try to put it into + * highmem, also taking the highmem=x boot parameter into account: + */ +void __init highmem_pfn_init(void) +{ + max_low_pfn = MAXMEM_PFN; + + if (highmem_pages == -1) + highmem_pages = max_pfn - MAXMEM_PFN; + + if (highmem_pages + MAXMEM_PFN < max_pfn) + max_pfn = MAXMEM_PFN + highmem_pages; + + if (highmem_pages + MAXMEM_PFN > max_pfn) { + printk(KERN_WARNING MSG_HIGHMEM_TOO_SMALL, + pages_to_mb(max_pfn - MAXMEM_PFN), + pages_to_mb(highmem_pages)); + highmem_pages = 0; + } #ifndef CONFIG_HIGHMEM - /* Maximum memory usable is what is directly addressable */ - printk(KERN_WARNING "Warning only %ldMB will be used.\n", - MAXMEM>>20); - if (max_pfn > MAX_NONPAE_PFN) - printk(KERN_WARNING - "Use a HIGHMEM64G enabled kernel.\n"); - else - printk(KERN_WARNING "Use a HIGHMEM enabled kernel.\n"); - max_pfn = MAXMEM_PFN; + /* Maximum memory usable is what is directly addressable */ + printk(KERN_WARNING "Warning only %ldMB will be used.\n", MAXMEM>>20); + if (max_pfn > MAX_NONPAE_PFN) + printk(KERN_WARNING "Use a HIGHMEM64G enabled kernel.\n"); + else + printk(KERN_WARNING "Use a HIGHMEM enabled kernel.\n"); + max_pfn = MAXMEM_PFN; #else /* !CONFIG_HIGHMEM */ #ifndef CONFIG_HIGHMEM64G - if (max_pfn > MAX_NONPAE_PFN) { - max_pfn = MAX_NONPAE_PFN; - printk(KERN_WARNING "Warning only 4GB will be used." - "Use a HIGHMEM64G enabled kernel.\n"); - } + if (max_pfn > MAX_NONPAE_PFN) { + max_pfn = MAX_NONPAE_PFN; + printk(KERN_WARNING MSG_HIGHMEM_TRIMMED); + } #endif /* !CONFIG_HIGHMEM64G */ #endif /* !CONFIG_HIGHMEM */ - } else { - if (highmem_pages == -1) - highmem_pages = 0; -#ifdef CONFIG_HIGHMEM - if (highmem_pages >= max_pfn) { - printk(KERN_ERR "highmem size specified (%uMB) is " - "bigger than pages available (%luMB)!.\n", - pages_to_mb(highmem_pages), - pages_to_mb(max_pfn)); - highmem_pages = 0; - } - if (highmem_pages) { - if (max_low_pfn - highmem_pages < - 64*1024*1024/PAGE_SIZE){ - printk(KERN_ERR "highmem size %uMB results in " - "smaller than 64MB lowmem, ignoring it.\n" - , pages_to_mb(highmem_pages)); - highmem_pages = 0; - } - max_low_pfn -= highmem_pages; - } -#else - if (highmem_pages) - printk(KERN_ERR "ignoring highmem size on non-highmem" - " kernel!\n"); -#endif - } +} + +/* + * Determine low and high memory ranges: + */ +void __init find_low_pfn_range(void) +{ + /* it could update max_pfn */ + + if (max_pfn <= MAXMEM_PFN) + lowmem_pfn_init(); + else + highmem_pfn_init(); } #ifndef CONFIG_NEED_MULTIPLE_NODES @@ -871,7 +892,7 @@ unsigned long __init_refok init_memory_m pgd_t *pgd_base = swapper_pg_dir; unsigned long start_pfn, end_pfn; unsigned long big_page_start; -#ifdef CONFIG_DEBUG_PAGEALLOC +#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK) /* * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages. * This will simplify cpa(), which otherwise needs to support splitting @@ -1155,17 +1176,47 @@ static noinline int do_test_wp_bit(void) const int rodata_test_data = 0xC3; EXPORT_SYMBOL_GPL(rodata_test_data); +static int kernel_set_to_readonly; + +void set_kernel_text_rw(void) +{ + unsigned long start = PFN_ALIGN(_text); + unsigned long size = PFN_ALIGN(_etext) - start; + + if (!kernel_set_to_readonly) + return; + + pr_debug("Set kernel text: %lx - %lx for read write\n", + start, start+size); + + set_pages_rw(virt_to_page(start), size >> PAGE_SHIFT); +} + +void set_kernel_text_ro(void) +{ + unsigned long start = PFN_ALIGN(_text); + unsigned long size = PFN_ALIGN(_etext) - start; + + if (!kernel_set_to_readonly) + return; + + pr_debug("Set kernel text: %lx - %lx for read only\n", + start, start+size); + + set_pages_ro(virt_to_page(start), size >> PAGE_SHIFT); +} + void mark_rodata_ro(void) { unsigned long start = PFN_ALIGN(_text); unsigned long size = PFN_ALIGN(_etext) - start; -#ifndef CONFIG_DYNAMIC_FTRACE - /* Dynamic tracing modifies the kernel text section */ set_pages_ro(virt_to_page(start), size >> PAGE_SHIFT); printk(KERN_INFO "Write protecting the kernel text: %luk\n", size >> 10); + kernel_set_to_readonly = 1; + #ifdef CONFIG_CPA_DEBUG printk(KERN_INFO "Testing CPA: Reverting %lx-%lx\n", start, start+size); @@ -1174,7 +1225,6 @@ void mark_rodata_ro(void) printk(KERN_INFO "Testing CPA: write protecting again\n"); set_pages_ro(virt_to_page(start), size>>PAGE_SHIFT); #endif -#endif /* CONFIG_DYNAMIC_FTRACE */ start += size; size = (unsigned long)__end_rodata - start; Index: linux-2.6-tip/arch/x86/mm/init_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mm/init_64.c +++ linux-2.6-tip/arch/x86/mm/init_64.c @@ -59,7 +59,7 @@ unsigned long max_pfn_mapped; static unsigned long dma_reserve __initdata; -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); int direct_gbpages #ifdef CONFIG_DIRECT_GBPAGES @@ -689,7 +689,7 @@ unsigned long __init_refok init_memory_m if (!after_bootmem) init_gbpages(); -#ifdef CONFIG_DEBUG_PAGEALLOC +#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK) /* * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages. * This will simplify cpa(), which otherwise needs to support splitting @@ -986,21 +986,48 @@ void free_initmem(void) const int rodata_test_data = 0xC3; EXPORT_SYMBOL_GPL(rodata_test_data); +static int kernel_set_to_readonly; + +void set_kernel_text_rw(void) +{ + unsigned long start = PFN_ALIGN(_stext); + unsigned long end = PFN_ALIGN(__start_rodata); + + if (!kernel_set_to_readonly) + return; + + pr_debug("Set kernel text: %lx - %lx for read write\n", + start, end); + + set_memory_rw(start, (end - start) >> PAGE_SHIFT); +} + +void set_kernel_text_ro(void) +{ + unsigned long start = PFN_ALIGN(_stext); + unsigned long end = PFN_ALIGN(__start_rodata); + + if (!kernel_set_to_readonly) + return; + + pr_debug("Set kernel text: %lx - %lx for read only\n", + start, end); + + set_memory_ro(start, (end - start) >> PAGE_SHIFT); +} + void mark_rodata_ro(void) { unsigned long start = PFN_ALIGN(_stext), end = PFN_ALIGN(__end_rodata); unsigned long rodata_start = ((unsigned long)__start_rodata + PAGE_SIZE - 1) & PAGE_MASK; -#ifdef CONFIG_DYNAMIC_FTRACE - /* Dynamic tracing modifies the kernel text section */ - start = rodata_start; -#endif - printk(KERN_INFO "Write protecting the kernel read-only data: %luk\n", (end - start) >> 10); set_memory_ro(start, (end - start) >> PAGE_SHIFT); + kernel_set_to_readonly = 1; + /* * The rodata section (but not the kernel text!) should also be * not-executable. Index: linux-2.6-tip/arch/x86/mm/ioremap.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mm/ioremap.c +++ linux-2.6-tip/arch/x86/mm/ioremap.c @@ -348,7 +348,7 @@ EXPORT_SYMBOL(ioremap_nocache); * * Must be freed with iounmap. */ -void __iomem *ioremap_wc(unsigned long phys_addr, unsigned long size) +void __iomem *ioremap_wc(resource_size_t phys_addr, unsigned long size) { if (pat_enabled) return __ioremap_caller(phys_addr, size, _PAGE_CACHE_WC, Index: linux-2.6-tip/arch/x86/mm/kmemcheck/Makefile =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/mm/kmemcheck/Makefile @@ -0,0 +1 @@ +obj-y := error.o kmemcheck.o opcode.o pte.o shadow.o Index: linux-2.6-tip/arch/x86/mm/kmemcheck/error.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/mm/kmemcheck/error.c @@ -0,0 +1,229 @@ +#include +#include +#include +#include +#include +#include +#include +#include + +#include "error.h" +#include "shadow.h" + +enum kmemcheck_error_type { + KMEMCHECK_ERROR_INVALID_ACCESS, + KMEMCHECK_ERROR_BUG, +}; + +#define SHADOW_COPY_SIZE (1 << CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT) + +struct kmemcheck_error { + enum kmemcheck_error_type type; + + union { + /* KMEMCHECK_ERROR_INVALID_ACCESS */ + struct { + /* Kind of access that caused the error */ + enum kmemcheck_shadow state; + /* Address and size of the erroneous read */ + unsigned long address; + unsigned int size; + }; + }; + + struct pt_regs regs; + struct stack_trace trace; + unsigned long trace_entries[32]; + + /* We compress it to a char. */ + unsigned char shadow_copy[SHADOW_COPY_SIZE]; + unsigned char memory_copy[SHADOW_COPY_SIZE]; +}; + +/* + * Create a ring queue of errors to output. We can't call printk() directly + * from the kmemcheck traps, since this may call the console drivers and + * result in a recursive fault. + */ +static struct kmemcheck_error error_fifo[CONFIG_KMEMCHECK_QUEUE_SIZE]; +static unsigned int error_count; +static unsigned int error_rd; +static unsigned int error_wr; +static unsigned int error_missed_count; + +static struct kmemcheck_error *error_next_wr(void) +{ + struct kmemcheck_error *e; + + if (error_count == ARRAY_SIZE(error_fifo)) { + ++error_missed_count; + return NULL; + } + + e = &error_fifo[error_wr]; + if (++error_wr == ARRAY_SIZE(error_fifo)) + error_wr = 0; + ++error_count; + return e; +} + +static struct kmemcheck_error *error_next_rd(void) +{ + struct kmemcheck_error *e; + + if (error_count == 0) + return NULL; + + e = &error_fifo[error_rd]; + if (++error_rd == ARRAY_SIZE(error_fifo)) + error_rd = 0; + --error_count; + return e; +} + +static void do_wakeup(unsigned long); +static DECLARE_TASKLET(kmemcheck_tasklet, &do_wakeup, 0); + +/* + * Save the context of an error report. + */ +void kmemcheck_error_save(enum kmemcheck_shadow state, + unsigned long address, unsigned int size, struct pt_regs *regs) +{ + static unsigned long prev_ip; + + struct kmemcheck_error *e; + void *shadow_copy; + void *memory_copy; + + /* Don't report several adjacent errors from the same EIP. */ + if (regs->ip == prev_ip) + return; + prev_ip = regs->ip; + + e = error_next_wr(); + if (!e) + return; + + e->type = KMEMCHECK_ERROR_INVALID_ACCESS; + + e->state = state; + e->address = address; + e->size = size; + + /* Save regs */ + memcpy(&e->regs, regs, sizeof(*regs)); + + /* Save stack trace */ + e->trace.nr_entries = 0; + e->trace.entries = e->trace_entries; + e->trace.max_entries = ARRAY_SIZE(e->trace_entries); + e->trace.skip = 0; + save_stack_trace_bp(&e->trace, regs->bp); + + /* Round address down to nearest 16 bytes */ + shadow_copy = kmemcheck_shadow_lookup(address + & ~(SHADOW_COPY_SIZE - 1)); + BUG_ON(!shadow_copy); + + memcpy(e->shadow_copy, shadow_copy, SHADOW_COPY_SIZE); + + kmemcheck_show_addr(address); + memory_copy = (void *) (address & ~(SHADOW_COPY_SIZE - 1)); + memcpy(e->memory_copy, memory_copy, SHADOW_COPY_SIZE); + kmemcheck_hide_addr(address); + + tasklet_hi_schedule_first(&kmemcheck_tasklet); +} + +/* + * Save the context of a kmemcheck bug. + */ +void kmemcheck_error_save_bug(struct pt_regs *regs) +{ + struct kmemcheck_error *e; + + e = error_next_wr(); + if (!e) + return; + + e->type = KMEMCHECK_ERROR_BUG; + + memcpy(&e->regs, regs, sizeof(*regs)); + + e->trace.nr_entries = 0; + e->trace.entries = e->trace_entries; + e->trace.max_entries = ARRAY_SIZE(e->trace_entries); + e->trace.skip = 1; + save_stack_trace(&e->trace); + + tasklet_hi_schedule_first(&kmemcheck_tasklet); +} + +void kmemcheck_error_recall(void) +{ + static const char *desc[] = { + [KMEMCHECK_SHADOW_UNALLOCATED] = "unallocated", + [KMEMCHECK_SHADOW_UNINITIALIZED] = "uninitialized", + [KMEMCHECK_SHADOW_INITIALIZED] = "initialized", + [KMEMCHECK_SHADOW_FREED] = "freed", + }; + + static const char short_desc[] = { + [KMEMCHECK_SHADOW_UNALLOCATED] = 'a', + [KMEMCHECK_SHADOW_UNINITIALIZED] = 'u', + [KMEMCHECK_SHADOW_INITIALIZED] = 'i', + [KMEMCHECK_SHADOW_FREED] = 'f', + }; + + struct kmemcheck_error *e; + unsigned int i; + + e = error_next_rd(); + if (!e) + return; + + switch (e->type) { + case KMEMCHECK_ERROR_INVALID_ACCESS: + printk(KERN_ERR "WARNING: kmemcheck: Caught %d-bit read " + "from %s memory (%p)\n", + 8 * e->size, e->state < ARRAY_SIZE(desc) ? + desc[e->state] : "(invalid shadow state)", + (void *) e->address); + + printk(KERN_INFO); + for (i = 0; i < SHADOW_COPY_SIZE; ++i) + printk("%02x", e->memory_copy[i]); + printk("\n"); + + printk(KERN_INFO); + for (i = 0; i < SHADOW_COPY_SIZE; ++i) { + if (e->shadow_copy[i] < ARRAY_SIZE(short_desc)) + printk(" %c", short_desc[e->shadow_copy[i]]); + else + printk(" ?"); + } + printk("\n"); + printk(KERN_INFO "%*c\n", 2 + 2 + * (int) (e->address & (SHADOW_COPY_SIZE - 1)), '^'); + break; + case KMEMCHECK_ERROR_BUG: + printk(KERN_EMERG "ERROR: kmemcheck: Fatal error\n"); + break; + } + + __show_regs(&e->regs, 1); + print_stack_trace(&e->trace, 0); +} + +static void do_wakeup(unsigned long data) +{ + while (error_count > 0) + kmemcheck_error_recall(); + + if (error_missed_count > 0) { + printk(KERN_WARNING "kmemcheck: Lost %d error reports because " + "the queue was too small\n", error_missed_count); + error_missed_count = 0; + } +} Index: linux-2.6-tip/arch/x86/mm/kmemcheck/error.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/mm/kmemcheck/error.h @@ -0,0 +1,15 @@ +#ifndef ARCH__X86__MM__KMEMCHECK__ERROR_H +#define ARCH__X86__MM__KMEMCHECK__ERROR_H + +#include + +#include "shadow.h" + +void kmemcheck_error_save(enum kmemcheck_shadow state, + unsigned long address, unsigned int size, struct pt_regs *regs); + +void kmemcheck_error_save_bug(struct pt_regs *regs); + +void kmemcheck_error_recall(void); + +#endif Index: linux-2.6-tip/arch/x86/mm/kmemcheck/kmemcheck.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/mm/kmemcheck/kmemcheck.c @@ -0,0 +1,632 @@ +/** + * kmemcheck - a heavyweight memory checker for the linux kernel + * Copyright (C) 2007, 2008 Vegard Nossum + * (With a lot of help from Ingo Molnar and Pekka Enberg.) + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License (version 2) as + * published by the Free Software Foundation. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +#include "error.h" +#include "opcode.h" +#include "pte.h" +#include "shadow.h" + +void __init kmemcheck_init(void) +{ + printk(KERN_INFO "kmemcheck: \"Bugs, beware!\"\n"); + +#ifdef CONFIG_SMP + /* + * Limit SMP to use a single CPU. We rely on the fact that this code + * runs before SMP is set up. + */ + if (setup_max_cpus > 1) { + printk(KERN_INFO + "kmemcheck: Limiting number of CPUs to 1.\n"); + setup_max_cpus = 1; + } +#endif +} + +#ifdef CONFIG_KMEMCHECK_DISABLED_BY_DEFAULT +int kmemcheck_enabled = 0; +#endif + +#ifdef CONFIG_KMEMCHECK_ENABLED_BY_DEFAULT +int kmemcheck_enabled = 1; +#endif + +#ifdef CONFIG_KMEMCHECK_ONESHOT_BY_DEFAULT +int kmemcheck_enabled = 2; +#endif + +/* + * We need to parse the kmemcheck= option before any memory is allocated. + */ +static int __init param_kmemcheck(char *str) +{ + if (!str) + return -EINVAL; + + sscanf(str, "%d", &kmemcheck_enabled); + return 0; +} + +early_param("kmemcheck", param_kmemcheck); + +int kmemcheck_show_addr(unsigned long address) +{ + pte_t *pte; + + pte = kmemcheck_pte_lookup(address); + if (!pte) + return 0; + + set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT)); + __flush_tlb_one(address); + return 1; +} + +int kmemcheck_hide_addr(unsigned long address) +{ + pte_t *pte; + + pte = kmemcheck_pte_lookup(address); + if (!pte) + return 0; + + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT)); + __flush_tlb_one(address); + return 1; +} + +struct kmemcheck_context { + bool busy; + int balance; + + /* + * There can be at most two memory operands to an instruction, but + * each address can cross a page boundary -- so we may need up to + * four addresses that must be hidden/revealed for each fault. + */ + unsigned long addr[4]; + unsigned long n_addrs; + unsigned long flags; + + /* Data size of the instruction that caused a fault. */ + unsigned int size; +}; + +static DEFINE_PER_CPU(struct kmemcheck_context, kmemcheck_context); + +bool kmemcheck_active(struct pt_regs *regs) +{ + struct kmemcheck_context *data = &__get_cpu_var(kmemcheck_context); + + return data->balance > 0; +} + +/* Save an address that needs to be shown/hidden */ +static void kmemcheck_save_addr(unsigned long addr) +{ + struct kmemcheck_context *data = &__get_cpu_var(kmemcheck_context); + + BUG_ON(data->n_addrs >= ARRAY_SIZE(data->addr)); + data->addr[data->n_addrs++] = addr; +} + +static unsigned int kmemcheck_show_all(void) +{ + struct kmemcheck_context *data = &__get_cpu_var(kmemcheck_context); + unsigned int i; + unsigned int n; + + n = 0; + for (i = 0; i < data->n_addrs; ++i) + n += kmemcheck_show_addr(data->addr[i]); + + return n; +} + +static unsigned int kmemcheck_hide_all(void) +{ + struct kmemcheck_context *data = &__get_cpu_var(kmemcheck_context); + unsigned int i; + unsigned int n; + + n = 0; + for (i = 0; i < data->n_addrs; ++i) + n += kmemcheck_hide_addr(data->addr[i]); + + return n; +} + +/* + * Called from the #PF handler. + */ +void kmemcheck_show(struct pt_regs *regs) +{ + struct kmemcheck_context *data = &__get_cpu_var(kmemcheck_context); + + BUG_ON(!irqs_disabled()); + + if (unlikely(data->balance != 0)) { + kmemcheck_show_all(); + kmemcheck_error_save_bug(regs); + data->balance = 0; + return; + } + + /* + * None of the addresses actually belonged to kmemcheck. Note that + * this is not an error. + */ + if (kmemcheck_show_all() == 0) + return; + + ++data->balance; + + /* + * The IF needs to be cleared as well, so that the faulting + * instruction can run "uninterrupted". Otherwise, we might take + * an interrupt and start executing that before we've had a chance + * to hide the page again. + * + * NOTE: In the rare case of multiple faults, we must not override + * the original flags: + */ + if (!(regs->flags & X86_EFLAGS_TF)) + data->flags = regs->flags; + + regs->flags |= X86_EFLAGS_TF; + regs->flags &= ~X86_EFLAGS_IF; +} + +/* + * Called from the #DB handler. + */ +void kmemcheck_hide(struct pt_regs *regs) +{ + struct kmemcheck_context *data = &__get_cpu_var(kmemcheck_context); + int n; + + BUG_ON(!irqs_disabled()); + + if (data->balance == 0) + return; + + if (unlikely(data->balance != 1)) { + kmemcheck_show_all(); + kmemcheck_error_save_bug(regs); + data->n_addrs = 0; + data->balance = 0; + + if (!(data->flags & X86_EFLAGS_TF)) + regs->flags &= ~X86_EFLAGS_TF; + if (data->flags & X86_EFLAGS_IF) + regs->flags |= X86_EFLAGS_IF; + return; + } + + if (kmemcheck_enabled) + n = kmemcheck_hide_all(); + else + n = kmemcheck_show_all(); + + if (n == 0) + return; + + --data->balance; + + data->n_addrs = 0; + + if (!(data->flags & X86_EFLAGS_TF)) + regs->flags &= ~X86_EFLAGS_TF; + if (data->flags & X86_EFLAGS_IF) + regs->flags |= X86_EFLAGS_IF; +} + +void kmemcheck_show_pages(struct page *p, unsigned int n) +{ + unsigned int i; + + for (i = 0; i < n; ++i) { + unsigned long address; + pte_t *pte; + unsigned int level; + + address = (unsigned long) page_address(&p[i]); + pte = lookup_address(address, &level); + BUG_ON(!pte); + BUG_ON(level != PG_LEVEL_4K); + + set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT)); + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_HIDDEN)); + __flush_tlb_one(address); + } +} + +bool kmemcheck_page_is_tracked(struct page *p) +{ + /* This will also check the "hidden" flag of the PTE. */ + return kmemcheck_pte_lookup((unsigned long) page_address(p)); +} + +void kmemcheck_hide_pages(struct page *p, unsigned int n) +{ + unsigned int i; + + for (i = 0; i < n; ++i) { + unsigned long address; + pte_t *pte; + unsigned int level; + + address = (unsigned long) page_address(&p[i]); + pte = lookup_address(address, &level); + BUG_ON(!pte); + BUG_ON(level != PG_LEVEL_4K); + + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT)); + set_pte(pte, __pte(pte_val(*pte) | _PAGE_HIDDEN)); + __flush_tlb_one(address); + } +} + +/* Access may NOT cross page boundary */ +static void kmemcheck_read_strict(struct pt_regs *regs, + unsigned long addr, unsigned int size) +{ + void *shadow; + enum kmemcheck_shadow status; + + shadow = kmemcheck_shadow_lookup(addr); + if (!shadow) + return; + + kmemcheck_save_addr(addr); + status = kmemcheck_shadow_test(shadow, size); + if (status == KMEMCHECK_SHADOW_INITIALIZED) + return; + + if (kmemcheck_enabled) + kmemcheck_error_save(status, addr, size, regs); + + if (kmemcheck_enabled == 2) + kmemcheck_enabled = 0; + + /* Don't warn about it again. */ + kmemcheck_shadow_set(shadow, size); +} + +/* Access may cross page boundary */ +static void kmemcheck_read(struct pt_regs *regs, + unsigned long addr, unsigned int size) +{ + unsigned long page = addr & PAGE_MASK; + unsigned long next_addr = addr + size - 1; + unsigned long next_page = next_addr & PAGE_MASK; + + if (likely(page == next_page)) { + kmemcheck_read_strict(regs, addr, size); + return; + } + + /* + * What we do is basically to split the access across the + * two pages and handle each part separately. Yes, this means + * that we may now see reads that are 3 + 5 bytes, for + * example (and if both are uninitialized, there will be two + * reports), but it makes the code a lot simpler. + */ + kmemcheck_read_strict(regs, addr, next_page - addr); + kmemcheck_read_strict(regs, next_page, next_addr - next_page); +} + +static void kmemcheck_write_strict(struct pt_regs *regs, + unsigned long addr, unsigned int size) +{ + void *shadow; + + shadow = kmemcheck_shadow_lookup(addr); + if (!shadow) + return; + + kmemcheck_save_addr(addr); + kmemcheck_shadow_set(shadow, size); +} + +static void kmemcheck_write(struct pt_regs *regs, + unsigned long addr, unsigned int size) +{ + unsigned long page = addr & PAGE_MASK; + unsigned long next_addr = addr + size - 1; + unsigned long next_page = next_addr & PAGE_MASK; + + if (likely(page == next_page)) { + kmemcheck_write_strict(regs, addr, size); + return; + } + + /* See comment in kmemcheck_read(). */ + kmemcheck_write_strict(regs, addr, next_page - addr); + kmemcheck_write_strict(regs, next_page, next_addr - next_page); +} + +/* + * Copying is hard. We have two addresses, each of which may be split across + * a page (and each page will have different shadow addresses). + */ +static void kmemcheck_copy(struct pt_regs *regs, + unsigned long src_addr, unsigned long dst_addr, unsigned int size) +{ + uint8_t shadow[8]; + enum kmemcheck_shadow status; + + unsigned long page; + unsigned long next_addr; + unsigned long next_page; + + uint8_t *x; + unsigned int i; + unsigned int n; + + BUG_ON(size > sizeof(shadow)); + + page = src_addr & PAGE_MASK; + next_addr = src_addr + size - 1; + next_page = next_addr & PAGE_MASK; + + if (likely(page == next_page)) { + /* Same page */ + x = kmemcheck_shadow_lookup(src_addr); + if (x) { + kmemcheck_save_addr(src_addr); + for (i = 0; i < size; ++i) + shadow[i] = x[i]; + } else { + for (i = 0; i < size; ++i) + shadow[i] = KMEMCHECK_SHADOW_INITIALIZED; + } + } else { + n = next_page - src_addr; + BUG_ON(n > sizeof(shadow)); + + /* First page */ + x = kmemcheck_shadow_lookup(src_addr); + if (x) { + kmemcheck_save_addr(src_addr); + for (i = 0; i < n; ++i) + shadow[i] = x[i]; + } else { + /* Not tracked */ + for (i = 0; i < n; ++i) + shadow[i] = KMEMCHECK_SHADOW_INITIALIZED; + } + + /* Second page */ + x = kmemcheck_shadow_lookup(next_page); + if (x) { + kmemcheck_save_addr(next_page); + for (i = n; i < size; ++i) + shadow[i] = x[i - n]; + } else { + /* Not tracked */ + for (i = n; i < size; ++i) + shadow[i] = KMEMCHECK_SHADOW_INITIALIZED; + } + } + + page = dst_addr & PAGE_MASK; + next_addr = dst_addr + size - 1; + next_page = next_addr & PAGE_MASK; + + if (likely(page == next_page)) { + /* Same page */ + x = kmemcheck_shadow_lookup(dst_addr); + if (x) { + kmemcheck_save_addr(dst_addr); + for (i = 0; i < size; ++i) { + x[i] = shadow[i]; + shadow[i] = KMEMCHECK_SHADOW_INITIALIZED; + } + } + } else { + n = next_page - dst_addr; + BUG_ON(n > sizeof(shadow)); + + /* First page */ + x = kmemcheck_shadow_lookup(dst_addr); + if (x) { + kmemcheck_save_addr(dst_addr); + for (i = 0; i < n; ++i) { + x[i] = shadow[i]; + shadow[i] = KMEMCHECK_SHADOW_INITIALIZED; + } + } + + /* Second page */ + x = kmemcheck_shadow_lookup(next_page); + if (x) { + kmemcheck_save_addr(next_page); + for (i = n; i < size; ++i) { + x[i - n] = shadow[i]; + shadow[i] = KMEMCHECK_SHADOW_INITIALIZED; + } + } + } + + status = kmemcheck_shadow_test(shadow, size); + if (status == KMEMCHECK_SHADOW_INITIALIZED) + return; + + if (kmemcheck_enabled) + kmemcheck_error_save(status, src_addr, size, regs); + + if (kmemcheck_enabled == 2) + kmemcheck_enabled = 0; +} + +enum kmemcheck_method { + KMEMCHECK_READ, + KMEMCHECK_WRITE, +}; + +static void kmemcheck_access(struct pt_regs *regs, + unsigned long fallback_address, enum kmemcheck_method fallback_method) +{ + const uint8_t *insn; + const uint8_t *insn_primary; + unsigned int size; + + struct kmemcheck_context *data = &__get_cpu_var(kmemcheck_context); + + /* Recursive fault -- ouch. */ + if (data->busy) { + kmemcheck_show_addr(fallback_address); + kmemcheck_error_save_bug(regs); + return; + } + + data->busy = true; + + insn = (const uint8_t *) regs->ip; + insn_primary = kmemcheck_opcode_get_primary(insn); + + kmemcheck_opcode_decode(insn, &size); + + switch (insn_primary[0]) { +#ifdef CONFIG_KMEMCHECK_BITOPS_OK + /* AND, OR, XOR */ + /* + * Unfortunately, these instructions have to be excluded from + * our regular checking since they access only some (and not + * all) bits. This clears out "bogus" bitfield-access warnings. + */ + case 0x80: + case 0x81: + case 0x82: + case 0x83: + switch ((insn_primary[1] >> 3) & 7) { + /* OR */ + case 1: + /* AND */ + case 4: + /* XOR */ + case 6: + kmemcheck_write(regs, fallback_address, size); + goto out; + + /* ADD */ + case 0: + /* ADC */ + case 2: + /* SBB */ + case 3: + /* SUB */ + case 5: + /* CMP */ + case 7: + break; + } + break; +#endif + + /* MOVS, MOVSB, MOVSW, MOVSD */ + case 0xa4: + case 0xa5: + /* + * These instructions are special because they take two + * addresses, but we only get one page fault. + */ + kmemcheck_copy(regs, regs->si, regs->di, size); + goto out; + + /* CMPS, CMPSB, CMPSW, CMPSD */ + case 0xa6: + case 0xa7: + kmemcheck_read(regs, regs->si, size); + kmemcheck_read(regs, regs->di, size); + goto out; + } + + /* + * If the opcode isn't special in any way, we use the data from the + * page fault handler to determine the address and type of memory + * access. + */ + switch (fallback_method) { + case KMEMCHECK_READ: + kmemcheck_read(regs, fallback_address, size); + goto out; + case KMEMCHECK_WRITE: + kmemcheck_write(regs, fallback_address, size); + goto out; + } + +out: + data->busy = false; +} + +bool kmemcheck_fault(struct pt_regs *regs, unsigned long address, + unsigned long error_code) +{ + pte_t *pte; + unsigned int level; + + /* + * XXX: Is it safe to assume that memory accesses from virtual 86 + * mode or non-kernel code segments will _never_ access kernel + * memory (e.g. tracked pages)? For now, we need this to avoid + * invoking kmemcheck for PnP BIOS calls. + */ + if (regs->flags & X86_VM_MASK) + return false; + if (regs->cs != __KERNEL_CS) + return false; + + pte = lookup_address(address, &level); + if (!pte) + return false; + if (level != PG_LEVEL_4K) + return false; + if (!pte_hidden(*pte)) + return false; + + if (error_code & 2) + kmemcheck_access(regs, address, KMEMCHECK_WRITE); + else + kmemcheck_access(regs, address, KMEMCHECK_READ); + + kmemcheck_show(regs); + return true; +} + +bool kmemcheck_trap(struct pt_regs *regs) +{ + if (!kmemcheck_active(regs)) + return false; + + /* We're done. */ + kmemcheck_hide(regs); + return true; +} Index: linux-2.6-tip/arch/x86/mm/kmemcheck/opcode.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/mm/kmemcheck/opcode.c @@ -0,0 +1,79 @@ +#include + +#include "opcode.h" + +static bool opcode_is_prefix(uint8_t b) +{ + return + /* Group 1 */ + b == 0xf0 || b == 0xf2 || b == 0xf3 + /* Group 2 */ + || b == 0x2e || b == 0x36 || b == 0x3e || b == 0x26 + || b == 0x64 || b == 0x65 || b == 0x2e || b == 0x3e + /* Group 3 */ + || b == 0x66 + /* Group 4 */ + || b == 0x67; +} + +static bool opcode_is_rex_prefix(uint8_t b) +{ + return (b & 0xf0) == 0x40; +} + +/* + * This is a VERY crude opcode decoder. We only need to find the size of the + * load/store that caused our #PF and this should work for all the opcodes + * that we care about. Moreover, the ones who invented this instruction set + * should be shot. + */ +void kmemcheck_opcode_decode(const uint8_t *op, unsigned int *size) +{ + /* Default operand size */ + int operand_size_override = 4; + + /* prefixes */ + for (; opcode_is_prefix(*op); ++op) { + if (*op == 0x66) + operand_size_override = 2; + } + +#ifdef CONFIG_X86_64 + /* REX prefix */ + if (opcode_is_rex_prefix(*op)) { + if (*op & 0x08) { + *size = 8; + return; + } + + ++op; + } +#endif + + /* escape opcode */ + if (*op == 0x0f) { + ++op; + + if (*op == 0xb6) { + *size = 1; + return; + } + + if (*op == 0xb7) { + *size = 2; + return; + } + } + + *size = (*op & 1) ? operand_size_override : 1; +} + +const uint8_t *kmemcheck_opcode_get_primary(const uint8_t *op) +{ + /* skip prefixes */ + while (opcode_is_prefix(*op)) + ++op; + if (opcode_is_rex_prefix(*op)) + ++op; + return op; +} Index: linux-2.6-tip/arch/x86/mm/kmemcheck/opcode.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/mm/kmemcheck/opcode.h @@ -0,0 +1,9 @@ +#ifndef ARCH__X86__MM__KMEMCHECK__OPCODE_H +#define ARCH__X86__MM__KMEMCHECK__OPCODE_H + +#include + +void kmemcheck_opcode_decode(const uint8_t *op, unsigned int *size); +const uint8_t *kmemcheck_opcode_get_primary(const uint8_t *op); + +#endif Index: linux-2.6-tip/arch/x86/mm/kmemcheck/pte.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/mm/kmemcheck/pte.c @@ -0,0 +1,22 @@ +#include + +#include + +#include "pte.h" + +pte_t *kmemcheck_pte_lookup(unsigned long address) +{ + pte_t *pte; + unsigned int level; + + pte = lookup_address(address, &level); + if (!pte) + return NULL; + if (level != PG_LEVEL_4K) + return NULL; + if (!pte_hidden(*pte)) + return NULL; + + return pte; +} + Index: linux-2.6-tip/arch/x86/mm/kmemcheck/pte.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/mm/kmemcheck/pte.h @@ -0,0 +1,10 @@ +#ifndef ARCH__X86__MM__KMEMCHECK__PTE_H +#define ARCH__X86__MM__KMEMCHECK__PTE_H + +#include + +#include + +pte_t *kmemcheck_pte_lookup(unsigned long address); + +#endif Index: linux-2.6-tip/arch/x86/mm/kmemcheck/shadow.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/mm/kmemcheck/shadow.c @@ -0,0 +1,162 @@ +#include +#include +#include +#include + +#include +#include + +#include "pte.h" +#include "shadow.h" + +/* + * Return the shadow address for the given address. Returns NULL if the + * address is not tracked. + * + * We need to be extremely careful not to follow any invalid pointers, + * because this function can be called for *any* possible address. + */ +void *kmemcheck_shadow_lookup(unsigned long address) +{ + pte_t *pte; + struct page *page; + + if (!virt_addr_valid(address)) + return NULL; + + pte = kmemcheck_pte_lookup(address); + if (!pte) + return NULL; + + page = virt_to_page(address); + if (!page->shadow) + return NULL; + return page->shadow + (address & (PAGE_SIZE - 1)); +} + +static void mark_shadow(void *address, unsigned int n, + enum kmemcheck_shadow status) +{ + unsigned long addr = (unsigned long) address; + unsigned long last_addr = addr + n - 1; + unsigned long page = addr & PAGE_MASK; + unsigned long last_page = last_addr & PAGE_MASK; + unsigned int first_n; + void *shadow; + + /* If the memory range crosses a page boundary, stop there. */ + if (page == last_page) + first_n = n; + else + first_n = page + PAGE_SIZE - addr; + + shadow = kmemcheck_shadow_lookup(addr); + if (shadow) + memset(shadow, status, first_n); + + addr += first_n; + n -= first_n; + + /* Do full-page memset()s. */ + while (n >= PAGE_SIZE) { + shadow = kmemcheck_shadow_lookup(addr); + if (shadow) + memset(shadow, status, PAGE_SIZE); + + addr += PAGE_SIZE; + n -= PAGE_SIZE; + } + + /* Do the remaining page, if any. */ + if (n > 0) { + shadow = kmemcheck_shadow_lookup(addr); + if (shadow) + memset(shadow, status, n); + } +} + +void kmemcheck_mark_unallocated(void *address, unsigned int n) +{ + mark_shadow(address, n, KMEMCHECK_SHADOW_UNALLOCATED); +} + +void kmemcheck_mark_uninitialized(void *address, unsigned int n) +{ + mark_shadow(address, n, KMEMCHECK_SHADOW_UNINITIALIZED); +} + +/* + * Fill the shadow memory of the given address such that the memory at that + * address is marked as being initialized. + */ +void kmemcheck_mark_initialized(void *address, unsigned int n) +{ + mark_shadow(address, n, KMEMCHECK_SHADOW_INITIALIZED); +} +EXPORT_SYMBOL_GPL(kmemcheck_mark_initialized); + +void kmemcheck_mark_freed(void *address, unsigned int n) +{ + mark_shadow(address, n, KMEMCHECK_SHADOW_FREED); +} + +void kmemcheck_mark_unallocated_pages(struct page *p, unsigned int n) +{ + unsigned int i; + + for (i = 0; i < n; ++i) + kmemcheck_mark_unallocated(page_address(&p[i]), PAGE_SIZE); +} + +void kmemcheck_mark_uninitialized_pages(struct page *p, unsigned int n) +{ + unsigned int i; + + for (i = 0; i < n; ++i) + kmemcheck_mark_uninitialized(page_address(&p[i]), PAGE_SIZE); +} + +void kmemcheck_mark_initialized_pages(struct page *p, unsigned int n) +{ + unsigned int i; + + for (i = 0; i < n; ++i) + kmemcheck_mark_initialized(page_address(&p[i]), PAGE_SIZE); +} + +enum kmemcheck_shadow kmemcheck_shadow_test(void *shadow, unsigned int size) +{ + uint8_t *x; + unsigned int i; + + x = shadow; + +#ifdef CONFIG_KMEMCHECK_PARTIAL_OK + /* + * Make sure _some_ bytes are initialized. Gcc frequently generates + * code to access neighboring bytes. + */ + for (i = 0; i < size; ++i) { + if (x[i] == KMEMCHECK_SHADOW_INITIALIZED) + return x[i]; + } +#else + /* All bytes must be initialized. */ + for (i = 0; i < size; ++i) { + if (x[i] != KMEMCHECK_SHADOW_INITIALIZED) + return x[i]; + } +#endif + + return x[0]; +} + +void kmemcheck_shadow_set(void *shadow, unsigned int size) +{ + uint8_t *x; + unsigned int i; + + x = shadow; + for (i = 0; i < size; ++i) + x[i] = KMEMCHECK_SHADOW_INITIALIZED; +} Index: linux-2.6-tip/arch/x86/mm/kmemcheck/shadow.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/mm/kmemcheck/shadow.h @@ -0,0 +1,16 @@ +#ifndef ARCH__X86__MM__KMEMCHECK__SHADOW_H +#define ARCH__X86__MM__KMEMCHECK__SHADOW_H + +enum kmemcheck_shadow { + KMEMCHECK_SHADOW_UNALLOCATED, + KMEMCHECK_SHADOW_UNINITIALIZED, + KMEMCHECK_SHADOW_INITIALIZED, + KMEMCHECK_SHADOW_FREED, +}; + +void *kmemcheck_shadow_lookup(unsigned long address); + +enum kmemcheck_shadow kmemcheck_shadow_test(void *shadow, unsigned int size); +void kmemcheck_shadow_set(void *shadow, unsigned int size); + +#endif Index: linux-2.6-tip/arch/x86/mm/mmap.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mm/mmap.c +++ linux-2.6-tip/arch/x86/mm/mmap.c @@ -4,7 +4,7 @@ * Based on code by Ingo Molnar and Andi Kleen, copyrighted * as follows: * - * Copyright 2003-2004 Red Hat Inc., Durham, North Carolina. + * Copyright 2003-2009 Red Hat Inc. * All Rights Reserved. * Copyright 2005 Andi Kleen, SUSE Labs. * Copyright 2007 Jiri Kosina, SUSE Labs. Index: linux-2.6-tip/arch/x86/mm/numa_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mm/numa_32.c +++ linux-2.6-tip/arch/x86/mm/numa_32.c @@ -194,7 +194,7 @@ void *alloc_remap(int nid, unsigned long size = ALIGN(size, L1_CACHE_BYTES); if (!allocation || (allocation + size) >= node_remap_end_vaddr[nid]) - return 0; + return NULL; node_remap_alloc_vaddr[nid] += size; memset(allocation, 0, size); Index: linux-2.6-tip/arch/x86/mm/numa_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mm/numa_64.c +++ linux-2.6-tip/arch/x86/mm/numa_64.c @@ -20,6 +20,12 @@ #include #include +#ifdef CONFIG_DEBUG_PER_CPU_MAPS +# define DBG(x...) printk(KERN_DEBUG x) +#else +# define DBG(x...) +#endif + struct pglist_data *node_data[MAX_NUMNODES] __read_mostly; EXPORT_SYMBOL(node_data); @@ -33,6 +39,21 @@ int numa_off __initdata; static unsigned long __initdata nodemap_addr; static unsigned long __initdata nodemap_size; +DEFINE_PER_CPU(int, node_number) = 0; +EXPORT_PER_CPU_SYMBOL(node_number); + +/* + * Map cpu index to node index + */ +DEFINE_EARLY_PER_CPU(int, x86_cpu_to_node_map, NUMA_NO_NODE); +EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_node_map); + +/* + * Which logical CPUs are on which nodes + */ +cpumask_t *node_to_cpumask_map; +EXPORT_SYMBOL(node_to_cpumask_map); + /* * Given a shift value, try to populate memnodemap[] * Returns : @@ -640,3 +661,199 @@ void __init init_cpu_to_node(void) #endif +/* + * Allocate node_to_cpumask_map based on number of available nodes + * Requires node_possible_map to be valid. + * + * Note: node_to_cpumask() is not valid until after this is done. + * (Use CONFIG_DEBUG_PER_CPU_MAPS to check this.) + */ +void __init setup_node_to_cpumask_map(void) +{ + unsigned int node, num = 0; + cpumask_t *map; + + /* setup nr_node_ids if not done yet */ + if (nr_node_ids == MAX_NUMNODES) { + for_each_node_mask(node, node_possible_map) + num = node; + nr_node_ids = num + 1; + } + + /* allocate the map */ + map = alloc_bootmem_low(nr_node_ids * sizeof(cpumask_t)); + DBG("node_to_cpumask_map at %p for %d nodes\n", map, nr_node_ids); + + pr_debug("Node to cpumask map at %p for %d nodes\n", + map, nr_node_ids); + + /* node_to_cpumask() will now work */ + node_to_cpumask_map = map; +} + +void __cpuinit numa_set_node(int cpu, int node) +{ + int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map); + + /* early setting, no percpu area yet */ + if (cpu_to_node_map) { + cpu_to_node_map[cpu] = node; + return; + } + +#ifdef CONFIG_DEBUG_PER_CPU_MAPS + if (cpu >= nr_cpu_ids || !cpu_possible(cpu)) { + printk(KERN_ERR "numa_set_node: invalid cpu# (%d)\n", cpu); + dump_stack(); + return; + } +#endif + per_cpu(x86_cpu_to_node_map, cpu) = node; + + if (node != NUMA_NO_NODE) + per_cpu(node_number, cpu) = node; +} + +void __cpuinit numa_clear_node(int cpu) +{ + numa_set_node(cpu, NUMA_NO_NODE); +} + +#ifndef CONFIG_DEBUG_PER_CPU_MAPS + +void __cpuinit numa_add_cpu(int cpu) +{ + cpu_set(cpu, node_to_cpumask_map[early_cpu_to_node(cpu)]); +} + +void __cpuinit numa_remove_cpu(int cpu) +{ + cpu_clear(cpu, node_to_cpumask_map[early_cpu_to_node(cpu)]); +} + +#else /* CONFIG_DEBUG_PER_CPU_MAPS */ + +/* + * --------- debug versions of the numa functions --------- + */ +static void __cpuinit numa_set_cpumask(int cpu, int enable) +{ + int node = early_cpu_to_node(cpu); + cpumask_t *mask; + char buf[64]; + + if (node_to_cpumask_map == NULL) { + printk(KERN_ERR "node_to_cpumask_map NULL\n"); + dump_stack(); + return; + } + + mask = &node_to_cpumask_map[node]; + if (enable) + cpu_set(cpu, *mask); + else + cpu_clear(cpu, *mask); + + cpulist_scnprintf(buf, sizeof(buf), mask); + printk(KERN_DEBUG "%s cpu %d node %d: mask now %s\n", + enable ? "numa_add_cpu" : "numa_remove_cpu", cpu, node, buf); +} + +void __cpuinit numa_add_cpu(int cpu) +{ + numa_set_cpumask(cpu, 1); +} + +void __cpuinit numa_remove_cpu(int cpu) +{ + numa_set_cpumask(cpu, 0); +} + +int cpu_to_node(int cpu) +{ + if (early_per_cpu_ptr(x86_cpu_to_node_map)) { + printk(KERN_WARNING + "cpu_to_node(%d): usage too early!\n", cpu); + dump_stack(); + return early_per_cpu_ptr(x86_cpu_to_node_map)[cpu]; + } + return per_cpu(x86_cpu_to_node_map, cpu); +} +EXPORT_SYMBOL(cpu_to_node); + +/* + * Same function as cpu_to_node() but used if called before the + * per_cpu areas are setup. + */ +int early_cpu_to_node(int cpu) +{ + if (early_per_cpu_ptr(x86_cpu_to_node_map)) + return early_per_cpu_ptr(x86_cpu_to_node_map)[cpu]; + + if (!cpu_possible(cpu)) { + printk(KERN_WARNING + "early_cpu_to_node(%d): no per_cpu area!\n", cpu); + dump_stack(); + return NUMA_NO_NODE; + } + return per_cpu(x86_cpu_to_node_map, cpu); +} + + +/* empty cpumask */ +static const cpumask_t cpu_mask_none; + +/* + * Returns a pointer to the bitmask of CPUs on Node 'node'. + */ +const cpumask_t *cpumask_of_node(int node) +{ + if (node_to_cpumask_map == NULL) { + printk(KERN_WARNING + "cpumask_of_node(%d): no node_to_cpumask_map!\n", + node); + dump_stack(); + return (const cpumask_t *)&cpu_online_map; + } + if (node >= nr_node_ids) { + printk(KERN_WARNING + "cpumask_of_node(%d): node > nr_node_ids(%d)\n", + node, nr_node_ids); + dump_stack(); + return &cpu_mask_none; + } + return &node_to_cpumask_map[node]; +} +EXPORT_SYMBOL(cpumask_of_node); + +/* + * Returns a bitmask of CPUs on Node 'node'. + * + * Side note: this function creates the returned cpumask on the stack + * so with a high NR_CPUS count, excessive stack space is used. The + * node_to_cpumask_ptr function should be used whenever possible. + */ +cpumask_t node_to_cpumask(int node) +{ + if (node_to_cpumask_map == NULL) { + printk(KERN_WARNING + "node_to_cpumask(%d): no node_to_cpumask_map!\n", node); + dump_stack(); + return cpu_online_map; + } + if (node >= nr_node_ids) { + printk(KERN_WARNING + "node_to_cpumask(%d): node > nr_node_ids(%d)\n", + node, nr_node_ids); + dump_stack(); + return cpu_mask_none; + } + return node_to_cpumask_map[node]; +} +EXPORT_SYMBOL(node_to_cpumask); + +/* + * --------- end of debug versions of the numa functions --------- + */ + +#endif /* CONFIG_DEBUG_PER_CPU_MAPS */ Index: linux-2.6-tip/arch/x86/mm/pageattr.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mm/pageattr.c +++ linux-2.6-tip/arch/x86/mm/pageattr.c @@ -482,6 +482,13 @@ static int split_large_page(pte_t *kpte, pbase = (pte_t *)page_address(base); paravirt_alloc_pte(&init_mm, page_to_pfn(base)); ref_prot = pte_pgprot(pte_clrhuge(*kpte)); + /* + * If we ever want to utilize the PAT bit, we need to + * update this function to make sure it's converted from + * bit 12 to bit 7 when we cross from the 2MB level to + * the 4K level: + */ + WARN_ON_ONCE(pgprot_val(ref_prot) & _PAGE_PAT_LARGE); #ifdef CONFIG_X86_64 if (level == PG_LEVEL_1G) { @@ -801,8 +808,10 @@ static int change_page_attr_set_clr(unsi } } +#if 0 /* Must avoid aliasing mappings in the highmem code */ kmap_flush_unused(); +#endif vm_unmap_aliases(); Index: linux-2.6-tip/arch/x86/mm/pat.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mm/pat.c +++ linux-2.6-tip/arch/x86/mm/pat.c @@ -30,7 +30,7 @@ #ifdef CONFIG_X86_PAT int __read_mostly pat_enabled = 1; -void __cpuinit pat_disable(char *reason) +void __cpuinit pat_disable(const char *reason) { pat_enabled = 0; printk(KERN_INFO "%s\n", reason); @@ -42,6 +42,11 @@ static int __init nopat(char *str) return 0; } early_param("nopat", nopat); +#else +static inline void pat_disable(const char *reason) +{ + (void)reason; +} #endif @@ -78,16 +83,20 @@ void pat_init(void) if (!pat_enabled) return; - /* Paranoia check. */ - if (!cpu_has_pat && boot_pat_state) { - /* - * If this happens we are on a secondary CPU, but - * switched to PAT on the boot CPU. We have no way to - * undo PAT. - */ - printk(KERN_ERR "PAT enabled, " - "but not supported by secondary CPU\n"); - BUG(); + if (!cpu_has_pat) { + if (!boot_pat_state) { + pat_disable("PAT not supported by CPU."); + return; + } else { + /* + * If this happens we are on a secondary CPU, but + * switched to PAT on the boot CPU. We have no way to + * undo PAT. + */ + printk(KERN_ERR "PAT enabled, " + "but not supported by secondary CPU\n"); + BUG(); + } } /* Set PWT to Write-Combining. All other bits stay the same */ Index: linux-2.6-tip/arch/x86/mm/pgtable.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mm/pgtable.c +++ linux-2.6-tip/arch/x86/mm/pgtable.c @@ -4,9 +4,11 @@ #include #include +#define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO + pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address) { - return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); + return (pte_t *)__get_free_page(PGALLOC_GFP); } pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address) @@ -14,9 +16,9 @@ pgtable_t pte_alloc_one(struct mm_struct struct page *pte; #ifdef CONFIG_HIGHPTE - pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO, 0); + pte = alloc_pages(PGALLOC_GFP | __GFP_HIGHMEM, 0); #else - pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0); + pte = alloc_pages(PGALLOC_GFP, 0); #endif if (pte) pgtable_page_ctor(pte); @@ -161,7 +163,7 @@ static int preallocate_pmds(pmd_t *pmds[ bool failed = false; for(i = 0; i < PREALLOCATED_PMDS; i++) { - pmd_t *pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT); + pmd_t *pmd = (pmd_t *)__get_free_page(PGALLOC_GFP); if (pmd == NULL) failed = true; pmds[i] = pmd; @@ -228,7 +230,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) pmd_t *pmds[PREALLOCATED_PMDS]; unsigned long flags; - pgd = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO); + pgd = (pgd_t *)__get_free_page(PGALLOC_GFP); if (pgd == NULL) goto out; Index: linux-2.6-tip/arch/x86/mm/srat_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mm/srat_64.c +++ linux-2.6-tip/arch/x86/mm/srat_64.c @@ -20,7 +20,8 @@ #include #include #include -#include +#include +#include int acpi_numa __initdata; Index: linux-2.6-tip/arch/x86/mm/tlb.c =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/mm/tlb.c @@ -0,0 +1,295 @@ +#include + +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) + = { &init_mm, 0, }; + +/* + * Smarter SMP flushing macros. + * c/o Linus Torvalds. + * + * These mean you can really definitely utterly forget about + * writing to user space from interrupts. (Its not allowed anyway). + * + * Optimizations Manfred Spraul + * + * More scalable flush, from Andi Kleen + * + * To avoid global state use 8 different call vectors. + * Each CPU uses a specific vector to trigger flushes on other + * CPUs. Depending on the received vector the target CPUs look into + * the right array slot for the flush data. + * + * With more than 8 CPUs they are hashed to the 8 available + * vectors. The limited global vector space forces us to this right now. + * In future when interrupts are split into per CPU domains this could be + * fixed, at the cost of triggering multiple IPIs in some cases. + */ + +union smp_flush_state { + struct { + struct mm_struct *flush_mm; + unsigned long flush_va; + DECLARE_BITMAP(flush_cpumask, NR_CPUS); + raw_spinlock_t tlbstate_lock; + }; + char pad[CONFIG_X86_INTERNODE_CACHE_BYTES]; +} ____cacheline_internodealigned_in_smp; + +/* State is put into the per CPU data section, but padded + to a full cache line because other CPUs can access it and we don't + want false sharing in the per cpu data segment. */ +static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS]; + +/* + * We cannot call mmdrop() because we are in interrupt context, + * instead update mm->cpu_vm_mask. + */ +void leave_mm(int cpu) +{ + if (percpu_read(cpu_tlbstate.state) == TLBSTATE_OK) + BUG(); + cpu_clear(cpu, percpu_read(cpu_tlbstate.active_mm)->cpu_vm_mask); + load_cr3(swapper_pg_dir); +} +EXPORT_SYMBOL_GPL(leave_mm); + +/* + * + * The flush IPI assumes that a thread switch happens in this order: + * [cpu0: the cpu that switches] + * 1) switch_mm() either 1a) or 1b) + * 1a) thread switch to a different mm + * 1a1) cpu_clear(cpu, old_mm->cpu_vm_mask); + * Stop ipi delivery for the old mm. This is not synchronized with + * the other cpus, but smp_invalidate_interrupt ignore flush ipis + * for the wrong mm, and in the worst case we perform a superfluous + * tlb flush. + * 1a2) set cpu mmu_state to TLBSTATE_OK + * Now the smp_invalidate_interrupt won't call leave_mm if cpu0 + * was in lazy tlb mode. + * 1a3) update cpu active_mm + * Now cpu0 accepts tlb flushes for the new mm. + * 1a4) cpu_set(cpu, new_mm->cpu_vm_mask); + * Now the other cpus will send tlb flush ipis. + * 1a4) change cr3. + * 1b) thread switch without mm change + * cpu active_mm is correct, cpu0 already handles + * flush ipis. + * 1b1) set cpu mmu_state to TLBSTATE_OK + * 1b2) test_and_set the cpu bit in cpu_vm_mask. + * Atomically set the bit [other cpus will start sending flush ipis], + * and test the bit. + * 1b3) if the bit was 0: leave_mm was called, flush the tlb. + * 2) switch %%esp, ie current + * + * The interrupt must handle 2 special cases: + * - cr3 is changed before %%esp, ie. it cannot use current->{active_,}mm. + * - the cpu performs speculative tlb reads, i.e. even if the cpu only + * runs in kernel space, the cpu could load tlb entries for user space + * pages. + * + * The good news is that cpu mmu_state is local to each cpu, no + * write/read ordering problems. + */ + +/* + * TLB flush IPI: + * + * 1) Flush the tlb entries if the cpu uses the mm that's being flushed. + * 2) Leave the mm if we are in the lazy tlb mode. + * + * Interrupts are disabled. + */ + +/* + * FIXME: use of asmlinkage is not consistent. On x86_64 it's noop + * but still used for documentation purpose but the usage is slightly + * inconsistent. On x86_32, asmlinkage is regparm(0) but interrupt + * entry calls in with the first parameter in %eax. Maybe define + * intrlinkage? + */ +#ifdef CONFIG_X86_64 +asmlinkage +#endif +void smp_invalidate_interrupt(struct pt_regs *regs) +{ + unsigned int cpu; + unsigned int sender; + union smp_flush_state *f; + + cpu = smp_processor_id(); + /* + * orig_rax contains the negated interrupt vector. + * Use that to determine where the sender put the data. + */ + sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START; + f = &flush_state[sender]; + + if (!cpumask_test_cpu(cpu, to_cpumask(f->flush_cpumask))) + goto out; + /* + * This was a BUG() but until someone can quote me the + * line from the intel manual that guarantees an IPI to + * multiple CPUs is retried _only_ on the erroring CPUs + * its staying as a return + * + * BUG(); + */ + + if (f->flush_mm == percpu_read(cpu_tlbstate.active_mm)) { + if (percpu_read(cpu_tlbstate.state) == TLBSTATE_OK) { + if (f->flush_va == TLB_FLUSH_ALL) + local_flush_tlb(); + else + __flush_tlb_one(f->flush_va); + } else + leave_mm(cpu); + } +out: + ack_APIC_irq(); + smp_mb__before_clear_bit(); + cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask)); + smp_mb__after_clear_bit(); + inc_irq_stat(irq_tlb_count); +} + +static void flush_tlb_others_ipi(const struct cpumask *cpumask, + struct mm_struct *mm, unsigned long va) +{ + unsigned int sender; + union smp_flush_state *f; + + /* Caller has disabled preemption */ + sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS; + f = &flush_state[sender]; + + /* + * Could avoid this lock when + * num_online_cpus() <= NUM_INVALIDATE_TLB_VECTORS, but it is + * probably not worth checking this for a cache-hot lock. + */ + spin_lock(&f->tlbstate_lock); + + f->flush_mm = mm; + f->flush_va = va; + cpumask_andnot(to_cpumask(f->flush_cpumask), + cpumask, cpumask_of(smp_processor_id())); + + /* + * Make the above memory operations globally visible before + * sending the IPI. + */ + smp_mb(); + /* + * We have to send the IPI only to + * CPUs affected. + */ + apic->send_IPI_mask(to_cpumask(f->flush_cpumask), + INVALIDATE_TLB_VECTOR_START + sender); + + while (!cpumask_empty(to_cpumask(f->flush_cpumask))) + cpu_relax(); + + f->flush_mm = NULL; + f->flush_va = 0; + spin_unlock(&f->tlbstate_lock); +} + +void native_flush_tlb_others(const struct cpumask *cpumask, + struct mm_struct *mm, unsigned long va) +{ + if (is_uv_system()) { + unsigned int cpu; + + cpu = get_cpu(); + cpumask = uv_flush_tlb_others(cpumask, mm, va, cpu); + if (cpumask) + flush_tlb_others_ipi(cpumask, mm, va); + put_cpu(); + return; + } + flush_tlb_others_ipi(cpumask, mm, va); +} + +static int __cpuinit init_smp_flush(void) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(flush_state); i++) + spin_lock_init(&flush_state[i].tlbstate_lock); + + return 0; +} +core_initcall(init_smp_flush); + +void flush_tlb_current_task(void) +{ + struct mm_struct *mm = current->mm; + + preempt_disable(); + + local_flush_tlb(); + if (cpumask_any_but(&mm->cpu_vm_mask, smp_processor_id()) < nr_cpu_ids) + flush_tlb_others(&mm->cpu_vm_mask, mm, TLB_FLUSH_ALL); + preempt_enable(); +} + +void flush_tlb_mm(struct mm_struct *mm) +{ + preempt_disable(); + + if (current->active_mm == mm) { + if (current->mm) + local_flush_tlb(); + else + leave_mm(smp_processor_id()); + } + if (cpumask_any_but(&mm->cpu_vm_mask, smp_processor_id()) < nr_cpu_ids) + flush_tlb_others(&mm->cpu_vm_mask, mm, TLB_FLUSH_ALL); + + preempt_enable(); +} + +void flush_tlb_page(struct vm_area_struct *vma, unsigned long va) +{ + struct mm_struct *mm = vma->vm_mm; + + preempt_disable(); + + if (current->active_mm == mm) { + if (current->mm) + __flush_tlb_one(va); + else + leave_mm(smp_processor_id()); + } + + if (cpumask_any_but(&mm->cpu_vm_mask, smp_processor_id()) < nr_cpu_ids) + flush_tlb_others(&mm->cpu_vm_mask, mm, va); + + preempt_enable(); +} + +static void do_flush_tlb_all(void *info) +{ + unsigned long cpu = smp_processor_id(); + + __flush_tlb_all(); + if (percpu_read(cpu_tlbstate.state) == TLBSTATE_LAZY) + leave_mm(cpu); +} + +void flush_tlb_all(void) +{ + on_each_cpu(do_flush_tlb_all, NULL, 1); +} Index: linux-2.6-tip/arch/x86/oprofile/nmi_int.c =================================================================== --- linux-2.6-tip.orig/arch/x86/oprofile/nmi_int.c +++ linux-2.6-tip/arch/x86/oprofile/nmi_int.c @@ -40,8 +40,9 @@ static int profile_exceptions_notify(str switch (val) { case DIE_NMI: - if (model->check_ctrs(args->regs, &per_cpu(cpu_msrs, cpu))) - ret = NOTIFY_STOP; + case DIE_NMI_IPI: + model->check_ctrs(args->regs, &per_cpu(cpu_msrs, cpu)); + ret = NOTIFY_STOP; break; default: break; @@ -134,7 +135,7 @@ static void nmi_cpu_setup(void *dummy) static struct notifier_block profile_exceptions_nb = { .notifier_call = profile_exceptions_notify, .next = NULL, - .priority = 0 + .priority = 2 }; static int nmi_setup(void) Index: linux-2.6-tip/arch/x86/oprofile/op_model_ppro.c =================================================================== --- linux-2.6-tip.orig/arch/x86/oprofile/op_model_ppro.c +++ linux-2.6-tip/arch/x86/oprofile/op_model_ppro.c @@ -18,7 +18,7 @@ #include #include #include -#include +#include #include "op_x86_model.h" #include "op_counter.h" @@ -126,6 +126,13 @@ static int ppro_check_ctrs(struct pt_reg u64 val; int i; + /* + * This can happen if perf counters are in use when + * we steal the die notifier NMI. + */ + if (unlikely(!reset_value)) + goto out; + for (i = 0 ; i < num_counters; ++i) { if (!reset_value[i]) continue; @@ -136,6 +143,7 @@ static int ppro_check_ctrs(struct pt_reg } } +out: /* Only P6 based Pentium M need to re-unmask the apic vector but it * doesn't hurt other P6 variant */ apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & ~APIC_LVT_MASKED); Index: linux-2.6-tip/arch/x86/pci/amd_bus.c =================================================================== --- linux-2.6-tip.orig/arch/x86/pci/amd_bus.c +++ linux-2.6-tip/arch/x86/pci/amd_bus.c @@ -277,8 +277,8 @@ static int __init early_fill_mp_bus_info { int i; int j; - unsigned bus; - unsigned slot; + unsigned uninitialized_var(bus); + unsigned uninitialized_var(slot); int found; int node; int link; Index: linux-2.6-tip/arch/x86/pci/numaq_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/pci/numaq_32.c +++ linux-2.6-tip/arch/x86/pci/numaq_32.c @@ -5,7 +5,7 @@ #include #include #include -#include +#include #include #include @@ -18,10 +18,6 @@ #define QUADLOCAL2BUS(quad,local) (quad_local_to_mp_bus_id[quad][local]) -/* Where the IO area was mapped on multiquad, always 0 otherwise */ -void *xquad_portio; -EXPORT_SYMBOL(xquad_portio); - #define XQUAD_PORT_ADDR(port, quad) (xquad_portio + (XQUAD_PORTIO_QUAD*quad) + port) #define PCI_CONF1_MQ_ADDRESS(bus, devfn, reg) \ Index: linux-2.6-tip/arch/x86/pci/pcbios.c =================================================================== --- linux-2.6-tip.orig/arch/x86/pci/pcbios.c +++ linux-2.6-tip/arch/x86/pci/pcbios.c @@ -7,7 +7,7 @@ #include #include #include -#include +#include /* BIOS32 signature: "_32_" */ #define BIOS32_SIGNATURE (('_' << 0) + ('3' << 8) + ('2' << 16) + ('_' << 24)) Index: linux-2.6-tip/arch/x86/power/hibernate_asm_32.S =================================================================== --- linux-2.6-tip.orig/arch/x86/power/hibernate_asm_32.S +++ linux-2.6-tip/arch/x86/power/hibernate_asm_32.S @@ -8,7 +8,7 @@ #include #include -#include +#include #include #include Index: linux-2.6-tip/arch/x86/power/hibernate_asm_64.S =================================================================== --- linux-2.6-tip.orig/arch/x86/power/hibernate_asm_64.S +++ linux-2.6-tip/arch/x86/power/hibernate_asm_64.S @@ -18,7 +18,7 @@ .text #include #include -#include +#include #include #include Index: linux-2.6-tip/arch/x86/vdso/Makefile =================================================================== --- linux-2.6-tip.orig/arch/x86/vdso/Makefile +++ linux-2.6-tip/arch/x86/vdso/Makefile @@ -38,7 +38,7 @@ $(obj)/%.so: $(obj)/%.so.dbg FORCE $(call if_changed,objcopy) CFL := $(PROFILING) -mcmodel=small -fPIC -O2 -fasynchronous-unwind-tables -m64 \ - $(filter -g%,$(KBUILD_CFLAGS)) + $(filter -g%,$(KBUILD_CFLAGS)) $(call cc-option, -fno-stack-protector) $(vobjs): KBUILD_CFLAGS += $(CFL) Index: linux-2.6-tip/arch/x86/vdso/vma.c =================================================================== --- linux-2.6-tip.orig/arch/x86/vdso/vma.c +++ linux-2.6-tip/arch/x86/vdso/vma.c @@ -85,8 +85,8 @@ static unsigned long vdso_addr(unsigned unsigned long addr, end; unsigned offset; end = (start + PMD_SIZE - 1) & PMD_MASK; - if (end >= TASK_SIZE64) - end = TASK_SIZE64; + if (end >= TASK_SIZE_MAX) + end = TASK_SIZE_MAX; end -= len; /* This loses some more bits than a modulo, but is cheaper */ offset = get_random_int() & (PTRS_PER_PTE - 1); Index: linux-2.6-tip/arch/x86/xen/Kconfig =================================================================== --- linux-2.6-tip.orig/arch/x86/xen/Kconfig +++ linux-2.6-tip/arch/x86/xen/Kconfig @@ -6,7 +6,7 @@ config XEN bool "Xen guest support" select PARAVIRT select PARAVIRT_CLOCK - depends on X86_64 || (X86_32 && X86_PAE && !(X86_VISWS || X86_VOYAGER)) + depends on X86_64 || (X86_32 && X86_PAE && !X86_VISWS) depends on X86_CMPXCHG && X86_TSC help This is the Linux Xen port. Enabling this will allow the Index: linux-2.6-tip/arch/x86/xen/Makefile =================================================================== --- linux-2.6-tip.orig/arch/x86/xen/Makefile +++ linux-2.6-tip/arch/x86/xen/Makefile @@ -6,7 +6,8 @@ CFLAGS_REMOVE_irq.o = -pg endif obj-y := enlighten.o setup.o multicalls.o mmu.o irq.o \ - time.o xen-asm_$(BITS).o grant-table.o suspend.o + time.o xen-asm.o xen-asm_$(BITS).o \ + grant-table.o suspend.o obj-$(CONFIG_SMP) += smp.o spinlock.o obj-$(CONFIG_XEN_DEBUG_FS) += debugfs.o \ No newline at end of file Index: linux-2.6-tip/arch/x86/xen/enlighten.c =================================================================== --- linux-2.6-tip.orig/arch/x86/xen/enlighten.c +++ linux-2.6-tip/arch/x86/xen/enlighten.c @@ -61,40 +61,13 @@ DEFINE_PER_CPU(struct vcpu_info, xen_vcp enum xen_domain_type xen_domain_type = XEN_NATIVE; EXPORT_SYMBOL_GPL(xen_domain_type); -/* - * Identity map, in addition to plain kernel map. This needs to be - * large enough to allocate page table pages to allocate the rest. - * Each page can map 2MB. - */ -static pte_t level1_ident_pgt[PTRS_PER_PTE * 4] __page_aligned_bss; - -#ifdef CONFIG_X86_64 -/* l3 pud for userspace vsyscall mapping */ -static pud_t level3_user_vsyscall[PTRS_PER_PUD] __page_aligned_bss; -#endif /* CONFIG_X86_64 */ - -/* - * Note about cr3 (pagetable base) values: - * - * xen_cr3 contains the current logical cr3 value; it contains the - * last set cr3. This may not be the current effective cr3, because - * its update may be being lazily deferred. However, a vcpu looking - * at its own cr3 can use this value knowing that it everything will - * be self-consistent. - * - * xen_current_cr3 contains the actual vcpu cr3; it is set once the - * hypercall to set the vcpu cr3 is complete (so it may be a little - * out of date, but it will never be set early). If one vcpu is - * looking at another vcpu's cr3 value, it should use this variable. - */ -DEFINE_PER_CPU(unsigned long, xen_cr3); /* cr3 stored as physaddr */ -DEFINE_PER_CPU(unsigned long, xen_current_cr3); /* actual vcpu cr3 */ - struct start_info *xen_start_info; EXPORT_SYMBOL_GPL(xen_start_info); struct shared_info xen_dummy_shared_info; +void *xen_initial_gdt; + /* * Point at some empty memory to start with. We map the real shared_info * page as soon as fixmap is up and running. @@ -114,14 +87,7 @@ struct shared_info *HYPERVISOR_shared_in * * 0: not available, 1: available */ -static int have_vcpu_info_placement = -#ifdef CONFIG_X86_32 - 1 -#else - 0 -#endif - ; - +static int have_vcpu_info_placement = 1; static void xen_vcpu_setup(int cpu) { @@ -237,7 +203,7 @@ static unsigned long xen_get_debugreg(in return HYPERVISOR_get_debugreg(reg); } -static void xen_leave_lazy(void) +void xen_leave_lazy(void) { paravirt_leave_lazy(paravirt_get_lazy_mode()); xen_mc_flush(); @@ -357,13 +323,14 @@ static void load_TLS_descriptor(struct t static void xen_load_tls(struct thread_struct *t, unsigned int cpu) { /* - * XXX sleazy hack: If we're being called in a lazy-cpu zone, - * it means we're in a context switch, and %gs has just been - * saved. This means we can zero it out to prevent faults on - * exit from the hypervisor if the next process has no %gs. - * Either way, it has been saved, and the new value will get - * loaded properly. This will go away as soon as Xen has been - * modified to not save/restore %gs for normal hypercalls. + * XXX sleazy hack: If we're being called in a lazy-cpu zone + * and lazy gs handling is enabled, it means we're in a + * context switch, and %gs has just been saved. This means we + * can zero it out to prevent faults on exit from the + * hypervisor if the next process has no %gs. Either way, it + * has been saved, and the new value will get loaded properly. + * This will go away as soon as Xen has been modified to not + * save/restore %gs for normal hypercalls. * * On x86_64, this hack is not used for %gs, because gs points * to KERNEL_GS_BASE (and uses it for PDA references), so we @@ -375,7 +342,7 @@ static void xen_load_tls(struct thread_s */ if (paravirt_get_lazy_mode() == PARAVIRT_LAZY_CPU) { #ifdef CONFIG_X86_32 - loadsegment(gs, 0); + lazy_load_gs(0); #else loadsegment(fs, 0); #endif @@ -587,94 +554,18 @@ static u32 xen_safe_apic_wait_icr_idle(v return 0; } -static struct apic_ops xen_basic_apic_ops = { - .read = xen_apic_read, - .write = xen_apic_write, - .icr_read = xen_apic_icr_read, - .icr_write = xen_apic_icr_write, - .wait_icr_idle = xen_apic_wait_icr_idle, - .safe_wait_icr_idle = xen_safe_apic_wait_icr_idle, -}; - -#endif - -static void xen_flush_tlb(void) +static void set_xen_basic_apic_ops(void) { - struct mmuext_op *op; - struct multicall_space mcs; - - preempt_disable(); - - mcs = xen_mc_entry(sizeof(*op)); - - op = mcs.args; - op->cmd = MMUEXT_TLB_FLUSH_LOCAL; - MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF); - - xen_mc_issue(PARAVIRT_LAZY_MMU); - - preempt_enable(); -} - -static void xen_flush_tlb_single(unsigned long addr) -{ - struct mmuext_op *op; - struct multicall_space mcs; - - preempt_disable(); - - mcs = xen_mc_entry(sizeof(*op)); - op = mcs.args; - op->cmd = MMUEXT_INVLPG_LOCAL; - op->arg1.linear_addr = addr & PAGE_MASK; - MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF); - - xen_mc_issue(PARAVIRT_LAZY_MMU); - - preempt_enable(); + apic->read = xen_apic_read; + apic->write = xen_apic_write; + apic->icr_read = xen_apic_icr_read; + apic->icr_write = xen_apic_icr_write; + apic->wait_icr_idle = xen_apic_wait_icr_idle; + apic->safe_wait_icr_idle = xen_safe_apic_wait_icr_idle; } -static void xen_flush_tlb_others(const cpumask_t *cpus, struct mm_struct *mm, - unsigned long va) -{ - struct { - struct mmuext_op op; - cpumask_t mask; - } *args; - cpumask_t cpumask = *cpus; - struct multicall_space mcs; - - /* - * A couple of (to be removed) sanity checks: - * - * - current CPU must not be in mask - * - mask must exist :) - */ - BUG_ON(cpus_empty(cpumask)); - BUG_ON(cpu_isset(smp_processor_id(), cpumask)); - BUG_ON(!mm); - - /* If a CPU which we ran on has gone down, OK. */ - cpus_and(cpumask, cpumask, cpu_online_map); - if (cpus_empty(cpumask)) - return; - - mcs = xen_mc_entry(sizeof(*args)); - args = mcs.args; - args->mask = cpumask; - args->op.arg2.vcpumask = &args->mask; - - if (va == TLB_FLUSH_ALL) { - args->op.cmd = MMUEXT_TLB_FLUSH_MULTI; - } else { - args->op.cmd = MMUEXT_INVLPG_MULTI; - args->op.arg1.linear_addr = va; - } - - MULTI_mmuext_op(mcs.mc, &args->op, 1, NULL, DOMID_SELF); +#endif - xen_mc_issue(PARAVIRT_LAZY_MMU); -} static void xen_clts(void) { @@ -700,21 +591,6 @@ static void xen_write_cr0(unsigned long xen_mc_issue(PARAVIRT_LAZY_CPU); } -static void xen_write_cr2(unsigned long cr2) -{ - x86_read_percpu(xen_vcpu)->arch.cr2 = cr2; -} - -static unsigned long xen_read_cr2(void) -{ - return x86_read_percpu(xen_vcpu)->arch.cr2; -} - -static unsigned long xen_read_cr2_direct(void) -{ - return x86_read_percpu(xen_vcpu_info.arch.cr2); -} - static void xen_write_cr4(unsigned long cr4) { cr4 &= ~X86_CR4_PGE; @@ -723,71 +599,6 @@ static void xen_write_cr4(unsigned long native_write_cr4(cr4); } -static unsigned long xen_read_cr3(void) -{ - return x86_read_percpu(xen_cr3); -} - -static void set_current_cr3(void *v) -{ - x86_write_percpu(xen_current_cr3, (unsigned long)v); -} - -static void __xen_write_cr3(bool kernel, unsigned long cr3) -{ - struct mmuext_op *op; - struct multicall_space mcs; - unsigned long mfn; - - if (cr3) - mfn = pfn_to_mfn(PFN_DOWN(cr3)); - else - mfn = 0; - - WARN_ON(mfn == 0 && kernel); - - mcs = __xen_mc_entry(sizeof(*op)); - - op = mcs.args; - op->cmd = kernel ? MMUEXT_NEW_BASEPTR : MMUEXT_NEW_USER_BASEPTR; - op->arg1.mfn = mfn; - - MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF); - - if (kernel) { - x86_write_percpu(xen_cr3, cr3); - - /* Update xen_current_cr3 once the batch has actually - been submitted. */ - xen_mc_callback(set_current_cr3, (void *)cr3); - } -} - -static void xen_write_cr3(unsigned long cr3) -{ - BUG_ON(preemptible()); - - xen_mc_batch(); /* disables interrupts */ - - /* Update while interrupts are disabled, so its atomic with - respect to ipis */ - x86_write_percpu(xen_cr3, cr3); - - __xen_write_cr3(true, cr3); - -#ifdef CONFIG_X86_64 - { - pgd_t *user_pgd = xen_get_user_pgd(__va(cr3)); - if (user_pgd) - __xen_write_cr3(false, __pa(user_pgd)); - else - __xen_write_cr3(false, 0); - } -#endif - - xen_mc_issue(PARAVIRT_LAZY_CPU); /* interrupts restored */ -} - static int xen_write_msr_safe(unsigned int msr, unsigned low, unsigned high) { int ret; @@ -829,185 +640,6 @@ static int xen_write_msr_safe(unsigned i return ret; } -/* Early in boot, while setting up the initial pagetable, assume - everything is pinned. */ -static __init void xen_alloc_pte_init(struct mm_struct *mm, unsigned long pfn) -{ -#ifdef CONFIG_FLATMEM - BUG_ON(mem_map); /* should only be used early */ -#endif - make_lowmem_page_readonly(__va(PFN_PHYS(pfn))); -} - -/* Early release_pte assumes that all pts are pinned, since there's - only init_mm and anything attached to that is pinned. */ -static void xen_release_pte_init(unsigned long pfn) -{ - make_lowmem_page_readwrite(__va(PFN_PHYS(pfn))); -} - -static void pin_pagetable_pfn(unsigned cmd, unsigned long pfn) -{ - struct mmuext_op op; - op.cmd = cmd; - op.arg1.mfn = pfn_to_mfn(pfn); - if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF)) - BUG(); -} - -/* This needs to make sure the new pte page is pinned iff its being - attached to a pinned pagetable. */ -static void xen_alloc_ptpage(struct mm_struct *mm, unsigned long pfn, unsigned level) -{ - struct page *page = pfn_to_page(pfn); - - if (PagePinned(virt_to_page(mm->pgd))) { - SetPagePinned(page); - - vm_unmap_aliases(); - if (!PageHighMem(page)) { - make_lowmem_page_readonly(__va(PFN_PHYS((unsigned long)pfn))); - if (level == PT_PTE && USE_SPLIT_PTLOCKS) - pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE, pfn); - } else { - /* make sure there are no stray mappings of - this page */ - kmap_flush_unused(); - } - } -} - -static void xen_alloc_pte(struct mm_struct *mm, unsigned long pfn) -{ - xen_alloc_ptpage(mm, pfn, PT_PTE); -} - -static void xen_alloc_pmd(struct mm_struct *mm, unsigned long pfn) -{ - xen_alloc_ptpage(mm, pfn, PT_PMD); -} - -static int xen_pgd_alloc(struct mm_struct *mm) -{ - pgd_t *pgd = mm->pgd; - int ret = 0; - - BUG_ON(PagePinned(virt_to_page(pgd))); - -#ifdef CONFIG_X86_64 - { - struct page *page = virt_to_page(pgd); - pgd_t *user_pgd; - - BUG_ON(page->private != 0); - - ret = -ENOMEM; - - user_pgd = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO); - page->private = (unsigned long)user_pgd; - - if (user_pgd != NULL) { - user_pgd[pgd_index(VSYSCALL_START)] = - __pgd(__pa(level3_user_vsyscall) | _PAGE_TABLE); - ret = 0; - } - - BUG_ON(PagePinned(virt_to_page(xen_get_user_pgd(pgd)))); - } -#endif - - return ret; -} - -static void xen_pgd_free(struct mm_struct *mm, pgd_t *pgd) -{ -#ifdef CONFIG_X86_64 - pgd_t *user_pgd = xen_get_user_pgd(pgd); - - if (user_pgd) - free_page((unsigned long)user_pgd); -#endif -} - -/* This should never happen until we're OK to use struct page */ -static void xen_release_ptpage(unsigned long pfn, unsigned level) -{ - struct page *page = pfn_to_page(pfn); - - if (PagePinned(page)) { - if (!PageHighMem(page)) { - if (level == PT_PTE && USE_SPLIT_PTLOCKS) - pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, pfn); - make_lowmem_page_readwrite(__va(PFN_PHYS(pfn))); - } - ClearPagePinned(page); - } -} - -static void xen_release_pte(unsigned long pfn) -{ - xen_release_ptpage(pfn, PT_PTE); -} - -static void xen_release_pmd(unsigned long pfn) -{ - xen_release_ptpage(pfn, PT_PMD); -} - -#if PAGETABLE_LEVELS == 4 -static void xen_alloc_pud(struct mm_struct *mm, unsigned long pfn) -{ - xen_alloc_ptpage(mm, pfn, PT_PUD); -} - -static void xen_release_pud(unsigned long pfn) -{ - xen_release_ptpage(pfn, PT_PUD); -} -#endif - -#ifdef CONFIG_HIGHPTE -static void *xen_kmap_atomic_pte(struct page *page, enum km_type type) -{ - pgprot_t prot = PAGE_KERNEL; - - if (PagePinned(page)) - prot = PAGE_KERNEL_RO; - - if (0 && PageHighMem(page)) - printk("mapping highpte %lx type %d prot %s\n", - page_to_pfn(page), type, - (unsigned long)pgprot_val(prot) & _PAGE_RW ? "WRITE" : "READ"); - - return kmap_atomic_prot(page, type, prot); -} -#endif - -#ifdef CONFIG_X86_32 -static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte) -{ - /* If there's an existing pte, then don't allow _PAGE_RW to be set */ - if (pte_val_ma(*ptep) & _PAGE_PRESENT) - pte = __pte_ma(((pte_val_ma(*ptep) & _PAGE_RW) | ~_PAGE_RW) & - pte_val_ma(pte)); - - return pte; -} - -/* Init-time set_pte while constructing initial pagetables, which - doesn't allow RO pagetable pages to be remapped RW */ -static __init void xen_set_pte_init(pte_t *ptep, pte_t pte) -{ - pte = mask_rw_pte(ptep, pte); - - xen_set_pte(ptep, pte); -} -#endif - -static __init void xen_pagetable_setup_start(pgd_t *base) -{ -} - void xen_setup_shared_info(void) { if (!xen_feature(XENFEAT_auto_translated_physmap)) { @@ -1028,37 +660,6 @@ void xen_setup_shared_info(void) xen_setup_mfn_list_list(); } -static __init void xen_pagetable_setup_done(pgd_t *base) -{ - xen_setup_shared_info(); -} - -static __init void xen_post_allocator_init(void) -{ - pv_mmu_ops.set_pte = xen_set_pte; - pv_mmu_ops.set_pmd = xen_set_pmd; - pv_mmu_ops.set_pud = xen_set_pud; -#if PAGETABLE_LEVELS == 4 - pv_mmu_ops.set_pgd = xen_set_pgd; -#endif - - /* This will work as long as patching hasn't happened yet - (which it hasn't) */ - pv_mmu_ops.alloc_pte = xen_alloc_pte; - pv_mmu_ops.alloc_pmd = xen_alloc_pmd; - pv_mmu_ops.release_pte = xen_release_pte; - pv_mmu_ops.release_pmd = xen_release_pmd; -#if PAGETABLE_LEVELS == 4 - pv_mmu_ops.alloc_pud = xen_alloc_pud; - pv_mmu_ops.release_pud = xen_release_pud; -#endif - -#ifdef CONFIG_X86_64 - SetPagePinned(virt_to_page(level3_user_vsyscall)); -#endif - xen_mark_init_mm_pinned(); -} - /* This is called once we have the cpu_possible_map */ void xen_setup_vcpu_info_placement(void) { @@ -1072,10 +673,10 @@ void xen_setup_vcpu_info_placement(void) if (have_vcpu_info_placement) { printk(KERN_INFO "Xen: using vcpu_info placement\n"); - pv_irq_ops.save_fl = xen_save_fl_direct; - pv_irq_ops.restore_fl = xen_restore_fl_direct; - pv_irq_ops.irq_disable = xen_irq_disable_direct; - pv_irq_ops.irq_enable = xen_irq_enable_direct; + pv_irq_ops.save_fl = __PV_IS_CALLEE_SAVE(xen_save_fl_direct); + pv_irq_ops.restore_fl = __PV_IS_CALLEE_SAVE(xen_restore_fl_direct); + pv_irq_ops.irq_disable = __PV_IS_CALLEE_SAVE(xen_irq_disable_direct); + pv_irq_ops.irq_enable = __PV_IS_CALLEE_SAVE(xen_irq_enable_direct); pv_mmu_ops.read_cr2 = xen_read_cr2_direct; } } @@ -1133,49 +734,6 @@ static unsigned xen_patch(u8 type, u16 c return ret; } -static void xen_set_fixmap(unsigned idx, unsigned long phys, pgprot_t prot) -{ - pte_t pte; - - phys >>= PAGE_SHIFT; - - switch (idx) { - case FIX_BTMAP_END ... FIX_BTMAP_BEGIN: -#ifdef CONFIG_X86_F00F_BUG - case FIX_F00F_IDT: -#endif -#ifdef CONFIG_X86_32 - case FIX_WP_TEST: - case FIX_VDSO: -# ifdef CONFIG_HIGHMEM - case FIX_KMAP_BEGIN ... FIX_KMAP_END: -# endif -#else - case VSYSCALL_LAST_PAGE ... VSYSCALL_FIRST_PAGE: -#endif -#ifdef CONFIG_X86_LOCAL_APIC - case FIX_APIC_BASE: /* maps dummy local APIC */ -#endif - pte = pfn_pte(phys, prot); - break; - - default: - pte = mfn_pte(phys, prot); - break; - } - - __native_set_fixmap(idx, pte); - -#ifdef CONFIG_X86_64 - /* Replicate changes to map the vsyscall page into the user - pagetable vsyscall mapping. */ - if (idx >= VSYSCALL_LAST_PAGE && idx <= VSYSCALL_FIRST_PAGE) { - unsigned long vaddr = __fix_to_virt(idx); - set_pte_vaddr_pud(level3_user_vsyscall, vaddr, pte); - } -#endif -} - static const struct pv_info xen_info __initdata = { .paravirt_enabled = 1, .shared_kernel_pmd = 0, @@ -1271,87 +829,6 @@ static const struct pv_apic_ops xen_apic #endif }; -static const struct pv_mmu_ops xen_mmu_ops __initdata = { - .pagetable_setup_start = xen_pagetable_setup_start, - .pagetable_setup_done = xen_pagetable_setup_done, - - .read_cr2 = xen_read_cr2, - .write_cr2 = xen_write_cr2, - - .read_cr3 = xen_read_cr3, - .write_cr3 = xen_write_cr3, - - .flush_tlb_user = xen_flush_tlb, - .flush_tlb_kernel = xen_flush_tlb, - .flush_tlb_single = xen_flush_tlb_single, - .flush_tlb_others = xen_flush_tlb_others, - - .pte_update = paravirt_nop, - .pte_update_defer = paravirt_nop, - - .pgd_alloc = xen_pgd_alloc, - .pgd_free = xen_pgd_free, - - .alloc_pte = xen_alloc_pte_init, - .release_pte = xen_release_pte_init, - .alloc_pmd = xen_alloc_pte_init, - .alloc_pmd_clone = paravirt_nop, - .release_pmd = xen_release_pte_init, - -#ifdef CONFIG_HIGHPTE - .kmap_atomic_pte = xen_kmap_atomic_pte, -#endif - -#ifdef CONFIG_X86_64 - .set_pte = xen_set_pte, -#else - .set_pte = xen_set_pte_init, -#endif - .set_pte_at = xen_set_pte_at, - .set_pmd = xen_set_pmd_hyper, - - .ptep_modify_prot_start = __ptep_modify_prot_start, - .ptep_modify_prot_commit = __ptep_modify_prot_commit, - - .pte_val = xen_pte_val, - .pte_flags = native_pte_flags, - .pgd_val = xen_pgd_val, - - .make_pte = xen_make_pte, - .make_pgd = xen_make_pgd, - -#ifdef CONFIG_X86_PAE - .set_pte_atomic = xen_set_pte_atomic, - .set_pte_present = xen_set_pte_at, - .pte_clear = xen_pte_clear, - .pmd_clear = xen_pmd_clear, -#endif /* CONFIG_X86_PAE */ - .set_pud = xen_set_pud_hyper, - - .make_pmd = xen_make_pmd, - .pmd_val = xen_pmd_val, - -#if PAGETABLE_LEVELS == 4 - .pud_val = xen_pud_val, - .make_pud = xen_make_pud, - .set_pgd = xen_set_pgd_hyper, - - .alloc_pud = xen_alloc_pte_init, - .release_pud = xen_release_pte_init, -#endif /* PAGETABLE_LEVELS == 4 */ - - .activate_mm = xen_activate_mm, - .dup_mmap = xen_dup_mmap, - .exit_mmap = xen_exit_mmap, - - .lazy_mode = { - .enter = paravirt_enter_lazy_mmu, - .leave = xen_leave_lazy, - }, - - .set_fixmap = xen_set_fixmap, -}; - static void xen_reboot(int reason) { struct sched_shutdown r = { .reason = reason }; @@ -1394,223 +871,6 @@ static const struct machine_ops __initda }; -static void __init xen_reserve_top(void) -{ -#ifdef CONFIG_X86_32 - unsigned long top = HYPERVISOR_VIRT_START; - struct xen_platform_parameters pp; - - if (HYPERVISOR_xen_version(XENVER_platform_parameters, &pp) == 0) - top = pp.virt_start; - - reserve_top_address(-top); -#endif /* CONFIG_X86_32 */ -} - -/* - * Like __va(), but returns address in the kernel mapping (which is - * all we have until the physical memory mapping has been set up. - */ -static void *__ka(phys_addr_t paddr) -{ -#ifdef CONFIG_X86_64 - return (void *)(paddr + __START_KERNEL_map); -#else - return __va(paddr); -#endif -} - -/* Convert a machine address to physical address */ -static unsigned long m2p(phys_addr_t maddr) -{ - phys_addr_t paddr; - - maddr &= PTE_PFN_MASK; - paddr = mfn_to_pfn(maddr >> PAGE_SHIFT) << PAGE_SHIFT; - - return paddr; -} - -/* Convert a machine address to kernel virtual */ -static void *m2v(phys_addr_t maddr) -{ - return __ka(m2p(maddr)); -} - -static void set_page_prot(void *addr, pgprot_t prot) -{ - unsigned long pfn = __pa(addr) >> PAGE_SHIFT; - pte_t pte = pfn_pte(pfn, prot); - - if (HYPERVISOR_update_va_mapping((unsigned long)addr, pte, 0)) - BUG(); -} - -static __init void xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn) -{ - unsigned pmdidx, pteidx; - unsigned ident_pte; - unsigned long pfn; - - ident_pte = 0; - pfn = 0; - for (pmdidx = 0; pmdidx < PTRS_PER_PMD && pfn < max_pfn; pmdidx++) { - pte_t *pte_page; - - /* Reuse or allocate a page of ptes */ - if (pmd_present(pmd[pmdidx])) - pte_page = m2v(pmd[pmdidx].pmd); - else { - /* Check for free pte pages */ - if (ident_pte == ARRAY_SIZE(level1_ident_pgt)) - break; - - pte_page = &level1_ident_pgt[ident_pte]; - ident_pte += PTRS_PER_PTE; - - pmd[pmdidx] = __pmd(__pa(pte_page) | _PAGE_TABLE); - } - - /* Install mappings */ - for (pteidx = 0; pteidx < PTRS_PER_PTE; pteidx++, pfn++) { - pte_t pte; - - if (pfn > max_pfn_mapped) - max_pfn_mapped = pfn; - - if (!pte_none(pte_page[pteidx])) - continue; - - pte = pfn_pte(pfn, PAGE_KERNEL_EXEC); - pte_page[pteidx] = pte; - } - } - - for (pteidx = 0; pteidx < ident_pte; pteidx += PTRS_PER_PTE) - set_page_prot(&level1_ident_pgt[pteidx], PAGE_KERNEL_RO); - - set_page_prot(pmd, PAGE_KERNEL_RO); -} - -#ifdef CONFIG_X86_64 -static void convert_pfn_mfn(void *v) -{ - pte_t *pte = v; - int i; - - /* All levels are converted the same way, so just treat them - as ptes. */ - for (i = 0; i < PTRS_PER_PTE; i++) - pte[i] = xen_make_pte(pte[i].pte); -} - -/* - * Set up the inital kernel pagetable. - * - * We can construct this by grafting the Xen provided pagetable into - * head_64.S's preconstructed pagetables. We copy the Xen L2's into - * level2_ident_pgt, level2_kernel_pgt and level2_fixmap_pgt. This - * means that only the kernel has a physical mapping to start with - - * but that's enough to get __va working. We need to fill in the rest - * of the physical mapping once some sort of allocator has been set - * up. - */ -static __init pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd, - unsigned long max_pfn) -{ - pud_t *l3; - pmd_t *l2; - - /* Zap identity mapping */ - init_level4_pgt[0] = __pgd(0); - - /* Pre-constructed entries are in pfn, so convert to mfn */ - convert_pfn_mfn(init_level4_pgt); - convert_pfn_mfn(level3_ident_pgt); - convert_pfn_mfn(level3_kernel_pgt); - - l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd); - l2 = m2v(l3[pud_index(__START_KERNEL_map)].pud); - - memcpy(level2_ident_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD); - memcpy(level2_kernel_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD); - - l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd); - l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud); - memcpy(level2_fixmap_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD); - - /* Set up identity map */ - xen_map_identity_early(level2_ident_pgt, max_pfn); - - /* Make pagetable pieces RO */ - set_page_prot(init_level4_pgt, PAGE_KERNEL_RO); - set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO); - set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO); - set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO); - set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO); - set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO); - - /* Pin down new L4 */ - pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE, - PFN_DOWN(__pa_symbol(init_level4_pgt))); - - /* Unpin Xen-provided one */ - pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd))); - - /* Switch over */ - pgd = init_level4_pgt; - - /* - * At this stage there can be no user pgd, and no page - * structure to attach it to, so make sure we just set kernel - * pgd. - */ - xen_mc_batch(); - __xen_write_cr3(true, __pa(pgd)); - xen_mc_issue(PARAVIRT_LAZY_CPU); - - reserve_early(__pa(xen_start_info->pt_base), - __pa(xen_start_info->pt_base + - xen_start_info->nr_pt_frames * PAGE_SIZE), - "XEN PAGETABLES"); - - return pgd; -} -#else /* !CONFIG_X86_64 */ -static pmd_t level2_kernel_pgt[PTRS_PER_PMD] __page_aligned_bss; - -static __init pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd, - unsigned long max_pfn) -{ - pmd_t *kernel_pmd; - - init_pg_tables_start = __pa(pgd); - init_pg_tables_end = __pa(pgd) + xen_start_info->nr_pt_frames*PAGE_SIZE; - max_pfn_mapped = PFN_DOWN(init_pg_tables_end + 512*1024); - - kernel_pmd = m2v(pgd[KERNEL_PGD_BOUNDARY].pgd); - memcpy(level2_kernel_pgt, kernel_pmd, sizeof(pmd_t) * PTRS_PER_PMD); - - xen_map_identity_early(level2_kernel_pgt, max_pfn); - - memcpy(swapper_pg_dir, pgd, sizeof(pgd_t) * PTRS_PER_PGD); - set_pgd(&swapper_pg_dir[KERNEL_PGD_BOUNDARY], - __pgd(__pa(level2_kernel_pgt) | _PAGE_PRESENT)); - - set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO); - set_page_prot(swapper_pg_dir, PAGE_KERNEL_RO); - set_page_prot(empty_zero_page, PAGE_KERNEL_RO); - - pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd))); - - xen_write_cr3(__pa(swapper_pg_dir)); - - pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE, PFN_DOWN(__pa(swapper_pg_dir))); - - return swapper_pg_dir; -} -#endif /* CONFIG_X86_64 */ - /* First C function to be called on Xen boot */ asmlinkage void __init xen_start_kernel(void) { @@ -1639,7 +899,7 @@ asmlinkage void __init xen_start_kernel( /* * set up the basic apic ops. */ - apic_ops = &xen_basic_apic_ops; + set_xen_basic_apic_ops(); #endif if (xen_feature(XENFEAT_mmu_pt_update_preserve_ad)) { @@ -1650,10 +910,18 @@ asmlinkage void __init xen_start_kernel( machine_ops = xen_machine_ops; #ifdef CONFIG_X86_64 - /* Disable until direct per-cpu data access. */ - have_vcpu_info_placement = 0; - x86_64_init_pda(); + /* + * Setup percpu state. We only need to do this for 64-bit + * because 32-bit already has %fs set properly. + */ + load_percpu_segment(0); #endif + /* + * The only reliable way to retain the initial address of the + * percpu gdt_page is to remember it here, so we can go and + * mark it RW later, when the initial percpu area is freed. + */ + xen_initial_gdt = &per_cpu(gdt_page, 0); xen_smp_init(); Index: linux-2.6-tip/arch/x86/xen/irq.c =================================================================== --- linux-2.6-tip.orig/arch/x86/xen/irq.c +++ linux-2.6-tip/arch/x86/xen/irq.c @@ -19,27 +19,12 @@ void xen_force_evtchn_callback(void) (void)HYPERVISOR_xen_version(0, NULL); } -static void __init __xen_init_IRQ(void) -{ - int i; - - /* Create identity vector->irq map */ - for(i = 0; i < NR_VECTORS; i++) { - int cpu; - - for_each_possible_cpu(cpu) - per_cpu(vector_irq, cpu)[i] = i; - } - - xen_init_IRQ(); -} - static unsigned long xen_save_fl(void) { struct vcpu_info *vcpu; unsigned long flags; - vcpu = x86_read_percpu(xen_vcpu); + vcpu = percpu_read(xen_vcpu); /* flag has opposite sense of mask */ flags = !vcpu->evtchn_upcall_mask; @@ -50,6 +35,7 @@ static unsigned long xen_save_fl(void) */ return (-flags) & X86_EFLAGS_IF; } +PV_CALLEE_SAVE_REGS_THUNK(xen_save_fl); static void xen_restore_fl(unsigned long flags) { @@ -62,7 +48,7 @@ static void xen_restore_fl(unsigned long make sure we're don't switch CPUs between getting the vcpu pointer and updating the mask. */ preempt_disable(); - vcpu = x86_read_percpu(xen_vcpu); + vcpu = percpu_read(xen_vcpu); vcpu->evtchn_upcall_mask = flags; preempt_enable_no_resched(); @@ -76,6 +62,7 @@ static void xen_restore_fl(unsigned long xen_force_evtchn_callback(); } } +PV_CALLEE_SAVE_REGS_THUNK(xen_restore_fl); static void xen_irq_disable(void) { @@ -83,9 +70,10 @@ static void xen_irq_disable(void) make sure we're don't switch CPUs between getting the vcpu pointer and updating the mask. */ preempt_disable(); - x86_read_percpu(xen_vcpu)->evtchn_upcall_mask = 1; + percpu_read(xen_vcpu)->evtchn_upcall_mask = 1; preempt_enable_no_resched(); } +PV_CALLEE_SAVE_REGS_THUNK(xen_irq_disable); static void xen_irq_enable(void) { @@ -96,7 +84,7 @@ static void xen_irq_enable(void) the caller is confused and is trying to re-enable interrupts on an indeterminate processor. */ - vcpu = x86_read_percpu(xen_vcpu); + vcpu = percpu_read(xen_vcpu); vcpu->evtchn_upcall_mask = 0; /* Doesn't matter if we get preempted here, because any @@ -106,6 +94,7 @@ static void xen_irq_enable(void) if (unlikely(vcpu->evtchn_upcall_pending)) xen_force_evtchn_callback(); } +PV_CALLEE_SAVE_REGS_THUNK(xen_irq_enable); static void xen_safe_halt(void) { @@ -123,11 +112,13 @@ static void xen_halt(void) } static const struct pv_irq_ops xen_irq_ops __initdata = { - .init_IRQ = __xen_init_IRQ, - .save_fl = xen_save_fl, - .restore_fl = xen_restore_fl, - .irq_disable = xen_irq_disable, - .irq_enable = xen_irq_enable, + .init_IRQ = xen_init_IRQ, + + .save_fl = PV_CALLEE_SAVE(xen_save_fl), + .restore_fl = PV_CALLEE_SAVE(xen_restore_fl), + .irq_disable = PV_CALLEE_SAVE(xen_irq_disable), + .irq_enable = PV_CALLEE_SAVE(xen_irq_enable), + .safe_halt = xen_safe_halt, .halt = xen_halt, #ifdef CONFIG_X86_64 Index: linux-2.6-tip/arch/x86/xen/mmu.c =================================================================== --- linux-2.6-tip.orig/arch/x86/xen/mmu.c +++ linux-2.6-tip/arch/x86/xen/mmu.c @@ -47,6 +47,7 @@ #include #include #include +#include #include #include @@ -55,6 +56,8 @@ #include #include +#include +#include #include "multicalls.h" #include "mmu.h" @@ -114,6 +117,37 @@ static inline void check_zero(void) #endif /* CONFIG_XEN_DEBUG_FS */ + +/* + * Identity map, in addition to plain kernel map. This needs to be + * large enough to allocate page table pages to allocate the rest. + * Each page can map 2MB. + */ +static pte_t level1_ident_pgt[PTRS_PER_PTE * 4] __page_aligned_bss; + +#ifdef CONFIG_X86_64 +/* l3 pud for userspace vsyscall mapping */ +static pud_t level3_user_vsyscall[PTRS_PER_PUD] __page_aligned_bss; +#endif /* CONFIG_X86_64 */ + +/* + * Note about cr3 (pagetable base) values: + * + * xen_cr3 contains the current logical cr3 value; it contains the + * last set cr3. This may not be the current effective cr3, because + * its update may be being lazily deferred. However, a vcpu looking + * at its own cr3 can use this value knowing that it everything will + * be self-consistent. + * + * xen_current_cr3 contains the actual vcpu cr3; it is set once the + * hypercall to set the vcpu cr3 is complete (so it may be a little + * out of date, but it will never be set early). If one vcpu is + * looking at another vcpu's cr3 value, it should use this variable. + */ +DEFINE_PER_CPU(unsigned long, xen_cr3); /* cr3 stored as physaddr */ +DEFINE_PER_CPU(unsigned long, xen_current_cr3); /* actual vcpu cr3 */ + + /* * Just beyond the highest usermode address. STACK_TOP_MAX has a * redzone above it, so round it up to a PGD boundary. @@ -458,28 +492,33 @@ pteval_t xen_pte_val(pte_t pte) { return pte_mfn_to_pfn(pte.pte); } +PV_CALLEE_SAVE_REGS_THUNK(xen_pte_val); pgdval_t xen_pgd_val(pgd_t pgd) { return pte_mfn_to_pfn(pgd.pgd); } +PV_CALLEE_SAVE_REGS_THUNK(xen_pgd_val); pte_t xen_make_pte(pteval_t pte) { pte = pte_pfn_to_mfn(pte); return native_make_pte(pte); } +PV_CALLEE_SAVE_REGS_THUNK(xen_make_pte); pgd_t xen_make_pgd(pgdval_t pgd) { pgd = pte_pfn_to_mfn(pgd); return native_make_pgd(pgd); } +PV_CALLEE_SAVE_REGS_THUNK(xen_make_pgd); pmdval_t xen_pmd_val(pmd_t pmd) { return pte_mfn_to_pfn(pmd.pmd); } +PV_CALLEE_SAVE_REGS_THUNK(xen_pmd_val); void xen_set_pud_hyper(pud_t *ptr, pud_t val) { @@ -556,12 +595,14 @@ pmd_t xen_make_pmd(pmdval_t pmd) pmd = pte_pfn_to_mfn(pmd); return native_make_pmd(pmd); } +PV_CALLEE_SAVE_REGS_THUNK(xen_make_pmd); #if PAGETABLE_LEVELS == 4 pudval_t xen_pud_val(pud_t pud) { return pte_mfn_to_pfn(pud.pud); } +PV_CALLEE_SAVE_REGS_THUNK(xen_pud_val); pud_t xen_make_pud(pudval_t pud) { @@ -569,6 +610,7 @@ pud_t xen_make_pud(pudval_t pud) return native_make_pud(pud); } +PV_CALLEE_SAVE_REGS_THUNK(xen_make_pud); pgd_t *xen_get_user_pgd(pgd_t *pgd) { @@ -1063,18 +1105,14 @@ static void drop_other_mm_ref(void *info struct mm_struct *mm = info; struct mm_struct *active_mm; -#ifdef CONFIG_X86_64 - active_mm = read_pda(active_mm); -#else - active_mm = __get_cpu_var(cpu_tlbstate).active_mm; -#endif + active_mm = percpu_read(cpu_tlbstate.active_mm); if (active_mm == mm) leave_mm(smp_processor_id()); /* If this cpu still has a stale cr3 reference, then make sure it has been flushed. */ - if (x86_read_percpu(xen_current_cr3) == __pa(mm->pgd)) { + if (percpu_read(xen_current_cr3) == __pa(mm->pgd)) { load_cr3(swapper_pg_dir); arch_flush_lazy_cpu_mode(); } @@ -1156,6 +1194,706 @@ void xen_exit_mmap(struct mm_struct *mm) spin_unlock(&mm->page_table_lock); } +static __init void xen_pagetable_setup_start(pgd_t *base) +{ +} + +static __init void xen_pagetable_setup_done(pgd_t *base) +{ + xen_setup_shared_info(); +} + +static void xen_write_cr2(unsigned long cr2) +{ + percpu_read(xen_vcpu)->arch.cr2 = cr2; +} + +static unsigned long xen_read_cr2(void) +{ + return percpu_read(xen_vcpu)->arch.cr2; +} + +unsigned long xen_read_cr2_direct(void) +{ + return percpu_read(xen_vcpu_info.arch.cr2); +} + +static void xen_flush_tlb(void) +{ + struct mmuext_op *op; + struct multicall_space mcs; + + preempt_disable(); + + mcs = xen_mc_entry(sizeof(*op)); + + op = mcs.args; + op->cmd = MMUEXT_TLB_FLUSH_LOCAL; + MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF); + + xen_mc_issue(PARAVIRT_LAZY_MMU); + + preempt_enable(); +} + +static void xen_flush_tlb_single(unsigned long addr) +{ + struct mmuext_op *op; + struct multicall_space mcs; + + preempt_disable(); + + mcs = xen_mc_entry(sizeof(*op)); + op = mcs.args; + op->cmd = MMUEXT_INVLPG_LOCAL; + op->arg1.linear_addr = addr & PAGE_MASK; + MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF); + + xen_mc_issue(PARAVIRT_LAZY_MMU); + + preempt_enable(); +} + +static void xen_flush_tlb_others(const struct cpumask *cpus, + struct mm_struct *mm, unsigned long va) +{ + struct { + struct mmuext_op op; + DECLARE_BITMAP(mask, NR_CPUS); + } *args; + struct multicall_space mcs; + + BUG_ON(cpumask_empty(cpus)); + BUG_ON(!mm); + + mcs = xen_mc_entry(sizeof(*args)); + args = mcs.args; + args->op.arg2.vcpumask = to_cpumask(args->mask); + + /* Remove us, and any offline CPUS. */ + cpumask_and(to_cpumask(args->mask), cpus, cpu_online_mask); + cpumask_clear_cpu(smp_processor_id(), to_cpumask(args->mask)); + + if (va == TLB_FLUSH_ALL) { + args->op.cmd = MMUEXT_TLB_FLUSH_MULTI; + } else { + args->op.cmd = MMUEXT_INVLPG_MULTI; + args->op.arg1.linear_addr = va; + } + + MULTI_mmuext_op(mcs.mc, &args->op, 1, NULL, DOMID_SELF); + + xen_mc_issue(PARAVIRT_LAZY_MMU); +} + +static unsigned long xen_read_cr3(void) +{ + return percpu_read(xen_cr3); +} + +static void set_current_cr3(void *v) +{ + percpu_write(xen_current_cr3, (unsigned long)v); +} + +static void __xen_write_cr3(bool kernel, unsigned long cr3) +{ + struct mmuext_op *op; + struct multicall_space mcs; + unsigned long mfn; + + if (cr3) + mfn = pfn_to_mfn(PFN_DOWN(cr3)); + else + mfn = 0; + + WARN_ON(mfn == 0 && kernel); + + mcs = __xen_mc_entry(sizeof(*op)); + + op = mcs.args; + op->cmd = kernel ? MMUEXT_NEW_BASEPTR : MMUEXT_NEW_USER_BASEPTR; + op->arg1.mfn = mfn; + + MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF); + + if (kernel) { + percpu_write(xen_cr3, cr3); + + /* Update xen_current_cr3 once the batch has actually + been submitted. */ + xen_mc_callback(set_current_cr3, (void *)cr3); + } +} + +static void xen_write_cr3(unsigned long cr3) +{ + BUG_ON(preemptible()); + + xen_mc_batch(); /* disables interrupts */ + + /* Update while interrupts are disabled, so its atomic with + respect to ipis */ + percpu_write(xen_cr3, cr3); + + __xen_write_cr3(true, cr3); + +#ifdef CONFIG_X86_64 + { + pgd_t *user_pgd = xen_get_user_pgd(__va(cr3)); + if (user_pgd) + __xen_write_cr3(false, __pa(user_pgd)); + else + __xen_write_cr3(false, 0); + } +#endif + + xen_mc_issue(PARAVIRT_LAZY_CPU); /* interrupts restored */ +} + +static int xen_pgd_alloc(struct mm_struct *mm) +{ + pgd_t *pgd = mm->pgd; + int ret = 0; + + BUG_ON(PagePinned(virt_to_page(pgd))); + +#ifdef CONFIG_X86_64 + { + struct page *page = virt_to_page(pgd); + pgd_t *user_pgd; + + BUG_ON(page->private != 0); + + ret = -ENOMEM; + + user_pgd = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO); + page->private = (unsigned long)user_pgd; + + if (user_pgd != NULL) { + user_pgd[pgd_index(VSYSCALL_START)] = + __pgd(__pa(level3_user_vsyscall) | _PAGE_TABLE); + ret = 0; + } + + BUG_ON(PagePinned(virt_to_page(xen_get_user_pgd(pgd)))); + } +#endif + + return ret; +} + +static void xen_pgd_free(struct mm_struct *mm, pgd_t *pgd) +{ +#ifdef CONFIG_X86_64 + pgd_t *user_pgd = xen_get_user_pgd(pgd); + + if (user_pgd) + free_page((unsigned long)user_pgd); +#endif +} + +#ifdef CONFIG_HIGHPTE +static void *xen_kmap_atomic_pte(struct page *page, enum km_type type) +{ + pgprot_t prot = PAGE_KERNEL; + + if (PagePinned(page)) + prot = PAGE_KERNEL_RO; + + if (0 && PageHighMem(page)) + printk("mapping highpte %lx type %d prot %s\n", + page_to_pfn(page), type, + (unsigned long)pgprot_val(prot) & _PAGE_RW ? "WRITE" : "READ"); + + return kmap_atomic_prot(page, type, prot); +} +#endif + +#ifdef CONFIG_X86_32 +static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte) +{ + /* If there's an existing pte, then don't allow _PAGE_RW to be set */ + if (pte_val_ma(*ptep) & _PAGE_PRESENT) + pte = __pte_ma(((pte_val_ma(*ptep) & _PAGE_RW) | ~_PAGE_RW) & + pte_val_ma(pte)); + + return pte; +} + +/* Init-time set_pte while constructing initial pagetables, which + doesn't allow RO pagetable pages to be remapped RW */ +static __init void xen_set_pte_init(pte_t *ptep, pte_t pte) +{ + pte = mask_rw_pte(ptep, pte); + + xen_set_pte(ptep, pte); +} +#endif + +/* Early in boot, while setting up the initial pagetable, assume + everything is pinned. */ +static __init void xen_alloc_pte_init(struct mm_struct *mm, unsigned long pfn) +{ +#ifdef CONFIG_FLATMEM + BUG_ON(mem_map); /* should only be used early */ +#endif + make_lowmem_page_readonly(__va(PFN_PHYS(pfn))); +} + +/* Early release_pte assumes that all pts are pinned, since there's + only init_mm and anything attached to that is pinned. */ +static void xen_release_pte_init(unsigned long pfn) +{ + make_lowmem_page_readwrite(__va(PFN_PHYS(pfn))); +} + +static void pin_pagetable_pfn(unsigned cmd, unsigned long pfn) +{ + struct mmuext_op op; + op.cmd = cmd; + op.arg1.mfn = pfn_to_mfn(pfn); + if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF)) + BUG(); +} + +/* This needs to make sure the new pte page is pinned iff its being + attached to a pinned pagetable. */ +static void xen_alloc_ptpage(struct mm_struct *mm, unsigned long pfn, unsigned level) +{ + struct page *page = pfn_to_page(pfn); + + if (PagePinned(virt_to_page(mm->pgd))) { + SetPagePinned(page); + + vm_unmap_aliases(); + if (!PageHighMem(page)) { + make_lowmem_page_readonly(__va(PFN_PHYS((unsigned long)pfn))); + if (level == PT_PTE && USE_SPLIT_PTLOCKS) + pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE, pfn); + } else { + /* make sure there are no stray mappings of + this page */ + kmap_flush_unused(); + } + } +} + +static void xen_alloc_pte(struct mm_struct *mm, unsigned long pfn) +{ + xen_alloc_ptpage(mm, pfn, PT_PTE); +} + +static void xen_alloc_pmd(struct mm_struct *mm, unsigned long pfn) +{ + xen_alloc_ptpage(mm, pfn, PT_PMD); +} + +/* This should never happen until we're OK to use struct page */ +static void xen_release_ptpage(unsigned long pfn, unsigned level) +{ + struct page *page = pfn_to_page(pfn); + + if (PagePinned(page)) { + if (!PageHighMem(page)) { + if (level == PT_PTE && USE_SPLIT_PTLOCKS) + pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, pfn); + make_lowmem_page_readwrite(__va(PFN_PHYS(pfn))); + } + ClearPagePinned(page); + } +} + +static void xen_release_pte(unsigned long pfn) +{ + xen_release_ptpage(pfn, PT_PTE); +} + +static void xen_release_pmd(unsigned long pfn) +{ + xen_release_ptpage(pfn, PT_PMD); +} + +#if PAGETABLE_LEVELS == 4 +static void xen_alloc_pud(struct mm_struct *mm, unsigned long pfn) +{ + xen_alloc_ptpage(mm, pfn, PT_PUD); +} + +static void xen_release_pud(unsigned long pfn) +{ + xen_release_ptpage(pfn, PT_PUD); +} +#endif + +void __init xen_reserve_top(void) +{ +#ifdef CONFIG_X86_32 + unsigned long top = HYPERVISOR_VIRT_START; + struct xen_platform_parameters pp; + + if (HYPERVISOR_xen_version(XENVER_platform_parameters, &pp) == 0) + top = pp.virt_start; + + reserve_top_address(-top); +#endif /* CONFIG_X86_32 */ +} + +/* + * Like __va(), but returns address in the kernel mapping (which is + * all we have until the physical memory mapping has been set up. + */ +static void *__ka(phys_addr_t paddr) +{ +#ifdef CONFIG_X86_64 + return (void *)(paddr + __START_KERNEL_map); +#else + return __va(paddr); +#endif +} + +/* Convert a machine address to physical address */ +static unsigned long m2p(phys_addr_t maddr) +{ + phys_addr_t paddr; + + maddr &= PTE_PFN_MASK; + paddr = mfn_to_pfn(maddr >> PAGE_SHIFT) << PAGE_SHIFT; + + return paddr; +} + +/* Convert a machine address to kernel virtual */ +static void *m2v(phys_addr_t maddr) +{ + return __ka(m2p(maddr)); +} + +static void set_page_prot(void *addr, pgprot_t prot) +{ + unsigned long pfn = __pa(addr) >> PAGE_SHIFT; + pte_t pte = pfn_pte(pfn, prot); + + if (HYPERVISOR_update_va_mapping((unsigned long)addr, pte, 0)) + BUG(); +} + +static __init void xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn) +{ + unsigned pmdidx, pteidx; + unsigned ident_pte; + unsigned long pfn; + + ident_pte = 0; + pfn = 0; + for (pmdidx = 0; pmdidx < PTRS_PER_PMD && pfn < max_pfn; pmdidx++) { + pte_t *pte_page; + + /* Reuse or allocate a page of ptes */ + if (pmd_present(pmd[pmdidx])) + pte_page = m2v(pmd[pmdidx].pmd); + else { + /* Check for free pte pages */ + if (ident_pte == ARRAY_SIZE(level1_ident_pgt)) + break; + + pte_page = &level1_ident_pgt[ident_pte]; + ident_pte += PTRS_PER_PTE; + + pmd[pmdidx] = __pmd(__pa(pte_page) | _PAGE_TABLE); + } + + /* Install mappings */ + for (pteidx = 0; pteidx < PTRS_PER_PTE; pteidx++, pfn++) { + pte_t pte; + + if (pfn > max_pfn_mapped) + max_pfn_mapped = pfn; + + if (!pte_none(pte_page[pteidx])) + continue; + + pte = pfn_pte(pfn, PAGE_KERNEL_EXEC); + pte_page[pteidx] = pte; + } + } + + for (pteidx = 0; pteidx < ident_pte; pteidx += PTRS_PER_PTE) + set_page_prot(&level1_ident_pgt[pteidx], PAGE_KERNEL_RO); + + set_page_prot(pmd, PAGE_KERNEL_RO); +} + +#ifdef CONFIG_X86_64 +static void convert_pfn_mfn(void *v) +{ + pte_t *pte = v; + int i; + + /* All levels are converted the same way, so just treat them + as ptes. */ + for (i = 0; i < PTRS_PER_PTE; i++) + pte[i] = xen_make_pte(pte[i].pte); +} + +/* + * Set up the inital kernel pagetable. + * + * We can construct this by grafting the Xen provided pagetable into + * head_64.S's preconstructed pagetables. We copy the Xen L2's into + * level2_ident_pgt, level2_kernel_pgt and level2_fixmap_pgt. This + * means that only the kernel has a physical mapping to start with - + * but that's enough to get __va working. We need to fill in the rest + * of the physical mapping once some sort of allocator has been set + * up. + */ +__init pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd, + unsigned long max_pfn) +{ + pud_t *l3; + pmd_t *l2; + + /* Zap identity mapping */ + init_level4_pgt[0] = __pgd(0); + + /* Pre-constructed entries are in pfn, so convert to mfn */ + convert_pfn_mfn(init_level4_pgt); + convert_pfn_mfn(level3_ident_pgt); + convert_pfn_mfn(level3_kernel_pgt); + + l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd); + l2 = m2v(l3[pud_index(__START_KERNEL_map)].pud); + + memcpy(level2_ident_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD); + memcpy(level2_kernel_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD); + + l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd); + l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud); + memcpy(level2_fixmap_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD); + + /* Set up identity map */ + xen_map_identity_early(level2_ident_pgt, max_pfn); + + /* Make pagetable pieces RO */ + set_page_prot(init_level4_pgt, PAGE_KERNEL_RO); + set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO); + set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO); + set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO); + set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO); + set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO); + + /* Pin down new L4 */ + pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE, + PFN_DOWN(__pa_symbol(init_level4_pgt))); + + /* Unpin Xen-provided one */ + pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd))); + + /* Switch over */ + pgd = init_level4_pgt; + + /* + * At this stage there can be no user pgd, and no page + * structure to attach it to, so make sure we just set kernel + * pgd. + */ + xen_mc_batch(); + __xen_write_cr3(true, __pa(pgd)); + xen_mc_issue(PARAVIRT_LAZY_CPU); + + reserve_early(__pa(xen_start_info->pt_base), + __pa(xen_start_info->pt_base + + xen_start_info->nr_pt_frames * PAGE_SIZE), + "XEN PAGETABLES"); + + return pgd; +} +#else /* !CONFIG_X86_64 */ +static pmd_t level2_kernel_pgt[PTRS_PER_PMD] __page_aligned_bss; + +__init pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd, + unsigned long max_pfn) +{ + pmd_t *kernel_pmd; + + init_pg_tables_start = __pa(pgd); + init_pg_tables_end = __pa(pgd) + xen_start_info->nr_pt_frames*PAGE_SIZE; + max_pfn_mapped = PFN_DOWN(init_pg_tables_end + 512*1024); + + kernel_pmd = m2v(pgd[KERNEL_PGD_BOUNDARY].pgd); + memcpy(level2_kernel_pgt, kernel_pmd, sizeof(pmd_t) * PTRS_PER_PMD); + + xen_map_identity_early(level2_kernel_pgt, max_pfn); + + memcpy(swapper_pg_dir, pgd, sizeof(pgd_t) * PTRS_PER_PGD); + set_pgd(&swapper_pg_dir[KERNEL_PGD_BOUNDARY], + __pgd(__pa(level2_kernel_pgt) | _PAGE_PRESENT)); + + set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO); + set_page_prot(swapper_pg_dir, PAGE_KERNEL_RO); + set_page_prot(empty_zero_page, PAGE_KERNEL_RO); + + pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd))); + + xen_write_cr3(__pa(swapper_pg_dir)); + + pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE, PFN_DOWN(__pa(swapper_pg_dir))); + + return swapper_pg_dir; +} +#endif /* CONFIG_X86_64 */ + +static void xen_set_fixmap(unsigned idx, unsigned long phys, pgprot_t prot) +{ + pte_t pte; + + phys >>= PAGE_SHIFT; + + switch (idx) { + case FIX_BTMAP_END ... FIX_BTMAP_BEGIN: +#ifdef CONFIG_X86_F00F_BUG + case FIX_F00F_IDT: +#endif +#ifdef CONFIG_X86_32 + case FIX_WP_TEST: + case FIX_VDSO: +# ifdef CONFIG_HIGHMEM + case FIX_KMAP_BEGIN ... FIX_KMAP_END: +# endif +#else + case VSYSCALL_LAST_PAGE ... VSYSCALL_FIRST_PAGE: +#endif +#ifdef CONFIG_X86_LOCAL_APIC + case FIX_APIC_BASE: /* maps dummy local APIC */ +#endif + pte = pfn_pte(phys, prot); + break; + + default: + pte = mfn_pte(phys, prot); + break; + } + + __native_set_fixmap(idx, pte); + +#ifdef CONFIG_X86_64 + /* Replicate changes to map the vsyscall page into the user + pagetable vsyscall mapping. */ + if (idx >= VSYSCALL_LAST_PAGE && idx <= VSYSCALL_FIRST_PAGE) { + unsigned long vaddr = __fix_to_virt(idx); + set_pte_vaddr_pud(level3_user_vsyscall, vaddr, pte); + } +#endif +} + +__init void xen_post_allocator_init(void) +{ + pv_mmu_ops.set_pte = xen_set_pte; + pv_mmu_ops.set_pmd = xen_set_pmd; + pv_mmu_ops.set_pud = xen_set_pud; +#if PAGETABLE_LEVELS == 4 + pv_mmu_ops.set_pgd = xen_set_pgd; +#endif + + /* This will work as long as patching hasn't happened yet + (which it hasn't) */ + pv_mmu_ops.alloc_pte = xen_alloc_pte; + pv_mmu_ops.alloc_pmd = xen_alloc_pmd; + pv_mmu_ops.release_pte = xen_release_pte; + pv_mmu_ops.release_pmd = xen_release_pmd; +#if PAGETABLE_LEVELS == 4 + pv_mmu_ops.alloc_pud = xen_alloc_pud; + pv_mmu_ops.release_pud = xen_release_pud; +#endif + +#ifdef CONFIG_X86_64 + SetPagePinned(virt_to_page(level3_user_vsyscall)); +#endif + xen_mark_init_mm_pinned(); +} + + +const struct pv_mmu_ops xen_mmu_ops __initdata = { + .pagetable_setup_start = xen_pagetable_setup_start, + .pagetable_setup_done = xen_pagetable_setup_done, + + .read_cr2 = xen_read_cr2, + .write_cr2 = xen_write_cr2, + + .read_cr3 = xen_read_cr3, + .write_cr3 = xen_write_cr3, + + .flush_tlb_user = xen_flush_tlb, + .flush_tlb_kernel = xen_flush_tlb, + .flush_tlb_single = xen_flush_tlb_single, + .flush_tlb_others = xen_flush_tlb_others, + + .pte_update = paravirt_nop, + .pte_update_defer = paravirt_nop, + + .pgd_alloc = xen_pgd_alloc, + .pgd_free = xen_pgd_free, + + .alloc_pte = xen_alloc_pte_init, + .release_pte = xen_release_pte_init, + .alloc_pmd = xen_alloc_pte_init, + .alloc_pmd_clone = paravirt_nop, + .release_pmd = xen_release_pte_init, + +#ifdef CONFIG_HIGHPTE + .kmap_atomic_pte = xen_kmap_atomic_pte, +#endif + +#ifdef CONFIG_X86_64 + .set_pte = xen_set_pte, +#else + .set_pte = xen_set_pte_init, +#endif + .set_pte_at = xen_set_pte_at, + .set_pmd = xen_set_pmd_hyper, + + .ptep_modify_prot_start = __ptep_modify_prot_start, + .ptep_modify_prot_commit = __ptep_modify_prot_commit, + + .pte_val = PV_CALLEE_SAVE(xen_pte_val), + .pgd_val = PV_CALLEE_SAVE(xen_pgd_val), + + .make_pte = PV_CALLEE_SAVE(xen_make_pte), + .make_pgd = PV_CALLEE_SAVE(xen_make_pgd), + +#ifdef CONFIG_X86_PAE + .set_pte_atomic = xen_set_pte_atomic, + .set_pte_present = xen_set_pte_at, + .pte_clear = xen_pte_clear, + .pmd_clear = xen_pmd_clear, +#endif /* CONFIG_X86_PAE */ + .set_pud = xen_set_pud_hyper, + + .make_pmd = PV_CALLEE_SAVE(xen_make_pmd), + .pmd_val = PV_CALLEE_SAVE(xen_pmd_val), + +#if PAGETABLE_LEVELS == 4 + .pud_val = PV_CALLEE_SAVE(xen_pud_val), + .make_pud = PV_CALLEE_SAVE(xen_make_pud), + .set_pgd = xen_set_pgd_hyper, + + .alloc_pud = xen_alloc_pte_init, + .release_pud = xen_release_pte_init, +#endif /* PAGETABLE_LEVELS == 4 */ + + .activate_mm = xen_activate_mm, + .dup_mmap = xen_dup_mmap, + .exit_mmap = xen_exit_mmap, + + .lazy_mode = { + .enter = paravirt_enter_lazy_mmu, + .leave = xen_leave_lazy, + }, + + .set_fixmap = xen_set_fixmap, +}; + + #ifdef CONFIG_XEN_DEBUG_FS static struct dentry *d_mmu_debug; Index: linux-2.6-tip/arch/x86/xen/mmu.h =================================================================== --- linux-2.6-tip.orig/arch/x86/xen/mmu.h +++ linux-2.6-tip/arch/x86/xen/mmu.h @@ -54,4 +54,7 @@ pte_t xen_ptep_modify_prot_start(struct void xen_ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte); +unsigned long xen_read_cr2_direct(void); + +extern const struct pv_mmu_ops xen_mmu_ops; #endif /* _XEN_MMU_H */ Index: linux-2.6-tip/arch/x86/xen/multicalls.c =================================================================== --- linux-2.6-tip.orig/arch/x86/xen/multicalls.c +++ linux-2.6-tip/arch/x86/xen/multicalls.c @@ -39,6 +39,7 @@ struct mc_buffer { struct multicall_entry entries[MC_BATCH]; #if MC_DEBUG struct multicall_entry debug[MC_BATCH]; + void *caller[MC_BATCH]; #endif unsigned char args[MC_ARGS]; struct callback { @@ -154,11 +155,12 @@ void xen_mc_flush(void) ret, smp_processor_id()); dump_stack(); for (i = 0; i < b->mcidx; i++) { - printk(KERN_DEBUG " call %2d/%d: op=%lu arg=[%lx] result=%ld\n", + printk(KERN_DEBUG " call %2d/%d: op=%lu arg=[%lx] result=%ld\t%pF\n", i+1, b->mcidx, b->debug[i].op, b->debug[i].args[0], - b->entries[i].result); + b->entries[i].result, + b->caller[i]); } } #endif @@ -168,8 +170,6 @@ void xen_mc_flush(void) } else BUG_ON(b->argidx != 0); - local_irq_restore(flags); - for (i = 0; i < b->cbidx; i++) { struct callback *cb = &b->callbacks[i]; @@ -177,7 +177,9 @@ void xen_mc_flush(void) } b->cbidx = 0; - BUG_ON(ret); + local_irq_restore(flags); + + WARN_ON(ret); } struct multicall_space __xen_mc_entry(size_t args) @@ -197,6 +199,9 @@ struct multicall_space __xen_mc_entry(si } ret.mc = &b->entries[b->mcidx]; +#ifdef MC_DEBUG + b->caller[b->mcidx] = __builtin_return_address(0); +#endif b->mcidx++; ret.args = &b->args[argidx]; b->argidx = argidx + args; Index: linux-2.6-tip/arch/x86/xen/multicalls.h =================================================================== --- linux-2.6-tip.orig/arch/x86/xen/multicalls.h +++ linux-2.6-tip/arch/x86/xen/multicalls.h @@ -41,7 +41,7 @@ static inline void xen_mc_issue(unsigned xen_mc_flush(); /* restore flags saved in xen_mc_batch */ - local_irq_restore(x86_read_percpu(xen_mc_irq_flags)); + local_irq_restore(percpu_read(xen_mc_irq_flags)); } /* Set up a callback to be called when the current batch is flushed */ Index: linux-2.6-tip/arch/x86/xen/smp.c =================================================================== --- linux-2.6-tip.orig/arch/x86/xen/smp.c +++ linux-2.6-tip/arch/x86/xen/smp.c @@ -50,11 +50,7 @@ static irqreturn_t xen_call_function_sin */ static irqreturn_t xen_reschedule_interrupt(int irq, void *dev_id) { -#ifdef CONFIG_X86_32 - __get_cpu_var(irq_stat).irq_resched_count++; -#else - add_pda(irq_resched_count, 1); -#endif + inc_irq_stat(irq_resched_count); return IRQ_HANDLED; } @@ -78,7 +74,7 @@ static __cpuinit void cpu_bringup(void) xen_setup_cpu_clockevents(); cpu_set(cpu, cpu_online_map); - x86_write_percpu(cpu_state, CPU_ONLINE); + percpu_write(cpu_state, CPU_ONLINE); wmb(); /* We can take interrupts now: we're officially "up". */ @@ -174,7 +170,7 @@ static void __init xen_smp_prepare_boot_ /* We've switched to the "real" per-cpu gdt, so make sure the old memory can be recycled */ - make_lowmem_page_readwrite(&per_cpu_var(gdt_page)); + make_lowmem_page_readwrite(xen_initial_gdt); xen_setup_vcpu_info_placement(); } @@ -239,6 +235,8 @@ cpu_initialize_context(unsigned int cpu, ctxt->user_regs.ss = __KERNEL_DS; #ifdef CONFIG_X86_32 ctxt->user_regs.fs = __KERNEL_PERCPU; +#else + ctxt->gs_base_kernel = per_cpu_offset(cpu); #endif ctxt->user_regs.eip = (unsigned long)cpu_bringup_and_idle; ctxt->user_regs.eflags = 0x1000; /* IOPL_RING1 */ @@ -283,23 +281,14 @@ static int __cpuinit xen_cpu_up(unsigned struct task_struct *idle = idle_task(cpu); int rc; -#ifdef CONFIG_X86_64 - /* Allocate node local memory for AP pdas */ - WARN_ON(cpu == 0); - if (cpu > 0) { - rc = get_local_pda(cpu); - if (rc) - return rc; - } -#endif - -#ifdef CONFIG_X86_32 - init_gdt(cpu); per_cpu(current_task, cpu) = idle; +#ifdef CONFIG_X86_32 irq_ctx_init(cpu); #else - cpu_pda(cpu)->pcurrent = idle; clear_tsk_thread_flag(idle, TIF_FORK); + per_cpu(kernel_stack, cpu) = + (unsigned long)task_stack_page(idle) - + KERNEL_STACK_OFFSET + THREAD_SIZE; #endif xen_setup_timer(cpu); xen_init_lock_cpu(cpu); @@ -445,11 +434,7 @@ static irqreturn_t xen_call_function_int { irq_enter(); generic_smp_call_function_interrupt(); -#ifdef CONFIG_X86_32 - __get_cpu_var(irq_stat).irq_call_count++; -#else - add_pda(irq_call_count, 1); -#endif + inc_irq_stat(irq_call_count); irq_exit(); return IRQ_HANDLED; @@ -459,11 +444,7 @@ static irqreturn_t xen_call_function_sin { irq_enter(); generic_smp_call_function_single_interrupt(); -#ifdef CONFIG_X86_32 - __get_cpu_var(irq_stat).irq_call_count++; -#else - add_pda(irq_call_count, 1); -#endif + inc_irq_stat(irq_call_count); irq_exit(); return IRQ_HANDLED; Index: linux-2.6-tip/arch/x86/xen/suspend.c =================================================================== --- linux-2.6-tip.orig/arch/x86/xen/suspend.c +++ linux-2.6-tip/arch/x86/xen/suspend.c @@ -6,6 +6,7 @@ #include #include +#include #include "xen-ops.h" #include "mmu.h" Index: linux-2.6-tip/arch/x86/xen/xen-asm.S =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/xen/xen-asm.S @@ -0,0 +1,142 @@ +/* + * Asm versions of Xen pv-ops, suitable for either direct use or + * inlining. The inline versions are the same as the direct-use + * versions, with the pre- and post-amble chopped off. + * + * This code is encoded for size rather than absolute efficiency, with + * a view to being able to inline as much as possible. + * + * We only bother with direct forms (ie, vcpu in percpu data) of the + * operations here; the indirect forms are better handled in C, since + * they're generally too large to inline anyway. + */ + +#include +#include +#include + +#include "xen-asm.h" + +/* + * Enable events. This clears the event mask and tests the pending + * event status with one and operation. If there are pending events, + * then enter the hypervisor to get them handled. + */ +ENTRY(xen_irq_enable_direct) + /* Unmask events */ + movb $0, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask + + /* + * Preempt here doesn't matter because that will deal with any + * pending interrupts. The pending check may end up being run + * on the wrong CPU, but that doesn't hurt. + */ + + /* Test for pending */ + testb $0xff, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_pending + jz 1f + +2: call check_events +1: +ENDPATCH(xen_irq_enable_direct) + ret + ENDPROC(xen_irq_enable_direct) + RELOC(xen_irq_enable_direct, 2b+1) + + +/* + * Disabling events is simply a matter of making the event mask + * non-zero. + */ +ENTRY(xen_irq_disable_direct) + movb $1, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask +ENDPATCH(xen_irq_disable_direct) + ret + ENDPROC(xen_irq_disable_direct) + RELOC(xen_irq_disable_direct, 0) + +/* + * (xen_)save_fl is used to get the current interrupt enable status. + * Callers expect the status to be in X86_EFLAGS_IF, and other bits + * may be set in the return value. We take advantage of this by + * making sure that X86_EFLAGS_IF has the right value (and other bits + * in that byte are 0), but other bits in the return value are + * undefined. We need to toggle the state of the bit, because Xen and + * x86 use opposite senses (mask vs enable). + */ +ENTRY(xen_save_fl_direct) + testb $0xff, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask + setz %ah + addb %ah, %ah +ENDPATCH(xen_save_fl_direct) + ret + ENDPROC(xen_save_fl_direct) + RELOC(xen_save_fl_direct, 0) + + +/* + * In principle the caller should be passing us a value return from + * xen_save_fl_direct, but for robustness sake we test only the + * X86_EFLAGS_IF flag rather than the whole byte. After setting the + * interrupt mask state, it checks for unmasked pending events and + * enters the hypervisor to get them delivered if so. + */ +ENTRY(xen_restore_fl_direct) +#ifdef CONFIG_X86_64 + testw $X86_EFLAGS_IF, %di +#else + testb $X86_EFLAGS_IF>>8, %ah +#endif + setz PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask + /* + * Preempt here doesn't matter because that will deal with any + * pending interrupts. The pending check may end up being run + * on the wrong CPU, but that doesn't hurt. + */ + + /* check for unmasked and pending */ + cmpw $0x0001, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_pending + jz 1f +2: call check_events +1: +ENDPATCH(xen_restore_fl_direct) + ret + ENDPROC(xen_restore_fl_direct) + RELOC(xen_restore_fl_direct, 2b+1) + + +/* + * Force an event check by making a hypercall, but preserve regs + * before making the call. + */ +check_events: +#ifdef CONFIG_X86_32 + push %eax + push %ecx + push %edx + call xen_force_evtchn_callback + pop %edx + pop %ecx + pop %eax +#else + push %rax + push %rcx + push %rdx + push %rsi + push %rdi + push %r8 + push %r9 + push %r10 + push %r11 + call xen_force_evtchn_callback + pop %r11 + pop %r10 + pop %r9 + pop %r8 + pop %rdi + pop %rsi + pop %rdx + pop %rcx + pop %rax +#endif + ret Index: linux-2.6-tip/arch/x86/xen/xen-asm.h =================================================================== --- /dev/null +++ linux-2.6-tip/arch/x86/xen/xen-asm.h @@ -0,0 +1,12 @@ +#ifndef _XEN_XEN_ASM_H +#define _XEN_XEN_ASM_H + +#include + +#define RELOC(x, v) .globl x##_reloc; x##_reloc=v +#define ENDPATCH(x) .globl x##_end; x##_end=. + +/* Pseudo-flag used for virtual NMI, which we don't implement yet */ +#define XEN_EFLAGS_NMI 0x80000000 + +#endif Index: linux-2.6-tip/arch/x86/xen/xen-asm_32.S =================================================================== --- linux-2.6-tip.orig/arch/x86/xen/xen-asm_32.S +++ linux-2.6-tip/arch/x86/xen/xen-asm_32.S @@ -1,117 +1,43 @@ /* - Asm versions of Xen pv-ops, suitable for either direct use or inlining. - The inline versions are the same as the direct-use versions, with the - pre- and post-amble chopped off. - - This code is encoded for size rather than absolute efficiency, - with a view to being able to inline as much as possible. - - We only bother with direct forms (ie, vcpu in pda) of the operations - here; the indirect forms are better handled in C, since they're - generally too large to inline anyway. + * Asm versions of Xen pv-ops, suitable for either direct use or + * inlining. The inline versions are the same as the direct-use + * versions, with the pre- and post-amble chopped off. + * + * This code is encoded for size rather than absolute efficiency, with + * a view to being able to inline as much as possible. + * + * We only bother with direct forms (ie, vcpu in pda) of the + * operations here; the indirect forms are better handled in C, since + * they're generally too large to inline anyway. */ -#include - -#include #include -#include #include #include #include -#define RELOC(x, v) .globl x##_reloc; x##_reloc=v -#define ENDPATCH(x) .globl x##_end; x##_end=. - -/* Pseudo-flag used for virtual NMI, which we don't implement yet */ -#define XEN_EFLAGS_NMI 0x80000000 - -/* - Enable events. This clears the event mask and tests the pending - event status with one and operation. If there are pending - events, then enter the hypervisor to get them handled. - */ -ENTRY(xen_irq_enable_direct) - /* Unmask events */ - movb $0, PER_CPU_VAR(xen_vcpu_info)+XEN_vcpu_info_mask - - /* Preempt here doesn't matter because that will deal with - any pending interrupts. The pending check may end up being - run on the wrong CPU, but that doesn't hurt. */ - - /* Test for pending */ - testb $0xff, PER_CPU_VAR(xen_vcpu_info)+XEN_vcpu_info_pending - jz 1f - -2: call check_events -1: -ENDPATCH(xen_irq_enable_direct) - ret - ENDPROC(xen_irq_enable_direct) - RELOC(xen_irq_enable_direct, 2b+1) - +#include "xen-asm.h" /* - Disabling events is simply a matter of making the event mask - non-zero. + * Force an event check by making a hypercall, but preserve regs + * before making the call. */ -ENTRY(xen_irq_disable_direct) - movb $1, PER_CPU_VAR(xen_vcpu_info)+XEN_vcpu_info_mask -ENDPATCH(xen_irq_disable_direct) - ret - ENDPROC(xen_irq_disable_direct) - RELOC(xen_irq_disable_direct, 0) - -/* - (xen_)save_fl is used to get the current interrupt enable status. - Callers expect the status to be in X86_EFLAGS_IF, and other bits - may be set in the return value. We take advantage of this by - making sure that X86_EFLAGS_IF has the right value (and other bits - in that byte are 0), but other bits in the return value are - undefined. We need to toggle the state of the bit, because - Xen and x86 use opposite senses (mask vs enable). - */ -ENTRY(xen_save_fl_direct) - testb $0xff, PER_CPU_VAR(xen_vcpu_info)+XEN_vcpu_info_mask - setz %ah - addb %ah,%ah -ENDPATCH(xen_save_fl_direct) - ret - ENDPROC(xen_save_fl_direct) - RELOC(xen_save_fl_direct, 0) - - -/* - In principle the caller should be passing us a value return - from xen_save_fl_direct, but for robustness sake we test only - the X86_EFLAGS_IF flag rather than the whole byte. After - setting the interrupt mask state, it checks for unmasked - pending events and enters the hypervisor to get them delivered - if so. - */ -ENTRY(xen_restore_fl_direct) - testb $X86_EFLAGS_IF>>8, %ah - setz PER_CPU_VAR(xen_vcpu_info)+XEN_vcpu_info_mask - /* Preempt here doesn't matter because that will deal with - any pending interrupts. The pending check may end up being - run on the wrong CPU, but that doesn't hurt. */ - - /* check for unmasked and pending */ - cmpw $0x0001, PER_CPU_VAR(xen_vcpu_info)+XEN_vcpu_info_pending - jz 1f -2: call check_events -1: -ENDPATCH(xen_restore_fl_direct) +check_events: + push %eax + push %ecx + push %edx + call xen_force_evtchn_callback + pop %edx + pop %ecx + pop %eax ret - ENDPROC(xen_restore_fl_direct) - RELOC(xen_restore_fl_direct, 2b+1) /* - We can't use sysexit directly, because we're not running in ring0. - But we can easily fake it up using iret. Assuming xen_sysexit - is jumped to with a standard stack frame, we can just strip it - back to a standard iret frame and use iret. + * We can't use sysexit directly, because we're not running in ring0. + * But we can easily fake it up using iret. Assuming xen_sysexit is + * jumped to with a standard stack frame, we can just strip it back to + * a standard iret frame and use iret. */ ENTRY(xen_sysexit) movl PT_EAX(%esp), %eax /* Shouldn't be necessary? */ @@ -122,33 +48,31 @@ ENTRY(xen_sysexit) ENDPROC(xen_sysexit) /* - This is run where a normal iret would be run, with the same stack setup: - 8: eflags - 4: cs - esp-> 0: eip - - This attempts to make sure that any pending events are dealt - with on return to usermode, but there is a small window in - which an event can happen just before entering usermode. If - the nested interrupt ends up setting one of the TIF_WORK_MASK - pending work flags, they will not be tested again before - returning to usermode. This means that a process can end up - with pending work, which will be unprocessed until the process - enters and leaves the kernel again, which could be an - unbounded amount of time. This means that a pending signal or - reschedule event could be indefinitely delayed. - - The fix is to notice a nested interrupt in the critical - window, and if one occurs, then fold the nested interrupt into - the current interrupt stack frame, and re-process it - iteratively rather than recursively. This means that it will - exit via the normal path, and all pending work will be dealt - with appropriately. - - Because the nested interrupt handler needs to deal with the - current stack state in whatever form its in, we keep things - simple by only using a single register which is pushed/popped - on the stack. + * This is run where a normal iret would be run, with the same stack setup: + * 8: eflags + * 4: cs + * esp-> 0: eip + * + * This attempts to make sure that any pending events are dealt with + * on return to usermode, but there is a small window in which an + * event can happen just before entering usermode. If the nested + * interrupt ends up setting one of the TIF_WORK_MASK pending work + * flags, they will not be tested again before returning to + * usermode. This means that a process can end up with pending work, + * which will be unprocessed until the process enters and leaves the + * kernel again, which could be an unbounded amount of time. This + * means that a pending signal or reschedule event could be + * indefinitely delayed. + * + * The fix is to notice a nested interrupt in the critical window, and + * if one occurs, then fold the nested interrupt into the current + * interrupt stack frame, and re-process it iteratively rather than + * recursively. This means that it will exit via the normal path, and + * all pending work will be dealt with appropriately. + * + * Because the nested interrupt handler needs to deal with the current + * stack state in whatever form its in, we keep things simple by only + * using a single register which is pushed/popped on the stack. */ ENTRY(xen_iret) /* test eflags for special cases */ @@ -158,13 +82,15 @@ ENTRY(xen_iret) push %eax ESP_OFFSET=4 # bytes pushed onto stack - /* Store vcpu_info pointer for easy access. Do it this - way to avoid having to reload %fs */ + /* + * Store vcpu_info pointer for easy access. Do it this way to + * avoid having to reload %fs + */ #ifdef CONFIG_SMP GET_THREAD_INFO(%eax) - movl TI_cpu(%eax),%eax - movl __per_cpu_offset(,%eax,4),%eax - mov per_cpu__xen_vcpu(%eax),%eax + movl TI_cpu(%eax), %eax + movl __per_cpu_offset(,%eax,4), %eax + mov per_cpu__xen_vcpu(%eax), %eax #else movl per_cpu__xen_vcpu, %eax #endif @@ -172,37 +98,46 @@ ENTRY(xen_iret) /* check IF state we're restoring */ testb $X86_EFLAGS_IF>>8, 8+1+ESP_OFFSET(%esp) - /* Maybe enable events. Once this happens we could get a - recursive event, so the critical region starts immediately - afterwards. However, if that happens we don't end up - resuming the code, so we don't have to be worried about - being preempted to another CPU. */ + /* + * Maybe enable events. Once this happens we could get a + * recursive event, so the critical region starts immediately + * afterwards. However, if that happens we don't end up + * resuming the code, so we don't have to be worried about + * being preempted to another CPU. + */ setz XEN_vcpu_info_mask(%eax) xen_iret_start_crit: /* check for unmasked and pending */ cmpw $0x0001, XEN_vcpu_info_pending(%eax) - /* If there's something pending, mask events again so we - can jump back into xen_hypervisor_callback */ + /* + * If there's something pending, mask events again so we can + * jump back into xen_hypervisor_callback + */ sete XEN_vcpu_info_mask(%eax) popl %eax - /* From this point on the registers are restored and the stack - updated, so we don't need to worry about it if we're preempted */ + /* + * From this point on the registers are restored and the stack + * updated, so we don't need to worry about it if we're + * preempted + */ iret_restore_end: - /* Jump to hypervisor_callback after fixing up the stack. - Events are masked, so jumping out of the critical - region is OK. */ + /* + * Jump to hypervisor_callback after fixing up the stack. + * Events are masked, so jumping out of the critical region is + * OK. + */ je xen_hypervisor_callback 1: iret xen_iret_end_crit: -.section __ex_table,"a" +.section __ex_table, "a" .align 4 - .long 1b,iret_exc + .long 1b, iret_exc .previous hyper_iret: @@ -212,55 +147,55 @@ hyper_iret: .globl xen_iret_start_crit, xen_iret_end_crit /* - This is called by xen_hypervisor_callback in entry.S when it sees - that the EIP at the time of interrupt was between xen_iret_start_crit - and xen_iret_end_crit. We're passed the EIP in %eax so we can do - a more refined determination of what to do. - - The stack format at this point is: - ---------------- - ss : (ss/esp may be present if we came from usermode) - esp : - eflags } outer exception info - cs } - eip } - ---------------- <- edi (copy dest) - eax : outer eax if it hasn't been restored - ---------------- - eflags } nested exception info - cs } (no ss/esp because we're nested - eip } from the same ring) - orig_eax }<- esi (copy src) - - - - - - - - - - fs } - es } - ds } SAVE_ALL state - eax } - : : - ebx }<- esp - ---------------- - - In order to deliver the nested exception properly, we need to shift - everything from the return addr up to the error code so it - sits just under the outer exception info. This means that when we - handle the exception, we do it in the context of the outer exception - rather than starting a new one. - - The only caveat is that if the outer eax hasn't been - restored yet (ie, it's still on stack), we need to insert - its value into the SAVE_ALL state before going on, since - it's usermode state which we eventually need to restore. + * This is called by xen_hypervisor_callback in entry.S when it sees + * that the EIP at the time of interrupt was between + * xen_iret_start_crit and xen_iret_end_crit. We're passed the EIP in + * %eax so we can do a more refined determination of what to do. + * + * The stack format at this point is: + * ---------------- + * ss : (ss/esp may be present if we came from usermode) + * esp : + * eflags } outer exception info + * cs } + * eip } + * ---------------- <- edi (copy dest) + * eax : outer eax if it hasn't been restored + * ---------------- + * eflags } nested exception info + * cs } (no ss/esp because we're nested + * eip } from the same ring) + * orig_eax }<- esi (copy src) + * - - - - - - - - + * fs } + * es } + * ds } SAVE_ALL state + * eax } + * : : + * ebx }<- esp + * ---------------- + * + * In order to deliver the nested exception properly, we need to shift + * everything from the return addr up to the error code so it sits + * just under the outer exception info. This means that when we + * handle the exception, we do it in the context of the outer + * exception rather than starting a new one. + * + * The only caveat is that if the outer eax hasn't been restored yet + * (ie, it's still on stack), we need to insert its value into the + * SAVE_ALL state before going on, since it's usermode state which we + * eventually need to restore. */ ENTRY(xen_iret_crit_fixup) /* - Paranoia: Make sure we're really coming from kernel space. - One could imagine a case where userspace jumps into the - critical range address, but just before the CPU delivers a GP, - it decides to deliver an interrupt instead. Unlikely? - Definitely. Easy to avoid? Yes. The Intel documents - explicitly say that the reported EIP for a bad jump is the - jump instruction itself, not the destination, but some virtual - environments get this wrong. + * Paranoia: Make sure we're really coming from kernel space. + * One could imagine a case where userspace jumps into the + * critical range address, but just before the CPU delivers a + * GP, it decides to deliver an interrupt instead. Unlikely? + * Definitely. Easy to avoid? Yes. The Intel documents + * explicitly say that the reported EIP for a bad jump is the + * jump instruction itself, not the destination, but some + * virtual environments get this wrong. */ movl PT_CS(%esp), %ecx andl $SEGMENT_RPL_MASK, %ecx @@ -270,15 +205,17 @@ ENTRY(xen_iret_crit_fixup) lea PT_ORIG_EAX(%esp), %esi lea PT_EFLAGS(%esp), %edi - /* If eip is before iret_restore_end then stack - hasn't been restored yet. */ + /* + * If eip is before iret_restore_end then stack + * hasn't been restored yet. + */ cmp $iret_restore_end, %eax jae 1f - movl 0+4(%edi),%eax /* copy EAX (just above top of frame) */ + movl 0+4(%edi), %eax /* copy EAX (just above top of frame) */ movl %eax, PT_EAX(%esp) - lea ESP_OFFSET(%edi),%edi /* move dest up over saved regs */ + lea ESP_OFFSET(%edi), %edi /* move dest up over saved regs */ /* set up the copy */ 1: std @@ -286,20 +223,6 @@ ENTRY(xen_iret_crit_fixup) rep movsl cld - lea 4(%edi),%esp /* point esp to new frame */ + lea 4(%edi), %esp /* point esp to new frame */ 2: jmp xen_do_upcall - -/* - Force an event check by making a hypercall, - but preserve regs before making the call. - */ -check_events: - push %eax - push %ecx - push %edx - call xen_force_evtchn_callback - pop %edx - pop %ecx - pop %eax - ret Index: linux-2.6-tip/arch/x86/xen/xen-asm_64.S =================================================================== --- linux-2.6-tip.orig/arch/x86/xen/xen-asm_64.S +++ linux-2.6-tip/arch/x86/xen/xen-asm_64.S @@ -1,174 +1,45 @@ /* - Asm versions of Xen pv-ops, suitable for either direct use or inlining. - The inline versions are the same as the direct-use versions, with the - pre- and post-amble chopped off. - - This code is encoded for size rather than absolute efficiency, - with a view to being able to inline as much as possible. - - We only bother with direct forms (ie, vcpu in pda) of the operations - here; the indirect forms are better handled in C, since they're - generally too large to inline anyway. + * Asm versions of Xen pv-ops, suitable for either direct use or + * inlining. The inline versions are the same as the direct-use + * versions, with the pre- and post-amble chopped off. + * + * This code is encoded for size rather than absolute efficiency, with + * a view to being able to inline as much as possible. + * + * We only bother with direct forms (ie, vcpu in pda) of the + * operations here; the indirect forms are better handled in C, since + * they're generally too large to inline anyway. */ -#include - -#include -#include #include +#include +#include #include #include -#define RELOC(x, v) .globl x##_reloc; x##_reloc=v -#define ENDPATCH(x) .globl x##_end; x##_end=. - -/* Pseudo-flag used for virtual NMI, which we don't implement yet */ -#define XEN_EFLAGS_NMI 0x80000000 - -#if 1 -/* - x86-64 does not yet support direct access to percpu variables - via a segment override, so we just need to make sure this code - never gets used - */ -#define BUG ud2a -#define PER_CPU_VAR(var, off) 0xdeadbeef -#endif - -/* - Enable events. This clears the event mask and tests the pending - event status with one and operation. If there are pending - events, then enter the hypervisor to get them handled. - */ -ENTRY(xen_irq_enable_direct) - BUG - - /* Unmask events */ - movb $0, PER_CPU_VAR(xen_vcpu_info, XEN_vcpu_info_mask) - - /* Preempt here doesn't matter because that will deal with - any pending interrupts. The pending check may end up being - run on the wrong CPU, but that doesn't hurt. */ - - /* Test for pending */ - testb $0xff, PER_CPU_VAR(xen_vcpu_info, XEN_vcpu_info_pending) - jz 1f - -2: call check_events -1: -ENDPATCH(xen_irq_enable_direct) - ret - ENDPROC(xen_irq_enable_direct) - RELOC(xen_irq_enable_direct, 2b+1) - -/* - Disabling events is simply a matter of making the event mask - non-zero. - */ -ENTRY(xen_irq_disable_direct) - BUG - - movb $1, PER_CPU_VAR(xen_vcpu_info, XEN_vcpu_info_mask) -ENDPATCH(xen_irq_disable_direct) - ret - ENDPROC(xen_irq_disable_direct) - RELOC(xen_irq_disable_direct, 0) - -/* - (xen_)save_fl is used to get the current interrupt enable status. - Callers expect the status to be in X86_EFLAGS_IF, and other bits - may be set in the return value. We take advantage of this by - making sure that X86_EFLAGS_IF has the right value (and other bits - in that byte are 0), but other bits in the return value are - undefined. We need to toggle the state of the bit, because - Xen and x86 use opposite senses (mask vs enable). - */ -ENTRY(xen_save_fl_direct) - BUG - - testb $0xff, PER_CPU_VAR(xen_vcpu_info, XEN_vcpu_info_mask) - setz %ah - addb %ah,%ah -ENDPATCH(xen_save_fl_direct) - ret - ENDPROC(xen_save_fl_direct) - RELOC(xen_save_fl_direct, 0) - -/* - In principle the caller should be passing us a value return - from xen_save_fl_direct, but for robustness sake we test only - the X86_EFLAGS_IF flag rather than the whole byte. After - setting the interrupt mask state, it checks for unmasked - pending events and enters the hypervisor to get them delivered - if so. - */ -ENTRY(xen_restore_fl_direct) - BUG - - testb $X86_EFLAGS_IF>>8, %ah - setz PER_CPU_VAR(xen_vcpu_info, XEN_vcpu_info_mask) - /* Preempt here doesn't matter because that will deal with - any pending interrupts. The pending check may end up being - run on the wrong CPU, but that doesn't hurt. */ - - /* check for unmasked and pending */ - cmpw $0x0001, PER_CPU_VAR(xen_vcpu_info, XEN_vcpu_info_pending) - jz 1f -2: call check_events -1: -ENDPATCH(xen_restore_fl_direct) - ret - ENDPROC(xen_restore_fl_direct) - RELOC(xen_restore_fl_direct, 2b+1) - - -/* - Force an event check by making a hypercall, - but preserve regs before making the call. - */ -check_events: - push %rax - push %rcx - push %rdx - push %rsi - push %rdi - push %r8 - push %r9 - push %r10 - push %r11 - call xen_force_evtchn_callback - pop %r11 - pop %r10 - pop %r9 - pop %r8 - pop %rdi - pop %rsi - pop %rdx - pop %rcx - pop %rax - ret +#include "xen-asm.h" ENTRY(xen_adjust_exception_frame) - mov 8+0(%rsp),%rcx - mov 8+8(%rsp),%r11 + mov 8+0(%rsp), %rcx + mov 8+8(%rsp), %r11 ret $16 hypercall_iret = hypercall_page + __HYPERVISOR_iret * 32 /* - Xen64 iret frame: - - ss - rsp - rflags - cs - rip <-- standard iret frame - - flags - - rcx } - r11 }<-- pushed by hypercall page -rsp -> rax } + * Xen64 iret frame: + * + * ss + * rsp + * rflags + * cs + * rip <-- standard iret frame + * + * flags + * + * rcx } + * r11 }<-- pushed by hypercall page + * rsp->rax } */ ENTRY(xen_iret) pushq $0 @@ -177,8 +48,8 @@ ENDPATCH(xen_iret) RELOC(xen_iret, 1b+1) /* - sysexit is not used for 64-bit processes, so it's - only ever used to return to 32-bit compat userspace. + * sysexit is not used for 64-bit processes, so it's only ever used to + * return to 32-bit compat userspace. */ ENTRY(xen_sysexit) pushq $__USER32_DS @@ -193,13 +64,15 @@ ENDPATCH(xen_sysexit) RELOC(xen_sysexit, 1b+1) ENTRY(xen_sysret64) - /* We're already on the usermode stack at this point, but still - with the kernel gs, so we can easily switch back */ - movq %rsp, %gs:pda_oldrsp - movq %gs:pda_kernelstack,%rsp + /* + * We're already on the usermode stack at this point, but + * still with the kernel gs, so we can easily switch back + */ + movq %rsp, PER_CPU_VAR(old_rsp) + movq PER_CPU_VAR(kernel_stack), %rsp pushq $__USER_DS - pushq %gs:pda_oldrsp + pushq PER_CPU_VAR(old_rsp) pushq %r11 pushq $__USER_CS pushq %rcx @@ -210,13 +83,15 @@ ENDPATCH(xen_sysret64) RELOC(xen_sysret64, 1b+1) ENTRY(xen_sysret32) - /* We're already on the usermode stack at this point, but still - with the kernel gs, so we can easily switch back */ - movq %rsp, %gs:pda_oldrsp - movq %gs:pda_kernelstack, %rsp + /* + * We're already on the usermode stack at this point, but + * still with the kernel gs, so we can easily switch back + */ + movq %rsp, PER_CPU_VAR(old_rsp) + movq PER_CPU_VAR(kernel_stack), %rsp pushq $__USER32_DS - pushq %gs:pda_oldrsp + pushq PER_CPU_VAR(old_rsp) pushq %r11 pushq $__USER32_CS pushq %rcx @@ -227,28 +102,27 @@ ENDPATCH(xen_sysret32) RELOC(xen_sysret32, 1b+1) /* - Xen handles syscall callbacks much like ordinary exceptions, - which means we have: - - kernel gs - - kernel rsp - - an iret-like stack frame on the stack (including rcx and r11): - ss - rsp - rflags - cs - rip - r11 - rsp-> rcx - - In all the entrypoints, we undo all that to make it look - like a CPU-generated syscall/sysenter and jump to the normal - entrypoint. + * Xen handles syscall callbacks much like ordinary exceptions, which + * means we have: + * - kernel gs + * - kernel rsp + * - an iret-like stack frame on the stack (including rcx and r11): + * ss + * rsp + * rflags + * cs + * rip + * r11 + * rsp->rcx + * + * In all the entrypoints, we undo all that to make it look like a + * CPU-generated syscall/sysenter and jump to the normal entrypoint. */ .macro undo_xen_syscall - mov 0*8(%rsp),%rcx - mov 1*8(%rsp),%r11 - mov 5*8(%rsp),%rsp + mov 0*8(%rsp), %rcx + mov 1*8(%rsp), %r11 + mov 5*8(%rsp), %rsp .endm /* Normal 64-bit system call target */ @@ -275,7 +149,7 @@ ENDPROC(xen_sysenter_target) ENTRY(xen_syscall32_target) ENTRY(xen_sysenter_target) - lea 16(%rsp), %rsp /* strip %rcx,%r11 */ + lea 16(%rsp), %rsp /* strip %rcx, %r11 */ mov $-ENOSYS, %rax pushq $VGCF_in_syscall jmp hypercall_iret Index: linux-2.6-tip/arch/x86/xen/xen-head.S =================================================================== --- linux-2.6-tip.orig/arch/x86/xen/xen-head.S +++ linux-2.6-tip/arch/x86/xen/xen-head.S @@ -8,7 +8,7 @@ #include #include -#include +#include #include #include Index: linux-2.6-tip/arch/x86/xen/xen-ops.h =================================================================== --- linux-2.6-tip.orig/arch/x86/xen/xen-ops.h +++ linux-2.6-tip/arch/x86/xen/xen-ops.h @@ -10,9 +10,12 @@ extern const char xen_hypervisor_callback[]; extern const char xen_failsafe_callback[]; +extern void *xen_initial_gdt; + struct trap_info; void xen_copy_trap_info(struct trap_info *traps); +DECLARE_PER_CPU(struct vcpu_info, xen_vcpu_info); DECLARE_PER_CPU(unsigned long, xen_cr3); DECLARE_PER_CPU(unsigned long, xen_current_cr3); @@ -22,6 +25,13 @@ extern struct shared_info *HYPERVISOR_sh void xen_setup_mfn_list_list(void); void xen_setup_shared_info(void); +void xen_setup_machphys_mapping(void); +pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn); +void xen_ident_map_ISA(void); +void xen_reserve_top(void); + +void xen_leave_lazy(void); +void xen_post_allocator_init(void); char * __init xen_memory_setup(void); void __init xen_arch_setup(void); Index: linux-2.6-tip/arch/xtensa/include/asm/swab.h =================================================================== --- linux-2.6-tip.orig/arch/xtensa/include/asm/swab.h +++ linux-2.6-tip/arch/xtensa/include/asm/swab.h @@ -11,7 +11,7 @@ #ifndef _XTENSA_SWAB_H #define _XTENSA_SWAB_H -#include +#include #include #define __SWAB_64_THRU_32__ Index: linux-2.6-tip/arch/xtensa/kernel/irq.c =================================================================== --- linux-2.6-tip.orig/arch/xtensa/kernel/irq.c +++ linux-2.6-tip/arch/xtensa/kernel/irq.c @@ -99,7 +99,7 @@ int show_interrupts(struct seq_file *p, seq_printf(p, "%10u ", kstat_irqs(i)); #else for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]); + seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); #endif seq_printf(p, " %14s", irq_desc[i].chip->typename); seq_printf(p, " %s", action->name); Index: linux-2.6-tip/block/Kconfig =================================================================== --- linux-2.6-tip.orig/block/Kconfig +++ linux-2.6-tip/block/Kconfig @@ -44,22 +44,6 @@ config LBD If unsure, say N. -config BLK_DEV_IO_TRACE - bool "Support for tracing block io actions" - depends on SYSFS - select RELAY - select DEBUG_FS - select TRACEPOINTS - help - Say Y here if you want to be able to trace the block layer actions - on a given queue. Tracing allows you to see any traffic happening - on a block device queue. For more information (and the userspace - support tools needed), fetch the blktrace tools from: - - git://git.kernel.dk/blktrace.git - - If unsure, say N. - config BLK_DEV_BSG bool "Block layer SG support v4 (EXPERIMENTAL)" depends on EXPERIMENTAL Index: linux-2.6-tip/block/Makefile =================================================================== --- linux-2.6-tip.orig/block/Makefile +++ linux-2.6-tip/block/Makefile @@ -13,6 +13,5 @@ obj-$(CONFIG_IOSCHED_AS) += as-iosched.o obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o -obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o Index: linux-2.6-tip/block/blktrace.c =================================================================== --- linux-2.6-tip.orig/block/blktrace.c +++ /dev/null @@ -1,860 +0,0 @@ -/* - * Copyright (C) 2006 Jens Axboe - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA - * - */ -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -static unsigned int blktrace_seq __read_mostly = 1; - -/* Global reference count of probes */ -static DEFINE_MUTEX(blk_probe_mutex); -static atomic_t blk_probes_ref = ATOMIC_INIT(0); - -static int blk_register_tracepoints(void); -static void blk_unregister_tracepoints(void); - -/* - * Send out a notify message. - */ -static void trace_note(struct blk_trace *bt, pid_t pid, int action, - const void *data, size_t len) -{ - struct blk_io_trace *t; - - t = relay_reserve(bt->rchan, sizeof(*t) + len); - if (t) { - const int cpu = smp_processor_id(); - - t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION; - t->time = ktime_to_ns(ktime_get()); - t->device = bt->dev; - t->action = action; - t->pid = pid; - t->cpu = cpu; - t->pdu_len = len; - memcpy((void *) t + sizeof(*t), data, len); - } -} - -/* - * Send out a notify for this process, if we haven't done so since a trace - * started - */ -static void trace_note_tsk(struct blk_trace *bt, struct task_struct *tsk) -{ - tsk->btrace_seq = blktrace_seq; - trace_note(bt, tsk->pid, BLK_TN_PROCESS, tsk->comm, sizeof(tsk->comm)); -} - -static void trace_note_time(struct blk_trace *bt) -{ - struct timespec now; - unsigned long flags; - u32 words[2]; - - getnstimeofday(&now); - words[0] = now.tv_sec; - words[1] = now.tv_nsec; - - local_irq_save(flags); - trace_note(bt, 0, BLK_TN_TIMESTAMP, words, sizeof(words)); - local_irq_restore(flags); -} - -void __trace_note_message(struct blk_trace *bt, const char *fmt, ...) -{ - int n; - va_list args; - unsigned long flags; - char *buf; - - local_irq_save(flags); - buf = per_cpu_ptr(bt->msg_data, smp_processor_id()); - va_start(args, fmt); - n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args); - va_end(args); - - trace_note(bt, 0, BLK_TN_MESSAGE, buf, n); - local_irq_restore(flags); -} -EXPORT_SYMBOL_GPL(__trace_note_message); - -static int act_log_check(struct blk_trace *bt, u32 what, sector_t sector, - pid_t pid) -{ - if (((bt->act_mask << BLK_TC_SHIFT) & what) == 0) - return 1; - if (sector < bt->start_lba || sector > bt->end_lba) - return 1; - if (bt->pid && pid != bt->pid) - return 1; - - return 0; -} - -/* - * Data direction bit lookup - */ -static u32 ddir_act[2] __read_mostly = { BLK_TC_ACT(BLK_TC_READ), BLK_TC_ACT(BLK_TC_WRITE) }; - -/* The ilog2() calls fall out because they're constant */ -#define MASK_TC_BIT(rw, __name) ( (rw & (1 << BIO_RW_ ## __name)) << \ - (ilog2(BLK_TC_ ## __name) + BLK_TC_SHIFT - BIO_RW_ ## __name) ) - -/* - * The worker for the various blk_add_trace*() types. Fills out a - * blk_io_trace structure and places it in a per-cpu subbuffer. - */ -static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes, - int rw, u32 what, int error, int pdu_len, void *pdu_data) -{ - struct task_struct *tsk = current; - struct blk_io_trace *t; - unsigned long flags; - unsigned long *sequence; - pid_t pid; - int cpu; - - if (unlikely(bt->trace_state != Blktrace_running)) - return; - - what |= ddir_act[rw & WRITE]; - what |= MASK_TC_BIT(rw, BARRIER); - what |= MASK_TC_BIT(rw, SYNCIO); - what |= MASK_TC_BIT(rw, AHEAD); - what |= MASK_TC_BIT(rw, META); - what |= MASK_TC_BIT(rw, DISCARD); - - pid = tsk->pid; - if (unlikely(act_log_check(bt, what, sector, pid))) - return; - - /* - * A word about the locking here - we disable interrupts to reserve - * some space in the relay per-cpu buffer, to prevent an irq - * from coming in and stepping on our toes. - */ - local_irq_save(flags); - - if (unlikely(tsk->btrace_seq != blktrace_seq)) - trace_note_tsk(bt, tsk); - - t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len); - if (t) { - cpu = smp_processor_id(); - sequence = per_cpu_ptr(bt->sequence, cpu); - - t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION; - t->sequence = ++(*sequence); - t->time = ktime_to_ns(ktime_get()); - t->sector = sector; - t->bytes = bytes; - t->action = what; - t->pid = pid; - t->device = bt->dev; - t->cpu = cpu; - t->error = error; - t->pdu_len = pdu_len; - - if (pdu_len) - memcpy((void *) t + sizeof(*t), pdu_data, pdu_len); - } - - local_irq_restore(flags); -} - -static struct dentry *blk_tree_root; -static DEFINE_MUTEX(blk_tree_mutex); - -static void blk_trace_cleanup(struct blk_trace *bt) -{ - debugfs_remove(bt->msg_file); - debugfs_remove(bt->dropped_file); - relay_close(bt->rchan); - free_percpu(bt->sequence); - free_percpu(bt->msg_data); - kfree(bt); - mutex_lock(&blk_probe_mutex); - if (atomic_dec_and_test(&blk_probes_ref)) - blk_unregister_tracepoints(); - mutex_unlock(&blk_probe_mutex); -} - -int blk_trace_remove(struct request_queue *q) -{ - struct blk_trace *bt; - - bt = xchg(&q->blk_trace, NULL); - if (!bt) - return -EINVAL; - - if (bt->trace_state == Blktrace_setup || - bt->trace_state == Blktrace_stopped) - blk_trace_cleanup(bt); - - return 0; -} -EXPORT_SYMBOL_GPL(blk_trace_remove); - -static int blk_dropped_open(struct inode *inode, struct file *filp) -{ - filp->private_data = inode->i_private; - - return 0; -} - -static ssize_t blk_dropped_read(struct file *filp, char __user *buffer, - size_t count, loff_t *ppos) -{ - struct blk_trace *bt = filp->private_data; - char buf[16]; - - snprintf(buf, sizeof(buf), "%u\n", atomic_read(&bt->dropped)); - - return simple_read_from_buffer(buffer, count, ppos, buf, strlen(buf)); -} - -static const struct file_operations blk_dropped_fops = { - .owner = THIS_MODULE, - .open = blk_dropped_open, - .read = blk_dropped_read, -}; - -static int blk_msg_open(struct inode *inode, struct file *filp) -{ - filp->private_data = inode->i_private; - - return 0; -} - -static ssize_t blk_msg_write(struct file *filp, const char __user *buffer, - size_t count, loff_t *ppos) -{ - char *msg; - struct blk_trace *bt; - - if (count > BLK_TN_MAX_MSG) - return -EINVAL; - - msg = kmalloc(count, GFP_KERNEL); - if (msg == NULL) - return -ENOMEM; - - if (copy_from_user(msg, buffer, count)) { - kfree(msg); - return -EFAULT; - } - - bt = filp->private_data; - __trace_note_message(bt, "%s", msg); - kfree(msg); - - return count; -} - -static const struct file_operations blk_msg_fops = { - .owner = THIS_MODULE, - .open = blk_msg_open, - .write = blk_msg_write, -}; - -/* - * Keep track of how many times we encountered a full subbuffer, to aid - * the user space app in telling how many lost events there were. - */ -static int blk_subbuf_start_callback(struct rchan_buf *buf, void *subbuf, - void *prev_subbuf, size_t prev_padding) -{ - struct blk_trace *bt; - - if (!relay_buf_full(buf)) - return 1; - - bt = buf->chan->private_data; - atomic_inc(&bt->dropped); - return 0; -} - -static int blk_remove_buf_file_callback(struct dentry *dentry) -{ - struct dentry *parent = dentry->d_parent; - debugfs_remove(dentry); - - /* - * this will fail for all but the last file, but that is ok. what we - * care about is the top level buts->name directory going away, when - * the last trace file is gone. Then we don't have to rmdir() that - * manually on trace stop, so it nicely solves the issue with - * force killing of running traces. - */ - - debugfs_remove(parent); - return 0; -} - -static struct dentry *blk_create_buf_file_callback(const char *filename, - struct dentry *parent, - int mode, - struct rchan_buf *buf, - int *is_global) -{ - return debugfs_create_file(filename, mode, parent, buf, - &relay_file_operations); -} - -static struct rchan_callbacks blk_relay_callbacks = { - .subbuf_start = blk_subbuf_start_callback, - .create_buf_file = blk_create_buf_file_callback, - .remove_buf_file = blk_remove_buf_file_callback, -}; - -/* - * Setup everything required to start tracing - */ -int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev, - struct blk_user_trace_setup *buts) -{ - struct blk_trace *old_bt, *bt = NULL; - struct dentry *dir = NULL; - int ret, i; - - if (!buts->buf_size || !buts->buf_nr) - return -EINVAL; - - strncpy(buts->name, name, BLKTRACE_BDEV_SIZE); - buts->name[BLKTRACE_BDEV_SIZE - 1] = '\0'; - - /* - * some device names have larger paths - convert the slashes - * to underscores for this to work as expected - */ - for (i = 0; i < strlen(buts->name); i++) - if (buts->name[i] == '/') - buts->name[i] = '_'; - - ret = -ENOMEM; - bt = kzalloc(sizeof(*bt), GFP_KERNEL); - if (!bt) - goto err; - - bt->sequence = alloc_percpu(unsigned long); - if (!bt->sequence) - goto err; - - bt->msg_data = __alloc_percpu(BLK_TN_MAX_MSG); - if (!bt->msg_data) - goto err; - - ret = -ENOENT; - - if (!blk_tree_root) { - blk_tree_root = debugfs_create_dir("block", NULL); - if (!blk_tree_root) - return -ENOMEM; - } - - dir = debugfs_create_dir(buts->name, blk_tree_root); - - if (!dir) - goto err; - - bt->dir = dir; - bt->dev = dev; - atomic_set(&bt->dropped, 0); - - ret = -EIO; - bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt, &blk_dropped_fops); - if (!bt->dropped_file) - goto err; - - bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops); - if (!bt->msg_file) - goto err; - - bt->rchan = relay_open("trace", dir, buts->buf_size, - buts->buf_nr, &blk_relay_callbacks, bt); - if (!bt->rchan) - goto err; - - bt->act_mask = buts->act_mask; - if (!bt->act_mask) - bt->act_mask = (u16) -1; - - bt->start_lba = buts->start_lba; - bt->end_lba = buts->end_lba; - if (!bt->end_lba) - bt->end_lba = -1ULL; - - bt->pid = buts->pid; - bt->trace_state = Blktrace_setup; - - mutex_lock(&blk_probe_mutex); - if (atomic_add_return(1, &blk_probes_ref) == 1) { - ret = blk_register_tracepoints(); - if (ret) - goto probe_err; - } - mutex_unlock(&blk_probe_mutex); - - ret = -EBUSY; - old_bt = xchg(&q->blk_trace, bt); - if (old_bt) { - (void) xchg(&q->blk_trace, old_bt); - goto err; - } - - return 0; -probe_err: - atomic_dec(&blk_probes_ref); - mutex_unlock(&blk_probe_mutex); -err: - if (bt) { - if (bt->msg_file) - debugfs_remove(bt->msg_file); - if (bt->dropped_file) - debugfs_remove(bt->dropped_file); - free_percpu(bt->sequence); - free_percpu(bt->msg_data); - if (bt->rchan) - relay_close(bt->rchan); - kfree(bt); - } - return ret; -} - -int blk_trace_setup(struct request_queue *q, char *name, dev_t dev, - char __user *arg) -{ - struct blk_user_trace_setup buts; - int ret; - - ret = copy_from_user(&buts, arg, sizeof(buts)); - if (ret) - return -EFAULT; - - ret = do_blk_trace_setup(q, name, dev, &buts); - if (ret) - return ret; - - if (copy_to_user(arg, &buts, sizeof(buts))) - return -EFAULT; - - return 0; -} -EXPORT_SYMBOL_GPL(blk_trace_setup); - -int blk_trace_startstop(struct request_queue *q, int start) -{ - struct blk_trace *bt; - int ret; - - if ((bt = q->blk_trace) == NULL) - return -EINVAL; - - /* - * For starting a trace, we can transition from a setup or stopped - * trace. For stopping a trace, the state must be running - */ - ret = -EINVAL; - if (start) { - if (bt->trace_state == Blktrace_setup || - bt->trace_state == Blktrace_stopped) { - blktrace_seq++; - smp_mb(); - bt->trace_state = Blktrace_running; - - trace_note_time(bt); - ret = 0; - } - } else { - if (bt->trace_state == Blktrace_running) { - bt->trace_state = Blktrace_stopped; - relay_flush(bt->rchan); - ret = 0; - } - } - - return ret; -} -EXPORT_SYMBOL_GPL(blk_trace_startstop); - -/** - * blk_trace_ioctl: - handle the ioctls associated with tracing - * @bdev: the block device - * @cmd: the ioctl cmd - * @arg: the argument data, if any - * - **/ -int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg) -{ - struct request_queue *q; - int ret, start = 0; - char b[BDEVNAME_SIZE]; - - q = bdev_get_queue(bdev); - if (!q) - return -ENXIO; - - mutex_lock(&bdev->bd_mutex); - - switch (cmd) { - case BLKTRACESETUP: - bdevname(bdev, b); - ret = blk_trace_setup(q, b, bdev->bd_dev, arg); - break; - case BLKTRACESTART: - start = 1; - case BLKTRACESTOP: - ret = blk_trace_startstop(q, start); - break; - case BLKTRACETEARDOWN: - ret = blk_trace_remove(q); - break; - default: - ret = -ENOTTY; - break; - } - - mutex_unlock(&bdev->bd_mutex); - return ret; -} - -/** - * blk_trace_shutdown: - stop and cleanup trace structures - * @q: the request queue associated with the device - * - **/ -void blk_trace_shutdown(struct request_queue *q) -{ - if (q->blk_trace) { - blk_trace_startstop(q, 0); - blk_trace_remove(q); - } -} - -/* - * blktrace probes - */ - -/** - * blk_add_trace_rq - Add a trace for a request oriented action - * @q: queue the io is for - * @rq: the source request - * @what: the action - * - * Description: - * Records an action against a request. Will log the bio offset + size. - * - **/ -static void blk_add_trace_rq(struct request_queue *q, struct request *rq, - u32 what) -{ - struct blk_trace *bt = q->blk_trace; - int rw = rq->cmd_flags & 0x03; - - if (likely(!bt)) - return; - - if (blk_discard_rq(rq)) - rw |= (1 << BIO_RW_DISCARD); - - if (blk_pc_request(rq)) { - what |= BLK_TC_ACT(BLK_TC_PC); - __blk_add_trace(bt, 0, rq->data_len, rw, what, rq->errors, - sizeof(rq->cmd), rq->cmd); - } else { - what |= BLK_TC_ACT(BLK_TC_FS); - __blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, - rw, what, rq->errors, 0, NULL); - } -} - -static void blk_add_trace_rq_abort(struct request_queue *q, struct request *rq) -{ - blk_add_trace_rq(q, rq, BLK_TA_ABORT); -} - -static void blk_add_trace_rq_insert(struct request_queue *q, struct request *rq) -{ - blk_add_trace_rq(q, rq, BLK_TA_INSERT); -} - -static void blk_add_trace_rq_issue(struct request_queue *q, struct request *rq) -{ - blk_add_trace_rq(q, rq, BLK_TA_ISSUE); -} - -static void blk_add_trace_rq_requeue(struct request_queue *q, struct request *rq) -{ - blk_add_trace_rq(q, rq, BLK_TA_REQUEUE); -} - -static void blk_add_trace_rq_complete(struct request_queue *q, struct request *rq) -{ - blk_add_trace_rq(q, rq, BLK_TA_COMPLETE); -} - -/** - * blk_add_trace_bio - Add a trace for a bio oriented action - * @q: queue the io is for - * @bio: the source bio - * @what: the action - * - * Description: - * Records an action against a bio. Will log the bio offset + size. - * - **/ -static void blk_add_trace_bio(struct request_queue *q, struct bio *bio, - u32 what) -{ - struct blk_trace *bt = q->blk_trace; - - if (likely(!bt)) - return; - - __blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, - !bio_flagged(bio, BIO_UPTODATE), 0, NULL); -} - -static void blk_add_trace_bio_bounce(struct request_queue *q, struct bio *bio) -{ - blk_add_trace_bio(q, bio, BLK_TA_BOUNCE); -} - -static void blk_add_trace_bio_complete(struct request_queue *q, struct bio *bio) -{ - blk_add_trace_bio(q, bio, BLK_TA_COMPLETE); -} - -static void blk_add_trace_bio_backmerge(struct request_queue *q, struct bio *bio) -{ - blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE); -} - -static void blk_add_trace_bio_frontmerge(struct request_queue *q, struct bio *bio) -{ - blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE); -} - -static void blk_add_trace_bio_queue(struct request_queue *q, struct bio *bio) -{ - blk_add_trace_bio(q, bio, BLK_TA_QUEUE); -} - -static void blk_add_trace_getrq(struct request_queue *q, struct bio *bio, int rw) -{ - if (bio) - blk_add_trace_bio(q, bio, BLK_TA_GETRQ); - else { - struct blk_trace *bt = q->blk_trace; - - if (bt) - __blk_add_trace(bt, 0, 0, rw, BLK_TA_GETRQ, 0, 0, NULL); - } -} - - -static void blk_add_trace_sleeprq(struct request_queue *q, struct bio *bio, int rw) -{ - if (bio) - blk_add_trace_bio(q, bio, BLK_TA_SLEEPRQ); - else { - struct blk_trace *bt = q->blk_trace; - - if (bt) - __blk_add_trace(bt, 0, 0, rw, BLK_TA_SLEEPRQ, 0, 0, NULL); - } -} - -static void blk_add_trace_plug(struct request_queue *q) -{ - struct blk_trace *bt = q->blk_trace; - - if (bt) - __blk_add_trace(bt, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL); -} - -static void blk_add_trace_unplug_io(struct request_queue *q) -{ - struct blk_trace *bt = q->blk_trace; - - if (bt) { - unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE]; - __be64 rpdu = cpu_to_be64(pdu); - - __blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_IO, 0, - sizeof(rpdu), &rpdu); - } -} - -static void blk_add_trace_unplug_timer(struct request_queue *q) -{ - struct blk_trace *bt = q->blk_trace; - - if (bt) { - unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE]; - __be64 rpdu = cpu_to_be64(pdu); - - __blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_TIMER, 0, - sizeof(rpdu), &rpdu); - } -} - -static void blk_add_trace_split(struct request_queue *q, struct bio *bio, - unsigned int pdu) -{ - struct blk_trace *bt = q->blk_trace; - - if (bt) { - __be64 rpdu = cpu_to_be64(pdu); - - __blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, - BLK_TA_SPLIT, !bio_flagged(bio, BIO_UPTODATE), - sizeof(rpdu), &rpdu); - } -} - -/** - * blk_add_trace_remap - Add a trace for a remap operation - * @q: queue the io is for - * @bio: the source bio - * @dev: target device - * @from: source sector - * @to: target sector - * - * Description: - * Device mapper or raid target sometimes need to split a bio because - * it spans a stripe (or similar). Add a trace for that action. - * - **/ -static void blk_add_trace_remap(struct request_queue *q, struct bio *bio, - dev_t dev, sector_t from, sector_t to) -{ - struct blk_trace *bt = q->blk_trace; - struct blk_io_trace_remap r; - - if (likely(!bt)) - return; - - r.device = cpu_to_be32(dev); - r.device_from = cpu_to_be32(bio->bi_bdev->bd_dev); - r.sector = cpu_to_be64(to); - - __blk_add_trace(bt, from, bio->bi_size, bio->bi_rw, BLK_TA_REMAP, - !bio_flagged(bio, BIO_UPTODATE), sizeof(r), &r); -} - -/** - * blk_add_driver_data - Add binary message with driver-specific data - * @q: queue the io is for - * @rq: io request - * @data: driver-specific data - * @len: length of driver-specific data - * - * Description: - * Some drivers might want to write driver-specific data per request. - * - **/ -void blk_add_driver_data(struct request_queue *q, - struct request *rq, - void *data, size_t len) -{ - struct blk_trace *bt = q->blk_trace; - - if (likely(!bt)) - return; - - if (blk_pc_request(rq)) - __blk_add_trace(bt, 0, rq->data_len, 0, BLK_TA_DRV_DATA, - rq->errors, len, data); - else - __blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, - 0, BLK_TA_DRV_DATA, rq->errors, len, data); -} -EXPORT_SYMBOL_GPL(blk_add_driver_data); - -static int blk_register_tracepoints(void) -{ - int ret; - - ret = register_trace_block_rq_abort(blk_add_trace_rq_abort); - WARN_ON(ret); - ret = register_trace_block_rq_insert(blk_add_trace_rq_insert); - WARN_ON(ret); - ret = register_trace_block_rq_issue(blk_add_trace_rq_issue); - WARN_ON(ret); - ret = register_trace_block_rq_requeue(blk_add_trace_rq_requeue); - WARN_ON(ret); - ret = register_trace_block_rq_complete(blk_add_trace_rq_complete); - WARN_ON(ret); - ret = register_trace_block_bio_bounce(blk_add_trace_bio_bounce); - WARN_ON(ret); - ret = register_trace_block_bio_complete(blk_add_trace_bio_complete); - WARN_ON(ret); - ret = register_trace_block_bio_backmerge(blk_add_trace_bio_backmerge); - WARN_ON(ret); - ret = register_trace_block_bio_frontmerge(blk_add_trace_bio_frontmerge); - WARN_ON(ret); - ret = register_trace_block_bio_queue(blk_add_trace_bio_queue); - WARN_ON(ret); - ret = register_trace_block_getrq(blk_add_trace_getrq); - WARN_ON(ret); - ret = register_trace_block_sleeprq(blk_add_trace_sleeprq); - WARN_ON(ret); - ret = register_trace_block_plug(blk_add_trace_plug); - WARN_ON(ret); - ret = register_trace_block_unplug_timer(blk_add_trace_unplug_timer); - WARN_ON(ret); - ret = register_trace_block_unplug_io(blk_add_trace_unplug_io); - WARN_ON(ret); - ret = register_trace_block_split(blk_add_trace_split); - WARN_ON(ret); - ret = register_trace_block_remap(blk_add_trace_remap); - WARN_ON(ret); - return 0; -} - -static void blk_unregister_tracepoints(void) -{ - unregister_trace_block_remap(blk_add_trace_remap); - unregister_trace_block_split(blk_add_trace_split); - unregister_trace_block_unplug_io(blk_add_trace_unplug_io); - unregister_trace_block_unplug_timer(blk_add_trace_unplug_timer); - unregister_trace_block_plug(blk_add_trace_plug); - unregister_trace_block_sleeprq(blk_add_trace_sleeprq); - unregister_trace_block_getrq(blk_add_trace_getrq); - unregister_trace_block_bio_queue(blk_add_trace_bio_queue); - unregister_trace_block_bio_frontmerge(blk_add_trace_bio_frontmerge); - unregister_trace_block_bio_backmerge(blk_add_trace_bio_backmerge); - unregister_trace_block_bio_complete(blk_add_trace_bio_complete); - unregister_trace_block_bio_bounce(blk_add_trace_bio_bounce); - unregister_trace_block_rq_complete(blk_add_trace_rq_complete); - unregister_trace_block_rq_requeue(blk_add_trace_rq_requeue); - unregister_trace_block_rq_issue(blk_add_trace_rq_issue); - unregister_trace_block_rq_insert(blk_add_trace_rq_insert); - unregister_trace_block_rq_abort(blk_add_trace_rq_abort); - - tracepoint_synchronize_unregister(); -} Index: linux-2.6-tip/block/bsg.c =================================================================== --- linux-2.6-tip.orig/block/bsg.c +++ linux-2.6-tip/block/bsg.c @@ -249,7 +249,7 @@ bsg_map_hdr(struct bsg_device *bd, struc { struct request_queue *q = bd->queue; struct request *rq, *next_rq = NULL; - int ret, rw; + int ret, uninitialized_var(rw); unsigned int dxfer_len; void *dxferp = NULL; Index: linux-2.6-tip/block/cfq-iosched.c =================================================================== --- linux-2.6-tip.orig/block/cfq-iosched.c +++ linux-2.6-tip/block/cfq-iosched.c @@ -1539,6 +1539,7 @@ cfq_async_queue_prio(struct cfq_data *cf return &cfqd->async_idle_cfqq; default: BUG(); + return NULL; } } Index: linux-2.6-tip/drivers/acpi/acpica/exprep.c =================================================================== --- linux-2.6-tip.orig/drivers/acpi/acpica/exprep.c +++ linux-2.6-tip/drivers/acpi/acpica/exprep.c @@ -320,7 +320,7 @@ acpi_ex_prep_common_field_object(union a u32 field_bit_position, u32 field_bit_length) { u32 access_bit_width; - u32 byte_alignment; + u32 uninitialized_var(byte_alignment); u32 nearest_byte_address; ACPI_FUNCTION_TRACE(ex_prep_common_field_object); Index: linux-2.6-tip/drivers/acpi/acpica/nsxfeval.c =================================================================== --- linux-2.6-tip.orig/drivers/acpi/acpica/nsxfeval.c +++ linux-2.6-tip/drivers/acpi/acpica/nsxfeval.c @@ -469,6 +469,9 @@ acpi_walk_namespace(acpi_object_type typ ACPI_FUNCTION_TRACE(acpi_walk_namespace); + if (acpi_disabled) + return_ACPI_STATUS(AE_NO_NAMESPACE); + /* Parameter validation */ if ((type > ACPI_TYPE_LOCAL_MAX) || (!max_depth) || (!user_function)) { Index: linux-2.6-tip/drivers/acpi/acpica/tbxface.c =================================================================== --- linux-2.6-tip.orig/drivers/acpi/acpica/tbxface.c +++ linux-2.6-tip/drivers/acpi/acpica/tbxface.c @@ -365,7 +365,7 @@ ACPI_EXPORT_SYMBOL(acpi_unload_table_id) /******************************************************************************* * - * FUNCTION: acpi_get_table + * FUNCTION: acpi_get_table_with_size * * PARAMETERS: Signature - ACPI signature of needed table * Instance - Which instance (for SSDTs) @@ -377,8 +377,9 @@ ACPI_EXPORT_SYMBOL(acpi_unload_table_id) * *****************************************************************************/ acpi_status -acpi_get_table(char *signature, - u32 instance, struct acpi_table_header **out_table) +acpi_get_table_with_size(char *signature, + u32 instance, struct acpi_table_header **out_table, + acpi_size *tbl_size) { u32 i; u32 j; @@ -408,6 +409,7 @@ acpi_get_table(char *signature, acpi_tb_verify_table(&acpi_gbl_root_table_list.tables[i]); if (ACPI_SUCCESS(status)) { *out_table = acpi_gbl_root_table_list.tables[i].pointer; + *tbl_size = acpi_gbl_root_table_list.tables[i].length; } if (!acpi_gbl_permanent_mmap) { @@ -420,6 +422,15 @@ acpi_get_table(char *signature, return (AE_NOT_FOUND); } +acpi_status +acpi_get_table(char *signature, + u32 instance, struct acpi_table_header **out_table) +{ + acpi_size tbl_size; + + return acpi_get_table_with_size(signature, + instance, out_table, &tbl_size); +} ACPI_EXPORT_SYMBOL(acpi_get_table) /******************************************************************************* Index: linux-2.6-tip/drivers/acpi/osl.c =================================================================== --- linux-2.6-tip.orig/drivers/acpi/osl.c +++ linux-2.6-tip/drivers/acpi/osl.c @@ -272,14 +272,21 @@ acpi_os_map_memory(acpi_physical_address } EXPORT_SYMBOL_GPL(acpi_os_map_memory); -void acpi_os_unmap_memory(void __iomem * virt, acpi_size size) +void __ref acpi_os_unmap_memory(void __iomem *virt, acpi_size size) { - if (acpi_gbl_permanent_mmap) { + if (acpi_gbl_permanent_mmap) iounmap(virt); - } + else + __acpi_unmap_table(virt, size); } EXPORT_SYMBOL_GPL(acpi_os_unmap_memory); +void __init early_acpi_os_unmap_memory(void __iomem *virt, acpi_size size) +{ + if (!acpi_gbl_permanent_mmap) + __acpi_unmap_table(virt, size); +} + #ifdef ACPI_FUTURE_USAGE acpi_status acpi_os_get_physical_address(void *virt, acpi_physical_address * phys) @@ -792,12 +799,12 @@ void acpi_os_delete_lock(acpi_spinlock h acpi_status acpi_os_create_semaphore(u32 max_units, u32 initial_units, acpi_handle * handle) { - struct semaphore *sem = NULL; + struct compat_semaphore *sem = NULL; - sem = acpi_os_allocate(sizeof(struct semaphore)); + sem = acpi_os_allocate(sizeof(struct compat_semaphore)); if (!sem) return AE_NO_MEMORY; - memset(sem, 0, sizeof(struct semaphore)); + memset(sem, 0, sizeof(struct compat_semaphore)); sema_init(sem, initial_units); @@ -818,7 +825,7 @@ acpi_os_create_semaphore(u32 max_units, acpi_status acpi_os_delete_semaphore(acpi_handle handle) { - struct semaphore *sem = (struct semaphore *)handle; + struct compat_semaphore *sem = (struct compat_semaphore *)handle; if (!sem) return AE_BAD_PARAMETER; @@ -838,7 +845,7 @@ acpi_status acpi_os_delete_semaphore(acp acpi_status acpi_os_wait_semaphore(acpi_handle handle, u32 units, u16 timeout) { acpi_status status = AE_OK; - struct semaphore *sem = (struct semaphore *)handle; + struct compat_semaphore *sem = (struct compat_semaphore *)handle; long jiffies; int ret = 0; @@ -879,7 +886,7 @@ acpi_status acpi_os_wait_semaphore(acpi_ */ acpi_status acpi_os_signal_semaphore(acpi_handle handle, u32 units) { - struct semaphore *sem = (struct semaphore *)handle; + struct compat_semaphore *sem = (struct compat_semaphore *)handle; if (!sem || (units < 1)) return AE_BAD_PARAMETER; Index: linux-2.6-tip/drivers/acpi/processor_idle.c =================================================================== --- linux-2.6-tip.orig/drivers/acpi/processor_idle.c +++ linux-2.6-tip/drivers/acpi/processor_idle.c @@ -824,8 +824,11 @@ static int acpi_idle_bm_check(void) */ static inline void acpi_idle_do_entry(struct acpi_processor_cx *cx) { + u64 perf_flags; + /* Don't trace irqs off for idle */ stop_critical_timings(); + perf_flags = hw_perf_save_disable(); if (cx->entry_method == ACPI_CSTATE_FFH) { /* Call into architectural FFH based C-state */ acpi_processor_ffh_cstate_enter(cx); @@ -840,6 +843,7 @@ static inline void acpi_idle_do_entry(st gets asserted in time to freeze execution properly. */ unused = inl(acpi_gbl_FADT.xpm_timer_block.address); } + hw_perf_restore(perf_flags); start_critical_timings(); } @@ -952,7 +956,7 @@ static int acpi_idle_enter_simple(struct } static int c3_cpu_count; -static DEFINE_SPINLOCK(c3_lock); +static DEFINE_RAW_SPINLOCK(c3_lock); /** * acpi_idle_enter_bm - enters C3 with proper BM handling Index: linux-2.6-tip/drivers/acpi/sbs.c =================================================================== --- linux-2.6-tip.orig/drivers/acpi/sbs.c +++ linux-2.6-tip/drivers/acpi/sbs.c @@ -389,6 +389,8 @@ static int acpi_battery_get_state(struct return result; } +#if defined(CONFIG_ACPI_SYSFS_POWER) || defined(CONFIG_ACPI_PROCFS_POWER) + static int acpi_battery_get_alarm(struct acpi_battery *battery) { return acpi_smbus_read(battery->sbs->hc, SMBUS_READ_WORD, @@ -425,6 +427,8 @@ static int acpi_battery_set_alarm(struct return ret; } +#endif + static int acpi_ac_get_present(struct acpi_sbs *sbs) { int result; @@ -816,7 +820,10 @@ static int acpi_battery_add(struct acpi_ static void acpi_battery_remove(struct acpi_sbs *sbs, int id) { +#if defined(CONFIG_ACPI_SYSFS_POWER) || defined(CONFIG_ACPI_PROCFS_POWER) struct acpi_battery *battery = &sbs->battery[id]; +#endif + #ifdef CONFIG_ACPI_SYSFS_POWER if (battery->bat.dev) { if (battery->have_sysfs_alarm) Index: linux-2.6-tip/drivers/acpi/sleep.c =================================================================== --- linux-2.6-tip.orig/drivers/acpi/sleep.c +++ linux-2.6-tip/drivers/acpi/sleep.c @@ -24,6 +24,7 @@ #include "sleep.h" u8 sleep_states[ACPI_S_STATE_COUNT]; +static u32 acpi_target_sleep_state = ACPI_STATE_S0; static void acpi_sleep_tts_switch(u32 acpi_state) { @@ -77,7 +78,6 @@ static int acpi_sleep_prepare(u32 acpi_s } #ifdef CONFIG_ACPI_SLEEP -static u32 acpi_target_sleep_state = ACPI_STATE_S0; /* * ACPI 1.0 wants us to execute _PTS before suspending devices, so we allow the * user to request that behavior by using the 'acpi_old_suspend_ordering' Index: linux-2.6-tip/drivers/acpi/tables.c =================================================================== --- linux-2.6-tip.orig/drivers/acpi/tables.c +++ linux-2.6-tip/drivers/acpi/tables.c @@ -181,14 +181,15 @@ acpi_table_parse_entries(char *id, struct acpi_subtable_header *entry; unsigned int count = 0; unsigned long table_end; + acpi_size tbl_size; if (!handler) return -EINVAL; if (strncmp(id, ACPI_SIG_MADT, 4) == 0) - acpi_get_table(id, acpi_apic_instance, &table_header); + acpi_get_table_with_size(id, acpi_apic_instance, &table_header, &tbl_size); else - acpi_get_table(id, 0, &table_header); + acpi_get_table_with_size(id, 0, &table_header, &tbl_size); if (!table_header) { printk(KERN_WARNING PREFIX "%4.4s not present\n", id); @@ -206,8 +207,10 @@ acpi_table_parse_entries(char *id, table_end) { if (entry->type == entry_id && (!max_entries || count++ < max_entries)) - if (handler(entry, table_end)) + if (handler(entry, table_end)) { + early_acpi_os_unmap_memory((char *)table_header, tbl_size); return -EINVAL; + } entry = (struct acpi_subtable_header *) ((unsigned long)entry + entry->length); @@ -217,6 +220,7 @@ acpi_table_parse_entries(char *id, "%i found\n", id, entry_id, count - max_entries, count); } + early_acpi_os_unmap_memory((char *)table_header, tbl_size); return count; } @@ -241,17 +245,19 @@ acpi_table_parse_madt(enum acpi_madt_typ int __init acpi_table_parse(char *id, acpi_table_handler handler) { struct acpi_table_header *table = NULL; + acpi_size tbl_size; if (!handler) return -EINVAL; if (strncmp(id, ACPI_SIG_MADT, 4) == 0) - acpi_get_table(id, acpi_apic_instance, &table); + acpi_get_table_with_size(id, acpi_apic_instance, &table, &tbl_size); else - acpi_get_table(id, 0, &table); + acpi_get_table_with_size(id, 0, &table, &tbl_size); if (table) { handler(table); + early_acpi_os_unmap_memory(table, tbl_size); return 0; } else return 1; @@ -265,8 +271,9 @@ int __init acpi_table_parse(char *id, ac static void __init check_multiple_madt(void) { struct acpi_table_header *table = NULL; + acpi_size tbl_size; - acpi_get_table(ACPI_SIG_MADT, 2, &table); + acpi_get_table_with_size(ACPI_SIG_MADT, 2, &table, &tbl_size); if (table) { printk(KERN_WARNING PREFIX "BIOS bug: multiple APIC/MADT found," @@ -275,6 +282,7 @@ static void __init check_multiple_madt(v "If \"acpi_apic_instance=%d\" works better, " "notify linux-acpi@vger.kernel.org\n", acpi_apic_instance ? 0 : 2); + early_acpi_os_unmap_memory(table, tbl_size); } else acpi_apic_instance = 0; Index: linux-2.6-tip/drivers/ata/libata-core.c =================================================================== --- linux-2.6-tip.orig/drivers/ata/libata-core.c +++ linux-2.6-tip/drivers/ata/libata-core.c @@ -1482,7 +1482,7 @@ static int ata_hpa_resize(struct ata_dev struct ata_eh_context *ehc = &dev->link->eh_context; int print_info = ehc->i.flags & ATA_EHI_PRINTINFO; u64 sectors = ata_id_n_sectors(dev->id); - u64 native_sectors; + u64 uninitialized_var(native_sectors); int rc; /* do we need to do it? */ Index: linux-2.6-tip/drivers/ata/libata-scsi.c =================================================================== --- linux-2.6-tip.orig/drivers/ata/libata-scsi.c +++ linux-2.6-tip/drivers/ata/libata-scsi.c @@ -3247,7 +3247,7 @@ void ata_scsi_scan_host(struct ata_port int tries = 5; struct ata_device *last_failed_dev = NULL; struct ata_link *link; - struct ata_device *dev; + struct ata_device *uninitialized_var(dev); if (ap->flags & ATA_FLAG_DISABLED) return; Index: linux-2.6-tip/drivers/ata/pata_atiixp.c =================================================================== --- linux-2.6-tip.orig/drivers/ata/pata_atiixp.c +++ linux-2.6-tip/drivers/ata/pata_atiixp.c @@ -140,7 +140,7 @@ static void atiixp_set_dmamode(struct at wanted_pio = 3; else if (adev->dma_mode == XFER_MW_DMA_0) wanted_pio = 0; - else BUG(); + else panic("atiixp_set_dmamode: unknown DMA mode!"); if (adev->pio_mode != wanted_pio) atiixp_set_pio_timing(ap, adev, wanted_pio); Index: linux-2.6-tip/drivers/ata/sata_via.c =================================================================== --- linux-2.6-tip.orig/drivers/ata/sata_via.c +++ linux-2.6-tip/drivers/ata/sata_via.c @@ -566,7 +566,7 @@ static int svia_init_one(struct pci_dev static int printed_version; unsigned int i; int rc; - struct ata_host *host; + struct ata_host *uninitialized_var(host); int board_id = (int) ent->driver_data; const unsigned *bar_sizes; Index: linux-2.6-tip/drivers/atm/ambassador.c =================================================================== --- linux-2.6-tip.orig/drivers/atm/ambassador.c +++ linux-2.6-tip/drivers/atm/ambassador.c @@ -2097,7 +2097,7 @@ static int __devinit amb_init (amb_dev * { loader_block lb; - u32 version; + u32 version = -1; if (amb_reset (dev, 1)) { PRINTK (KERN_ERR, "card reset failed!"); Index: linux-2.6-tip/drivers/atm/horizon.c =================================================================== --- linux-2.6-tip.orig/drivers/atm/horizon.c +++ linux-2.6-tip/drivers/atm/horizon.c @@ -2131,7 +2131,7 @@ static int atm_pcr_check (struct atm_tra static int hrz_open (struct atm_vcc *atm_vcc) { int error; - u16 channel; + u16 uninitialized_var(channel); struct atm_qos * qos; struct atm_trafprm * txtp; Index: linux-2.6-tip/drivers/base/cpu.c =================================================================== --- linux-2.6-tip.orig/drivers/base/cpu.c +++ linux-2.6-tip/drivers/base/cpu.c @@ -107,7 +107,7 @@ static SYSDEV_ATTR(crash_notes, 0400, sh /* * Print cpu online, possible, present, and system maps */ -static ssize_t print_cpus_map(char *buf, cpumask_t *map) +static ssize_t print_cpus_map(char *buf, const struct cpumask *map) { int n = cpulist_scnprintf(buf, PAGE_SIZE-2, map); Index: linux-2.6-tip/drivers/base/platform.c =================================================================== --- linux-2.6-tip.orig/drivers/base/platform.c +++ linux-2.6-tip/drivers/base/platform.c @@ -611,7 +611,8 @@ static int platform_match(struct device #ifdef CONFIG_PM_SLEEP -static int platform_legacy_suspend(struct device *dev, pm_message_t mesg) +static inline int +platform_legacy_suspend(struct device *dev, pm_message_t mesg) { int ret = 0; @@ -621,7 +622,8 @@ static int platform_legacy_suspend(struc return ret; } -static int platform_legacy_suspend_late(struct device *dev, pm_message_t mesg) +static inline int +platform_legacy_suspend_late(struct device *dev, pm_message_t mesg) { struct platform_driver *drv = to_platform_driver(dev->driver); struct platform_device *pdev; @@ -634,7 +636,7 @@ static int platform_legacy_suspend_late( return ret; } -static int platform_legacy_resume_early(struct device *dev) +static inline int platform_legacy_resume_early(struct device *dev) { struct platform_driver *drv = to_platform_driver(dev->driver); struct platform_device *pdev; @@ -647,7 +649,7 @@ static int platform_legacy_resume_early( return ret; } -static int platform_legacy_resume(struct device *dev) +static inline int platform_legacy_resume(struct device *dev) { int ret = 0; Index: linux-2.6-tip/drivers/base/topology.c =================================================================== --- linux-2.6-tip.orig/drivers/base/topology.c +++ linux-2.6-tip/drivers/base/topology.c @@ -31,7 +31,10 @@ #include #include -#define define_one_ro(_name) \ +#define define_one_ro_named(_name, _func) \ +static SYSDEV_ATTR(_name, 0444, _func, NULL) + +#define define_one_ro(_name) \ static SYSDEV_ATTR(_name, 0444, show_##_name, NULL) #define define_id_show_func(name) \ @@ -42,8 +45,8 @@ static ssize_t show_##name(struct sys_de return sprintf(buf, "%d\n", topology_##name(cpu)); \ } -#if defined(topology_thread_siblings) || defined(topology_core_siblings) -static ssize_t show_cpumap(int type, cpumask_t *mask, char *buf) +#if defined(topology_thread_cpumask) || defined(topology_core_cpumask) +static ssize_t show_cpumap(int type, const struct cpumask *mask, char *buf) { ptrdiff_t len = PTR_ALIGN(buf + PAGE_SIZE - 1, PAGE_SIZE) - buf; int n = 0; @@ -65,7 +68,7 @@ static ssize_t show_##name(struct sys_de struct sysdev_attribute *attr, char *buf) \ { \ unsigned int cpu = dev->id; \ - return show_cpumap(0, &(topology_##name(cpu)), buf); \ + return show_cpumap(0, topology_##name(cpu), buf); \ } #define define_siblings_show_list(name) \ @@ -74,7 +77,7 @@ static ssize_t show_##name##_list(struct char *buf) \ { \ unsigned int cpu = dev->id; \ - return show_cpumap(1, &(topology_##name(cpu)), buf); \ + return show_cpumap(1, topology_##name(cpu), buf); \ } #else @@ -82,9 +85,7 @@ static ssize_t show_##name##_list(struct static ssize_t show_##name(struct sys_device *dev, \ struct sysdev_attribute *attr, char *buf) \ { \ - unsigned int cpu = dev->id; \ - cpumask_t mask = topology_##name(cpu); \ - return show_cpumap(0, &mask, buf); \ + return show_cpumap(0, topology_##name(dev->id), buf); \ } #define define_siblings_show_list(name) \ @@ -92,9 +93,7 @@ static ssize_t show_##name##_list(struct struct sysdev_attribute *attr, \ char *buf) \ { \ - unsigned int cpu = dev->id; \ - cpumask_t mask = topology_##name(cpu); \ - return show_cpumap(1, &mask, buf); \ + return show_cpumap(1, topology_##name(dev->id), buf); \ } #endif @@ -107,13 +106,13 @@ define_one_ro(physical_package_id); define_id_show_func(core_id); define_one_ro(core_id); -define_siblings_show_func(thread_siblings); -define_one_ro(thread_siblings); -define_one_ro(thread_siblings_list); - -define_siblings_show_func(core_siblings); -define_one_ro(core_siblings); -define_one_ro(core_siblings_list); +define_siblings_show_func(thread_cpumask); +define_one_ro_named(thread_siblings, show_thread_cpumask); +define_one_ro_named(thread_siblings_list, show_thread_cpumask_list); + +define_siblings_show_func(core_cpumask); +define_one_ro_named(core_siblings, show_core_cpumask); +define_one_ro_named(core_siblings_list, show_core_cpumask_list); static struct attribute *default_attrs[] = { &attr_physical_package_id.attr, Index: linux-2.6-tip/drivers/block/DAC960.c =================================================================== --- linux-2.6-tip.orig/drivers/block/DAC960.c +++ linux-2.6-tip/drivers/block/DAC960.c @@ -6646,7 +6646,8 @@ static long DAC960_gam_ioctl(struct file (DAC960_ControllerInfo_T __user *) Argument; DAC960_ControllerInfo_T ControllerInfo; DAC960_Controller_T *Controller; - int ControllerNumber; + int uninitialized_var(ControllerNumber); + if (UserSpaceControllerInfo == NULL) ErrorCode = -EINVAL; else ErrorCode = get_user(ControllerNumber, Index: linux-2.6-tip/drivers/char/ip2/ip2main.c =================================================================== --- linux-2.6-tip.orig/drivers/char/ip2/ip2main.c +++ linux-2.6-tip/drivers/char/ip2/ip2main.c @@ -3202,4 +3202,4 @@ static struct pci_device_id ip2main_pci_ { } }; -MODULE_DEVICE_TABLE(pci, ip2main_pci_tbl); +MODULE_STATIC_DEVICE_TABLE(pci, ip2main_pci_tbl); Index: linux-2.6-tip/drivers/char/ipmi/ipmi_msghandler.c =================================================================== --- linux-2.6-tip.orig/drivers/char/ipmi/ipmi_msghandler.c +++ linux-2.6-tip/drivers/char/ipmi/ipmi_msghandler.c @@ -1796,7 +1796,8 @@ int ipmi_request_settime(ipmi_user_t int retries, unsigned int retry_time_ms) { - unsigned char saddr, lun; + unsigned char uninitialized_var(saddr), + uninitialized_var(lun); int rv; if (!user) @@ -1828,7 +1829,8 @@ int ipmi_request_supply_msgs(ipmi_user_t struct ipmi_recv_msg *supplied_recv, int priority) { - unsigned char saddr, lun; + unsigned char uninitialized_var(saddr), + uninitialized_var(lun); int rv; if (!user) Index: linux-2.6-tip/drivers/char/isicom.c =================================================================== --- linux-2.6-tip.orig/drivers/char/isicom.c +++ linux-2.6-tip/drivers/char/isicom.c @@ -1585,7 +1585,7 @@ static unsigned int card_count; static int __devinit isicom_probe(struct pci_dev *pdev, const struct pci_device_id *ent) { - unsigned int signature, index; + unsigned int uninitialized_var(signature), index; int retval = -EPERM; struct isi_board *board = NULL; Index: linux-2.6-tip/drivers/char/random.c =================================================================== --- linux-2.6-tip.orig/drivers/char/random.c +++ linux-2.6-tip/drivers/char/random.c @@ -241,6 +241,10 @@ #include #include +#ifdef CONFIG_GENERIC_HARDIRQS +# include +#endif + #include #include #include @@ -558,7 +562,7 @@ struct timer_rand_state { unsigned dont_count_entropy:1; }; -#ifndef CONFIG_SPARSE_IRQ +#ifndef CONFIG_GENERIC_HARDIRQS static struct timer_rand_state *irq_timer_state[NR_IRQS]; @@ -619,8 +623,11 @@ static void add_timer_randomness(struct preempt_disable(); /* if over the trickle threshold, use only 1 in 4096 samples */ if (input_pool.entropy_count > trickle_thresh && - (__get_cpu_var(trickle_count)++ & 0xfff)) - goto out; + (__get_cpu_var(trickle_count)++ & 0xfff)) { + preempt_enable(); + return; + } + preempt_enable(); sample.jiffies = jiffies; sample.cycles = get_cycles(); @@ -662,8 +669,6 @@ static void add_timer_randomness(struct credit_entropy_bits(&input_pool, min_t(int, fls(delta>>1), 11)); } -out: - preempt_enable(); } void add_input_randomness(unsigned int type, unsigned int code, Index: linux-2.6-tip/drivers/char/rocket.c =================================================================== --- linux-2.6-tip.orig/drivers/char/rocket.c +++ linux-2.6-tip/drivers/char/rocket.c @@ -150,12 +150,14 @@ static Word_t aiop_intr_bits[AIOP_CTL_SI AIOP_INTR_BIT_3 }; +#ifdef CONFIG_PCI static Word_t upci_aiop_intr_bits[AIOP_CTL_SIZE] = { UPCI_AIOP_INTR_BIT_0, UPCI_AIOP_INTR_BIT_1, UPCI_AIOP_INTR_BIT_2, UPCI_AIOP_INTR_BIT_3 }; +#endif static Byte_t RData[RDATASIZE] = { 0x00, 0x09, 0xf6, 0x82, @@ -227,7 +229,6 @@ static unsigned long nextLineNumber; static int __init init_ISA(int i); static void rp_wait_until_sent(struct tty_struct *tty, int timeout); static void rp_flush_buffer(struct tty_struct *tty); -static void rmSpeakerReset(CONTROLLER_T * CtlP, unsigned long model); static unsigned char GetLineNumber(int ctrl, int aiop, int ch); static unsigned char SetLineNumber(int ctrl, int aiop, int ch); static void rp_start(struct tty_struct *tty); @@ -241,11 +242,14 @@ static void sDisInterrupts(CHANNEL_T * C static void sModemReset(CONTROLLER_T * CtlP, int chan, int on); static void sPCIModemReset(CONTROLLER_T * CtlP, int chan, int on); static int sWriteTxPrioByte(CHANNEL_T * ChP, Byte_t Data); +#ifdef CONFIG_PCI +static void rmSpeakerReset(CONTROLLER_T * CtlP, unsigned long model); static int sPCIInitController(CONTROLLER_T * CtlP, int CtlNum, ByteIO_t * AiopIOList, int AiopIOListSize, WordIO_t ConfigIO, int IRQNum, Byte_t Frequency, int PeriodicOnly, int altChanRingIndicator, int UPCIRingInd); +#endif static int sInitController(CONTROLLER_T * CtlP, int CtlNum, ByteIO_t MudbacIO, ByteIO_t * AiopIOList, int AiopIOListSize, int IRQNum, Byte_t Frequency, int PeriodicOnly); @@ -1751,7 +1755,7 @@ static struct pci_device_id __devinitdat { PCI_DEVICE(PCI_VENDOR_ID_RP, PCI_ANY_ID) }, { } }; -MODULE_DEVICE_TABLE(pci, rocket_pci_ids); +MODULE_STATIC_DEVICE_TABLE(pci, rocket_pci_ids); /* * Called when a PCI card is found. Retrieves and stores model information, @@ -2533,6 +2537,7 @@ static int sInitController(CONTROLLER_T return (CtlP->NumAiop); } +#ifdef CONFIG_PCI /*************************************************************************** Function: sPCIInitController Purpose: Initialization of controller global registers and controller @@ -2652,6 +2657,7 @@ static int sPCIInitController(CONTROLLER else return (CtlP->NumAiop); } +#endif /* CONFIG_PCI */ /*************************************************************************** Function: sReadAiopID @@ -3142,6 +3148,7 @@ static void sPCIModemReset(CONTROLLER_T sOutB(addr + chan, 0); /* apply or remove reset */ } +#ifdef CONFIG_PCI /* Resets the speaker controller on RocketModem II and III devices */ static void rmSpeakerReset(CONTROLLER_T * CtlP, unsigned long model) { @@ -3160,6 +3167,7 @@ static void rmSpeakerReset(CONTROLLER_T sOutB(addr, 0); } } +#endif /* CONFIG_PCI */ /* Returns the line number given the controller (board), aiop and channel number */ static unsigned char GetLineNumber(int ctrl, int aiop, int ch) Index: linux-2.6-tip/drivers/char/rtc.c =================================================================== --- linux-2.6-tip.orig/drivers/char/rtc.c +++ linux-2.6-tip/drivers/char/rtc.c @@ -188,7 +188,9 @@ static int rtc_proc_open(struct inode *i * timer (but you would need to have an awful timing before you'd trip on it) */ static unsigned long rtc_status; /* bitmapped status byte. */ +#if defined(RTC_IRQ) || defined(CONFIG_PROC_FS) static unsigned long rtc_freq; /* Current periodic IRQ rate */ +#endif static unsigned long rtc_irq_data; /* our output to the world */ static unsigned long rtc_max_user_freq = 64; /* > this, need CAP_SYS_RESOURCE */ @@ -1074,7 +1076,9 @@ no_irq: #endif #if defined(__alpha__) || defined(__mips__) +#ifdef CONFIG_PROC_FS rtc_freq = HZ; +#endif /* Each operating system on an Alpha uses its own epoch. Let's try to guess which one we are using now. */ @@ -1197,10 +1201,12 @@ static void rtc_dropped_irq(unsigned lon spin_unlock_irq(&rtc_lock); +#ifndef CONFIG_PREEMPT_RT if (printk_ratelimit()) { printk(KERN_WARNING "rtc: lost some interrupts at %ldHz.\n", freq); } +#endif /* Now we have new data */ wake_up_interruptible(&rtc_wait); Index: linux-2.6-tip/drivers/char/specialix.c =================================================================== --- linux-2.6-tip.orig/drivers/char/specialix.c +++ linux-2.6-tip/drivers/char/specialix.c @@ -2359,7 +2359,7 @@ static struct pci_device_id specialx_pci { PCI_DEVICE(PCI_VENDOR_ID_SPECIALIX, PCI_DEVICE_ID_SPECIALIX_IO8) }, { } }; -MODULE_DEVICE_TABLE(pci, specialx_pci_tbl); +MODULE_STATIC_DEVICE_TABLE(pci, specialx_pci_tbl); module_init(specialix_init_module); module_exit(specialix_exit_module); Index: linux-2.6-tip/drivers/char/sysrq.c =================================================================== --- linux-2.6-tip.orig/drivers/char/sysrq.c +++ linux-2.6-tip/drivers/char/sysrq.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include #include @@ -35,7 +36,7 @@ #include #include #include -#include +#include #include #include @@ -244,6 +245,7 @@ static void sysrq_handle_showregs(int ke struct pt_regs *regs = get_irq_regs(); if (regs) show_regs(regs); + perf_counter_print_debug(); } static struct sysrq_key_op sysrq_showregs_op = { .handler = sysrq_handle_showregs, @@ -283,7 +285,7 @@ static void sysrq_ftrace_dump(int key, s } static struct sysrq_key_op sysrq_ftrace_dump_op = { .handler = sysrq_ftrace_dump, - .help_msg = "dumpZ-ftrace-buffer", + .help_msg = "dump-ftrace-buffer(Z)", .action_msg = "Dump ftrace buffer", .enable_mask = SYSRQ_ENABLE_DUMP, }; Index: linux-2.6-tip/drivers/clocksource/acpi_pm.c =================================================================== --- linux-2.6-tip.orig/drivers/clocksource/acpi_pm.c +++ linux-2.6-tip/drivers/clocksource/acpi_pm.c @@ -143,7 +143,7 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_SE #endif #ifndef CONFIG_X86_64 -#include "mach_timer.h" +#include #define PMTMR_EXPECTED_RATE \ ((CALIBRATE_LATCH * (PMTMR_TICKS_PER_SEC >> 10)) / (CLOCK_TICK_RATE>>10)) /* Index: linux-2.6-tip/drivers/clocksource/cyclone.c =================================================================== --- linux-2.6-tip.orig/drivers/clocksource/cyclone.c +++ linux-2.6-tip/drivers/clocksource/cyclone.c @@ -7,7 +7,7 @@ #include #include -#include "mach_timer.h" +#include #define CYCLONE_CBAR_ADDR 0xFEB00CD0 /* base address ptr */ #define CYCLONE_PMCC_OFFSET 0x51A0 /* offset to control register */ Index: linux-2.6-tip/drivers/eisa/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/eisa/Kconfig +++ linux-2.6-tip/drivers/eisa/Kconfig @@ -3,7 +3,7 @@ # config EISA_VLB_PRIMING bool "Vesa Local Bus priming" - depends on X86_PC && EISA + depends on X86 && EISA default n ---help--- Activate this option if your system contains a Vesa Local @@ -24,11 +24,11 @@ config EISA_PCI_EISA When in doubt, say Y. # Using EISA_VIRTUAL_ROOT on something other than an Alpha or -# an X86_PC may lead to crashes... +# an X86 may lead to crashes... config EISA_VIRTUAL_ROOT bool "EISA virtual root device" - depends on EISA && (ALPHA || X86_PC) + depends on EISA && (ALPHA || X86) default y ---help--- Activate this option if your system only have EISA bus Index: linux-2.6-tip/drivers/firmware/dcdbas.c =================================================================== --- linux-2.6-tip.orig/drivers/firmware/dcdbas.c +++ linux-2.6-tip/drivers/firmware/dcdbas.c @@ -244,7 +244,7 @@ static ssize_t host_control_on_shutdown_ */ int dcdbas_smi_request(struct smi_cmd *smi_cmd) { - cpumask_t old_mask; + cpumask_var_t old_mask; int ret = 0; if (smi_cmd->magic != SMI_CMD_MAGIC) { @@ -254,8 +254,11 @@ int dcdbas_smi_request(struct smi_cmd *s } /* SMI requires CPU 0 */ - old_mask = current->cpus_allowed; - set_cpus_allowed_ptr(current, &cpumask_of_cpu(0)); + if (!alloc_cpumask_var(&old_mask, GFP_KERNEL)) + return -ENOMEM; + + cpumask_copy(old_mask, ¤t->cpus_allowed); + set_cpus_allowed_ptr(current, cpumask_of(0)); if (smp_processor_id() != 0) { dev_dbg(&dcdbas_pdev->dev, "%s: failed to get CPU 0\n", __func__); @@ -275,7 +278,8 @@ int dcdbas_smi_request(struct smi_cmd *s ); out: - set_cpus_allowed_ptr(current, &old_mask); + set_cpus_allowed_ptr(current, old_mask); + free_cpumask_var(old_mask); return ret; } Index: linux-2.6-tip/drivers/firmware/iscsi_ibft.c =================================================================== --- linux-2.6-tip.orig/drivers/firmware/iscsi_ibft.c +++ linux-2.6-tip/drivers/firmware/iscsi_ibft.c @@ -938,8 +938,8 @@ static int __init ibft_init(void) return -ENOMEM; if (ibft_addr) { - printk(KERN_INFO "iBFT detected at 0x%lx.\n", - virt_to_phys((void *)ibft_addr)); + printk(KERN_INFO "iBFT detected at 0x%llx.\n", + (u64)virt_to_phys((void *)ibft_addr)); rc = ibft_check_device(); if (rc) Index: linux-2.6-tip/drivers/gpu/drm/drm_proc.c =================================================================== --- linux-2.6-tip.orig/drivers/gpu/drm/drm_proc.c +++ linux-2.6-tip/drivers/gpu/drm/drm_proc.c @@ -678,9 +678,9 @@ static int drm__vma_info(char *buf, char *start = &buf[offset]; *eof = 0; - DRM_PROC_PRINT("vma use count: %d, high_memory = %p, 0x%08lx\n", + DRM_PROC_PRINT("vma use count: %d, high_memory = %p, 0x%llx\n", atomic_read(&dev->vma_count), - high_memory, virt_to_phys(high_memory)); + high_memory, (u64)virt_to_phys(high_memory)); list_for_each_entry(pt, &dev->vmalist, head) { if (!(vma = pt->vma)) continue; Index: linux-2.6-tip/drivers/gpu/drm/i915/i915_gem.c =================================================================== --- linux-2.6-tip.orig/drivers/gpu/drm/i915/i915_gem.c +++ linux-2.6-tip/drivers/gpu/drm/i915/i915_gem.c @@ -1051,6 +1051,9 @@ i915_gem_retire_requests(struct drm_devi drm_i915_private_t *dev_priv = dev->dev_private; uint32_t seqno; + if (!dev_priv->hw_status_page) + return; + seqno = i915_get_gem_seqno(dev); while (!list_empty(&dev_priv->mm.request_list)) { Index: linux-2.6-tip/drivers/hwmon/adt7473.c =================================================================== --- linux-2.6-tip.orig/drivers/hwmon/adt7473.c +++ linux-2.6-tip/drivers/hwmon/adt7473.c @@ -848,6 +848,8 @@ static ssize_t show_pwm_auto_temp(struct } /* shouldn't ever get here */ BUG(); + + return 0; } static ssize_t set_pwm_auto_temp(struct device *dev, Index: linux-2.6-tip/drivers/hwmon/i5k_amb.c =================================================================== --- linux-2.6-tip.orig/drivers/hwmon/i5k_amb.c +++ linux-2.6-tip/drivers/hwmon/i5k_amb.c @@ -480,7 +480,7 @@ static unsigned long i5k_channel_pci_id( case PCI_DEVICE_ID_INTEL_5400_ERR: return PCI_DEVICE_ID_INTEL_5400_FBD0 + channel; default: - BUG(); + panic("i5k_channel_pci_id: unknown chipset!"); } } Index: linux-2.6-tip/drivers/i2c/busses/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/i2c/busses/Kconfig +++ linux-2.6-tip/drivers/i2c/busses/Kconfig @@ -56,6 +56,9 @@ config I2C_AMD756 config I2C_AMD756_S4882 tristate "SMBus multiplexing on the Tyan S4882" depends on I2C_AMD756 && X86 && EXPERIMENTAL + # broke an Athlon 64 X2 Asus A8N-E with: + # http://redhat.com/~mingo/misc/config-Thu_Jul_17_11_34_08_CEST_2008.bad + depends on 0 help Enabling this option will add specific SMBus support for the Tyan S4882 motherboard. On this 4-CPU board, the SMBus is multiplexed @@ -150,6 +153,9 @@ config I2C_NFORCE2 config I2C_NFORCE2_S4985 tristate "SMBus multiplexing on the Tyan S4985" depends on I2C_NFORCE2 && X86 && EXPERIMENTAL + # broke a T60 Core2Duo with: + # http://redhat.com/~mingo/misc/config-Thu_Jul_17_10_47_42_CEST_2008.bad + depends on 0 help Enabling this option will add specific SMBus support for the Tyan S4985 motherboard. On this 4-CPU board, the SMBus is multiplexed Index: linux-2.6-tip/drivers/ieee1394/csr1212.c =================================================================== --- linux-2.6-tip.orig/drivers/ieee1394/csr1212.c +++ linux-2.6-tip/drivers/ieee1394/csr1212.c @@ -35,6 +35,7 @@ #include #include +#include #include #include #include @@ -387,6 +388,7 @@ csr1212_new_descriptor_leaf(u8 dtype, u3 if (!kv) return NULL; + kmemcheck_annotate_bitfield(kv->value.leaf.data[0]); CSR1212_DESCRIPTOR_LEAF_SET_TYPE(kv, dtype); CSR1212_DESCRIPTOR_LEAF_SET_SPECIFIER_ID(kv, specifier_id); Index: linux-2.6-tip/drivers/ieee1394/nodemgr.c =================================================================== --- linux-2.6-tip.orig/drivers/ieee1394/nodemgr.c +++ linux-2.6-tip/drivers/ieee1394/nodemgr.c @@ -10,6 +10,7 @@ #include #include +#include #include #include #include @@ -39,7 +40,10 @@ struct nodemgr_csr_info { struct hpsb_host *host; nodeid_t nodeid; unsigned int generation; - unsigned int speed_unverified:1; + + kmemcheck_define_bitfield(flags, { + unsigned int speed_unverified:1; + }); }; @@ -1291,6 +1295,7 @@ static void nodemgr_node_scan_one(struct ci = kmalloc(sizeof(*ci), GFP_KERNEL); if (!ci) return; + kmemcheck_annotate_bitfield(ci->flags); ci->host = host; ci->nodeid = nodeid; Index: linux-2.6-tip/drivers/infiniband/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/infiniband/Kconfig +++ linux-2.6-tip/drivers/infiniband/Kconfig @@ -2,6 +2,7 @@ menuconfig INFINIBAND tristate "InfiniBand support" depends on PCI || BROKEN depends on HAS_IOMEM + depends on 0 ---help--- Core support for InfiniBand (IB). Make sure to also select any protocols you wish to use as well as drivers for your Index: linux-2.6-tip/drivers/infiniband/hw/amso1100/c2_vq.c =================================================================== --- linux-2.6-tip.orig/drivers/infiniband/hw/amso1100/c2_vq.c +++ linux-2.6-tip/drivers/infiniband/hw/amso1100/c2_vq.c @@ -107,7 +107,7 @@ struct c2_vq_req *vq_req_alloc(struct c2 r = kmalloc(sizeof(struct c2_vq_req), GFP_KERNEL); if (r) { init_waitqueue_head(&r->wait_object); - r->reply_msg = (u64) NULL; + r->reply_msg = (u64) (long) NULL; r->event = 0; r->cm_id = NULL; r->qp = NULL; @@ -123,7 +123,7 @@ struct c2_vq_req *vq_req_alloc(struct c2 */ void vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *r) { - r->reply_msg = (u64) NULL; + r->reply_msg = (u64) (long) NULL; if (atomic_dec_and_test(&r->refcnt)) { kfree(r); } @@ -151,7 +151,7 @@ void vq_req_get(struct c2_dev *c2dev, st void vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *r) { if (atomic_dec_and_test(&r->refcnt)) { - if (r->reply_msg != (u64) NULL) + if (r->reply_msg != (u64) (long) NULL) vq_repbuf_free(c2dev, (void *) (unsigned long) r->reply_msg); kfree(r); @@ -258,3 +258,4 @@ void vq_repbuf_free(struct c2_dev *c2dev { kmem_cache_free(c2dev->host_msg_cache, reply); } + Index: linux-2.6-tip/drivers/infiniband/hw/ipath/ipath_driver.c =================================================================== --- linux-2.6-tip.orig/drivers/infiniband/hw/ipath/ipath_driver.c +++ linux-2.6-tip/drivers/infiniband/hw/ipath/ipath_driver.c @@ -2715,7 +2715,7 @@ static void ipath_hol_signal_up(struct i * to prevent HoL blocking, then start the HoL timer that * periodically continues, then stop procs, so they can detect * link down if they want, and do something about it. - * Timer may already be running, so use __mod_timer, not add_timer. + * Timer may already be running, so use mod_timer, not add_timer. */ void ipath_hol_down(struct ipath_devdata *dd) { @@ -2724,7 +2724,7 @@ void ipath_hol_down(struct ipath_devdata dd->ipath_hol_next = IPATH_HOL_DOWNCONT; dd->ipath_hol_timer.expires = jiffies + msecs_to_jiffies(ipath_hol_timeout_ms); - __mod_timer(&dd->ipath_hol_timer, dd->ipath_hol_timer.expires); + mod_timer(&dd->ipath_hol_timer, dd->ipath_hol_timer.expires); } /* @@ -2763,7 +2763,7 @@ void ipath_hol_event(unsigned long opaqu else { dd->ipath_hol_timer.expires = jiffies + msecs_to_jiffies(ipath_hol_timeout_ms); - __mod_timer(&dd->ipath_hol_timer, + mod_timer(&dd->ipath_hol_timer, dd->ipath_hol_timer.expires); } } Index: linux-2.6-tip/drivers/input/keyboard/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/input/keyboard/Kconfig +++ linux-2.6-tip/drivers/input/keyboard/Kconfig @@ -13,11 +13,11 @@ menuconfig INPUT_KEYBOARD if INPUT_KEYBOARD config KEYBOARD_ATKBD - tristate "AT keyboard" if EMBEDDED || !X86_PC + tristate "AT keyboard" if EMBEDDED || !X86 default y select SERIO select SERIO_LIBPS2 - select SERIO_I8042 if X86_PC + select SERIO_I8042 if X86 select SERIO_GSCPS2 if GSC help Say Y here if you want to use a standard AT or PS/2 keyboard. Usually Index: linux-2.6-tip/drivers/input/mouse/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/input/mouse/Kconfig +++ linux-2.6-tip/drivers/input/mouse/Kconfig @@ -17,7 +17,7 @@ config MOUSE_PS2 default y select SERIO select SERIO_LIBPS2 - select SERIO_I8042 if X86_PC + select SERIO_I8042 if X86 select SERIO_GSCPS2 if GSC help Say Y here if you have a PS/2 mouse connected to your system. This Index: linux-2.6-tip/drivers/input/touchscreen/htcpen.c =================================================================== --- linux-2.6-tip.orig/drivers/input/touchscreen/htcpen.c +++ linux-2.6-tip/drivers/input/touchscreen/htcpen.c @@ -47,12 +47,6 @@ static int invert_y; module_param(invert_y, bool, 0644); MODULE_PARM_DESC(invert_y, "If set, Y axis is inverted"); -static struct pnp_device_id pnp_ids[] = { - { .id = "PNP0cc0" }, - { .id = "" } -}; -MODULE_DEVICE_TABLE(pnp, pnp_ids); - static irqreturn_t htcpen_interrupt(int irq, void *handle) { struct input_dev *htcpen_dev = handle; @@ -253,3 +247,4 @@ static void __exit htcpen_isa_exit(void) module_init(htcpen_isa_init); module_exit(htcpen_isa_exit); + Index: linux-2.6-tip/drivers/isdn/capi/capidrv.c =================================================================== --- linux-2.6-tip.orig/drivers/isdn/capi/capidrv.c +++ linux-2.6-tip/drivers/isdn/capi/capidrv.c @@ -1551,8 +1551,8 @@ static int decodeFVteln(char *teln, unsi static int FVteln2capi20(char *teln, u8 AdditionalInfo[1+2+2+31]) { - unsigned long bmask; - int active; + unsigned long uninitialized_var(bmask); + int uninitialized_var(active); int rc, i; rc = decodeFVteln(teln, &bmask, &active); Index: linux-2.6-tip/drivers/isdn/hardware/eicon/maintidi.c =================================================================== --- linux-2.6-tip.orig/drivers/isdn/hardware/eicon/maintidi.c +++ linux-2.6-tip/drivers/isdn/hardware/eicon/maintidi.c @@ -959,7 +959,7 @@ static int process_idi_event (diva_strac } if (!strncmp("State\\Layer2 No1", path, pVar->path_length)) { char* tmp = &pLib->lines[0].pInterface->Layer2[0]; - dword l2_state; + dword uninitialized_var(l2_state); diva_strace_read_uint (pVar, &l2_state); switch (l2_state) { Index: linux-2.6-tip/drivers/isdn/hardware/eicon/message.c =================================================================== --- linux-2.6-tip.orig/drivers/isdn/hardware/eicon/message.c +++ linux-2.6-tip/drivers/isdn/hardware/eicon/message.c @@ -2682,7 +2682,7 @@ byte connect_b3_req(dword Id, word Numbe if (!(fax_control_bits & T30_CONTROL_BIT_MORE_DOCUMENTS) || (fax_feature_bits & T30_FEATURE_BIT_MORE_DOCUMENTS)) { - len = (byte)(&(((T30_INFO *) 0)->universal_6)); + len = (byte)(offsetof(T30_INFO, universal_6)); fax_info_change = false; if (ncpi->length >= 4) { @@ -2744,7 +2744,7 @@ byte connect_b3_req(dword Id, word Numbe for (i = 0; i < w; i++) ((T30_INFO *)(plci->fax_connect_info_buffer))->station_id[i] = fax_parms[4].info[1+i]; ((T30_INFO *)(plci->fax_connect_info_buffer))->head_line_len = 0; - len = (byte)(((T30_INFO *) 0)->station_id + 20); + len = (byte)(offsetof(T30_INFO, station_id) + 20); w = fax_parms[5].length; if (w > 20) w = 20; @@ -2778,7 +2778,7 @@ byte connect_b3_req(dword Id, word Numbe } else { - len = (byte)(&(((T30_INFO *) 0)->universal_6)); + len = (byte)(offsetof(T30_INFO, universal_6)); } fax_info_change = true; @@ -2881,7 +2881,7 @@ byte connect_b3_res(dword Id, word Numbe && (plci->nsf_control_bits & T30_NSF_CONTROL_BIT_ENABLE_NSF) && (plci->nsf_control_bits & T30_NSF_CONTROL_BIT_NEGOTIATE_RESP)) { - len = ((byte)(((T30_INFO *) 0)->station_id + 20)); + len = (byte)(offsetof(T30_INFO, station_id) + 20); if (plci->fax_connect_info_length < len) { ((T30_INFO *)(plci->fax_connect_info_buffer))->station_id_len = 0; @@ -3782,7 +3782,7 @@ static byte manufacturer_res(dword Id, w break; } ncpi = &m_parms[1]; - len = ((byte)(((T30_INFO *) 0)->station_id + 20)); + len = (byte)(offsetof(T30_INFO, station_id) + 20); if (plci->fax_connect_info_length < len) { ((T30_INFO *)(plci->fax_connect_info_buffer))->station_id_len = 0; @@ -6485,7 +6485,7 @@ static void nl_ind(PLCI *plci) word info = 0; word fax_feature_bits; byte fax_send_edata_ack; - static byte v120_header_buffer[2 + 3]; + static byte v120_header_buffer[2 + 3] __attribute__ ((aligned(8))); static word fax_info[] = { 0, /* T30_SUCCESS */ _FAX_NO_CONNECTION, /* T30_ERR_NO_DIS_RECEIVED */ @@ -6824,7 +6824,7 @@ static void nl_ind(PLCI *plci) if ((plci->requested_options_conn | plci->requested_options | a->requested_options_table[plci->appl->Id-1]) & ((1L << PRIVATE_FAX_SUB_SEP_PWD) | (1L << PRIVATE_FAX_NONSTANDARD))) { - i = ((word)(((T30_INFO *) 0)->station_id + 20)) + ((T30_INFO *)plci->NL.RBuffer->P)->head_line_len; + i = ((word)(offsetof(T30_INFO, station_id) + 20)) + ((T30_INFO *)plci->NL.RBuffer->P)->head_line_len; while (i < plci->NL.RBuffer->length) plci->ncpi_buffer[++len] = plci->NL.RBuffer->P[i++]; } @@ -7216,7 +7216,7 @@ static void nl_ind(PLCI *plci) { plci->RData[1].P = plci->RData[0].P; plci->RData[1].PLength = plci->RData[0].PLength; - plci->RData[0].P = v120_header_buffer + (-((int) v120_header_buffer) & 3); + plci->RData[0].P = v120_header_buffer; if ((plci->NL.RBuffer->P[0] & V120_HEADER_EXTEND_BIT) || (plci->NL.RLength == 1)) plci->RData[0].PLength = 1; else @@ -8395,6 +8395,7 @@ static word add_b23(PLCI *plci, API_PARS /* copy head line to NLC */ if(b3_config_parms[3].length) { + byte *head_line = (void *) ((T30_INFO *)&nlc[1] + 1); pos = (byte)(fax_head_line_time (&(((T30_INFO *)&nlc[1])->station_id[20]))); if (pos != 0) @@ -8403,17 +8404,17 @@ static word add_b23(PLCI *plci, API_PARS pos = 0; else { - ((T30_INFO *)&nlc[1])->station_id[20 + pos++] = ' '; - ((T30_INFO *)&nlc[1])->station_id[20 + pos++] = ' '; + head_line[pos++] = ' '; + head_line[pos++] = ' '; len = (byte)b3_config_parms[2].length; if (len > 20) len = 20; if (CAPI_MAX_DATE_TIME_LENGTH + 2 + len + 2 + b3_config_parms[3].length <= CAPI_MAX_HEAD_LINE_SPACE) { for (i = 0; i < len; i++) - ((T30_INFO *)&nlc[1])->station_id[20 + pos++] = ((byte *)b3_config_parms[2].info)[1+i]; - ((T30_INFO *)&nlc[1])->station_id[20 + pos++] = ' '; - ((T30_INFO *)&nlc[1])->station_id[20 + pos++] = ' '; + head_line[pos++] = ((byte *)b3_config_parms[2].info)[1+i]; + head_line[pos++] = ' '; + head_line[pos++] = ' '; } } } @@ -8424,7 +8425,7 @@ static word add_b23(PLCI *plci, API_PARS ((T30_INFO *)&nlc[1])->head_line_len = (byte)(pos + len); nlc[0] += (byte)(pos + len); for (i = 0; i < len; i++) - ((T30_INFO *)&nlc[1])->station_id[20 + pos++] = ((byte *)b3_config_parms[3].info)[1+i]; + head_line[pos++] = ((byte *)b3_config_parms[3].info)[1+i]; } else ((T30_INFO *)&nlc[1])->head_line_len = 0; @@ -8453,7 +8454,7 @@ static word add_b23(PLCI *plci, API_PARS fax_control_bits |= T30_CONTROL_BIT_ACCEPT_SEL_POLLING; } len = nlc[0]; - pos = ((byte)(((T30_INFO *) 0)->station_id + 20)); + pos = (byte)(offsetof(T30_INFO, station_id) + 20); if (pos < plci->fax_connect_info_length) { for (i = 1 + plci->fax_connect_info_buffer[pos]; i != 0; i--) @@ -8505,7 +8506,7 @@ static word add_b23(PLCI *plci, API_PARS } PUT_WORD(&(((T30_INFO *)&nlc[1])->control_bits_low), fax_control_bits); - len = ((byte)(((T30_INFO *) 0)->station_id + 20)); + len = (byte)(offsetof(T30_INFO, station_id) + 20); for (i = 0; i < len; i++) plci->fax_connect_info_buffer[i] = nlc[1+i]; ((T30_INFO *) plci->fax_connect_info_buffer)->head_line_len = 0; @@ -15049,3 +15050,4 @@ static void diva_free_dma_descriptor (PL } /*------------------------------------------------------------------*/ + Index: linux-2.6-tip/drivers/isdn/hisax/config.c =================================================================== --- linux-2.6-tip.orig/drivers/isdn/hisax/config.c +++ linux-2.6-tip/drivers/isdn/hisax/config.c @@ -1980,7 +1980,7 @@ static struct pci_device_id hisax_pci_tb { } /* Terminating entry */ }; -MODULE_DEVICE_TABLE(pci, hisax_pci_tbl); +MODULE_STATIC_DEVICE_TABLE(pci, hisax_pci_tbl); #endif /* CONFIG_PCI */ module_init(HiSax_init); Index: linux-2.6-tip/drivers/isdn/i4l/isdn_common.c =================================================================== --- linux-2.6-tip.orig/drivers/isdn/i4l/isdn_common.c +++ linux-2.6-tip/drivers/isdn/i4l/isdn_common.c @@ -1280,7 +1280,9 @@ isdn_ioctl(struct inode *inode, struct f int ret; int i; char __user *p; +#ifdef CONFIG_NETDEVICES char *s; +#endif union iocpar { char name[10]; char bname[22]; Index: linux-2.6-tip/drivers/isdn/i4l/isdn_ppp.c =================================================================== --- linux-2.6-tip.orig/drivers/isdn/i4l/isdn_ppp.c +++ linux-2.6-tip/drivers/isdn/i4l/isdn_ppp.c @@ -466,7 +466,7 @@ static int get_filter(void __user *arg, *p = code; return uprog.len; } -#endif /* CONFIG_IPPP_FILTER */ +#endif /* * ippp device ioctl Index: linux-2.6-tip/drivers/isdn/icn/icn.c =================================================================== --- linux-2.6-tip.orig/drivers/isdn/icn/icn.c +++ linux-2.6-tip/drivers/isdn/icn/icn.c @@ -717,7 +717,7 @@ icn_sendbuf(int channel, int ack, struct return 0; if (card->sndcount[channel] > ICN_MAX_SQUEUE) return 0; - #warning TODO test headroom or use skb->nb to flag ACK + /* TODO test headroom or use skb->nb to flag ACK: */ nskb = skb_clone(skb, GFP_ATOMIC); if (nskb) { /* Push ACK flag as one Index: linux-2.6-tip/drivers/isdn/mISDN/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/isdn/mISDN/Kconfig +++ linux-2.6-tip/drivers/isdn/mISDN/Kconfig @@ -4,6 +4,9 @@ menuconfig MISDN tristate "Modular ISDN driver" + # broken with: + # http://redhat.com/~mingo/misc/config-Sun_Jul_27_08_30_16_CEST_2008.bad + depends on 0 help Enable support for the modular ISDN driver. Index: linux-2.6-tip/drivers/isdn/sc/card.h =================================================================== --- linux-2.6-tip.orig/drivers/isdn/sc/card.h +++ linux-2.6-tip/drivers/isdn/sc/card.h @@ -82,7 +82,7 @@ typedef struct { int ioport[MAX_IO_REGS]; /* Index to I/O ports */ int shmem_pgport; /* port for the exp mem page reg. */ int shmem_magic; /* adapter magic number */ - unsigned int rambase; /* Shared RAM base address */ + u8 __iomem *rambase; /* Shared RAM base address */ unsigned int ramsize; /* Size of shared memory */ RspMessage async_msg; /* Async response message */ int want_async_messages; /* Snoop the Q ? */ Index: linux-2.6-tip/drivers/isdn/sc/init.c =================================================================== --- linux-2.6-tip.orig/drivers/isdn/sc/init.c +++ linux-2.6-tip/drivers/isdn/sc/init.c @@ -27,7 +27,7 @@ static const char *boardname[] = { "Data /* insmod set parameters */ static unsigned int io[] = {0,0,0,0}; static unsigned char irq[] = {0,0,0,0}; -static unsigned long ram[] = {0,0,0,0}; +static u8 __iomem * ram[] = {0,0,0,0}; static int do_reset = 0; module_param_array(io, int, NULL, 0); @@ -35,7 +35,7 @@ module_param_array(irq, int, NULL, 0); module_param_array(ram, int, NULL, 0); module_param(do_reset, bool, 0); -static int identify_board(unsigned long, unsigned int); +static int identify_board(u8 __iomem *rambase, unsigned int iobase); static int __init sc_init(void) { @@ -153,7 +153,7 @@ static int __init sc_init(void) outb(0xFF, io[b] + RESET_OFFSET); msleep_interruptible(10000); } - pr_debug("RAM Base for board %d is 0x%lx, %s probe\n", b, + pr_debug("RAM Base for board %d is %p, %s probe\n", b, ram[b], ram[b] == 0 ? "will" : "won't"); if(ram[b]) { @@ -162,10 +162,10 @@ static int __init sc_init(void) * Just look for a signature and ID the * board model */ - if(request_region(ram[b], SRAM_PAGESIZE, "sc test")) { - pr_debug("request_region for RAM base 0x%lx succeeded\n", ram[b]); + if (request_region((unsigned long)ram[b], SRAM_PAGESIZE, "sc test")) { + pr_debug("request_region for RAM base %p succeeded\n", ram[b]); model = identify_board(ram[b], io[b]); - release_region(ram[b], SRAM_PAGESIZE); + release_region((unsigned long)ram[b], SRAM_PAGESIZE); } } else { @@ -177,12 +177,12 @@ static int __init sc_init(void) pr_debug("Checking RAM address 0x%x...\n", i); if(request_region(i, SRAM_PAGESIZE, "sc test")) { pr_debug(" request_region succeeded\n"); - model = identify_board(i, io[b]); + model = identify_board((u8 __iomem *)i, io[b]); release_region(i, SRAM_PAGESIZE); if (model >= 0) { pr_debug(" Identified a %s\n", boardname[model]); - ram[b] = i; + ram[b] = (u8 __iomem *)i; break; } pr_debug(" Unidentifed or inaccessible\n"); @@ -199,7 +199,7 @@ static int __init sc_init(void) * Nope, there was no place in RAM for the * board, or it couldn't be identified */ - pr_debug("Failed to find an adapter at 0x%lx\n", ram[b]); + pr_debug("Failed to find an adapter at %p\n", ram[b]); continue; } @@ -222,7 +222,7 @@ static int __init sc_init(void) features = BRI_FEATURES; break; } - switch(ram[b] >> 12 & 0x0F) { + switch((unsigned long)ram[b] >> 12 & 0x0F) { case 0x0: pr_debug("RAM Page register set to EXP_PAGE0\n"); pgport = EXP_PAGE0; @@ -358,10 +358,10 @@ static int __init sc_init(void) pr_debug("Requesting I/O Port %#x\n", sc_adapter[cinst]->ioport[IRQ_SELECT]); sc_adapter[cinst]->rambase = ram[b]; - request_region(sc_adapter[cinst]->rambase, SRAM_PAGESIZE, - interface->id); + request_region((unsigned long)sc_adapter[cinst]->rambase, + SRAM_PAGESIZE, interface->id); - pr_info(" %s (%d) - %s %d channels IRQ %d, I/O Base 0x%x, RAM Base 0x%lx\n", + pr_info(" %s (%d) - %s %d channels IRQ %d, I/O Base 0x%x, RAM Base %p\n", sc_adapter[cinst]->devicename, sc_adapter[cinst]->driverId, boardname[model], channels, irq[b], io[b], ram[b]); @@ -400,7 +400,7 @@ static void __exit sc_exit(void) /* * Release shared RAM */ - release_region(sc_adapter[i]->rambase, SRAM_PAGESIZE); + release_region((unsigned long)sc_adapter[i]->rambase, SRAM_PAGESIZE); /* * Release the IRQ @@ -434,7 +434,7 @@ static void __exit sc_exit(void) pr_info("SpellCaster ISA ISDN Adapter Driver Unloaded.\n"); } -static int identify_board(unsigned long rambase, unsigned int iobase) +static int identify_board(u8 __iomem *rambase, unsigned int iobase) { unsigned int pgport; unsigned long sig; @@ -444,15 +444,15 @@ static int identify_board(unsigned long HWConfig_pl hwci; int x; - pr_debug("Attempting to identify adapter @ 0x%lx io 0x%x\n", + pr_debug("Attempting to identify adapter @ %p io 0x%x\n", rambase, iobase); /* * Enable the base pointer */ - outb(rambase >> 12, iobase + 0x2c00); + outb((unsigned long)rambase >> 12, iobase + 0x2c00); - switch(rambase >> 12 & 0x0F) { + switch((unsigned long)rambase >> 12 & 0x0F) { case 0x0: pgport = iobase + PG0_OFFSET; pr_debug("Page Register offset is 0x%x\n", PG0_OFFSET); @@ -473,7 +473,7 @@ static int identify_board(unsigned long pr_debug("Page Register offset is 0x%x\n", PG3_OFFSET); break; default: - pr_debug("Invalid rambase 0x%lx\n", rambase); + pr_debug("Invalid rambase %p\n", rambase); return -1; } Index: linux-2.6-tip/drivers/isdn/sc/scioc.h =================================================================== --- linux-2.6-tip.orig/drivers/isdn/sc/scioc.h +++ linux-2.6-tip/drivers/isdn/sc/scioc.h @@ -86,7 +86,7 @@ typedef struct { char load_ver[11]; char proc_ver[11]; int iobase; - long rambase; + u8 __iomem *rambase; char irq; long ramsize; char interface; Index: linux-2.6-tip/drivers/isdn/sc/shmem.c =================================================================== --- linux-2.6-tip.orig/drivers/isdn/sc/shmem.c +++ linux-2.6-tip/drivers/isdn/sc/shmem.c @@ -54,7 +54,7 @@ void memcpy_toshmem(int card, void *dest spin_unlock_irqrestore(&sc_adapter[card]->lock, flags); pr_debug("%s: set page to %#x\n",sc_adapter[card]->devicename, ((sc_adapter[card]->shmem_magic + ch * SRAM_PAGESIZE)>>14)|0x80); - pr_debug("%s: copying %d bytes from %#lx to %#lx\n", + pr_debug("%s: copying %d bytes from %#lx to %p\n", sc_adapter[card]->devicename, n, (unsigned long) src, sc_adapter[card]->rambase + ((unsigned long) dest %0x4000)); Index: linux-2.6-tip/drivers/isdn/sc/timer.c =================================================================== --- linux-2.6-tip.orig/drivers/isdn/sc/timer.c +++ linux-2.6-tip/drivers/isdn/sc/timer.c @@ -27,7 +27,7 @@ static void setup_ports(int card) { - outb((sc_adapter[card]->rambase >> 12), sc_adapter[card]->ioport[EXP_BASE]); + outb(((long)sc_adapter[card]->rambase >> 12), sc_adapter[card]->ioport[EXP_BASE]); /* And the IRQ */ outb((sc_adapter[card]->interrupt | 0x80), Index: linux-2.6-tip/drivers/lguest/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/lguest/Kconfig +++ linux-2.6-tip/drivers/lguest/Kconfig @@ -1,6 +1,6 @@ config LGUEST tristate "Linux hypervisor example code" - depends on X86_32 && EXPERIMENTAL && !X86_PAE && FUTEX && !X86_VOYAGER + depends on X86_32 && EXPERIMENTAL && !X86_PAE && FUTEX select HVC_DRIVER ---help--- This is a very simple module which allows you to run Index: linux-2.6-tip/drivers/md/dm-raid1.c =================================================================== --- linux-2.6-tip.orig/drivers/md/dm-raid1.c +++ linux-2.6-tip/drivers/md/dm-raid1.c @@ -921,7 +921,7 @@ static int parse_features(struct mirror_ static int mirror_ctr(struct dm_target *ti, unsigned int argc, char **argv) { int r; - unsigned int nr_mirrors, m, args_used; + unsigned int nr_mirrors, m, uninitialized_var(args_used); struct mirror_set *ms; struct dm_dirty_log *dl; Index: linux-2.6-tip/drivers/media/dvb/dvb-usb/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/media/dvb/dvb-usb/Kconfig +++ linux-2.6-tip/drivers/media/dvb/dvb-usb/Kconfig @@ -235,6 +235,7 @@ config DVB_USB_OPERA1 config DVB_USB_AF9005 tristate "Afatech AF9005 DVB-T USB1.1 support" depends on DVB_USB && EXPERIMENTAL + depends on 0 select MEDIA_TUNER_MT2060 if !MEDIA_TUNER_CUSTOMIZE select MEDIA_TUNER_QT1010 if !MEDIA_TUNER_CUSTOMIZE help Index: linux-2.6-tip/drivers/media/video/cx88/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/media/video/cx88/Kconfig +++ linux-2.6-tip/drivers/media/video/cx88/Kconfig @@ -1,6 +1,8 @@ config VIDEO_CX88 tristate "Conexant 2388x (bt878 successor) support" depends on VIDEO_DEV && PCI && I2C && INPUT + # build failure, see config-Mon_Oct_20_13_45_14_CEST_2008.bad + depends on BROKEN select I2C_ALGOBIT select VIDEO_BTCX select VIDEOBUF_DMA_SG Index: linux-2.6-tip/drivers/memstick/core/mspro_block.c =================================================================== --- linux-2.6-tip.orig/drivers/memstick/core/mspro_block.c +++ linux-2.6-tip/drivers/memstick/core/mspro_block.c @@ -651,6 +651,7 @@ has_int_reg: default: BUG(); + return -EINVAL; } } Index: linux-2.6-tip/drivers/message/fusion/mptbase.c =================================================================== --- linux-2.6-tip.orig/drivers/message/fusion/mptbase.c +++ linux-2.6-tip/drivers/message/fusion/mptbase.c @@ -126,7 +126,9 @@ static int mfcounter = 0; * Public data... */ +#ifdef CONFIG_PROC_FS static struct proc_dir_entry *mpt_proc_root_dir; +#endif #define WHOINIT_UNKNOWN 0xAA Index: linux-2.6-tip/drivers/message/i2o/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/message/i2o/Kconfig +++ linux-2.6-tip/drivers/message/i2o/Kconfig @@ -54,7 +54,7 @@ config I2O_EXT_ADAPTEC_DMA64 config I2O_CONFIG tristate "I2O Configuration support" - depends on VIRT_TO_BUS + depends on VIRT_TO_BUS && (BROKEN || !64BIT) ---help--- Say Y for support of the configuration interface for the I2O adapters. If you have a RAID controller from Adaptec and you want to use the @@ -66,6 +66,8 @@ config I2O_CONFIG Note: If you want to use the new API you have to download the i2o_config patch from http://i2o.shadowconnect.com/ + Note: This is broken on 64-bit architectures. + config I2O_CONFIG_OLD_IOCTL bool "Enable ioctls (OBSOLETE)" depends on I2O_CONFIG Index: linux-2.6-tip/drivers/mfd/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/mfd/Kconfig +++ linux-2.6-tip/drivers/mfd/Kconfig @@ -210,6 +210,8 @@ config MFD_WM8350_I2C tristate "Support Wolfson Microelectronics WM8350 with I2C" select MFD_WM8350 depends on I2C + # build failure + depends on 0 help The WM8350 is an integrated audio and power management subsystem with watchdog and RTC functionality for embedded Index: linux-2.6-tip/drivers/mfd/da903x.c =================================================================== --- linux-2.6-tip.orig/drivers/mfd/da903x.c +++ linux-2.6-tip/drivers/mfd/da903x.c @@ -75,6 +75,7 @@ static inline int __da903x_read(struct i { int ret; + *val = 0; ret = i2c_smbus_read_byte_data(client, reg); if (ret < 0) { dev_err(&client->dev, "failed reading at 0x%02x\n", reg); Index: linux-2.6-tip/drivers/misc/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/misc/Kconfig +++ linux-2.6-tip/drivers/misc/Kconfig @@ -162,7 +162,7 @@ config ENCLOSURE_SERVICES config SGI_XP tristate "Support communication between SGI SSIs" depends on NET - depends on (IA64_GENERIC || IA64_SGI_SN2 || IA64_SGI_UV || X86_64) && SMP + depends on (IA64_GENERIC || IA64_SGI_SN2 || IA64_SGI_UV || X86_UV) && SMP select IA64_UNCACHED_ALLOCATOR if IA64_GENERIC || IA64_SGI_SN2 select GENERIC_ALLOCATOR if IA64_GENERIC || IA64_SGI_SN2 select SGI_GRU if (IA64_GENERIC || IA64_SGI_UV || X86_64) && SMP @@ -189,7 +189,7 @@ config HP_ILO config SGI_GRU tristate "SGI GRU driver" - depends on (X86_64 || IA64_SGI_UV || IA64_GENERIC) && SMP + depends on (X86_UV || IA64_SGI_UV || IA64_GENERIC) && SMP default n select MMU_NOTIFIER ---help--- @@ -218,6 +218,8 @@ config DELL_LAPTOP depends on BACKLIGHT_CLASS_DEVICE depends on RFKILL depends on POWER_SUPPLY + # broken build with: config-Thu_Jan_15_01_30_52_CET_2009.bad + depends on 0 default n ---help--- This driver adds support for rfkill and backlight control to Dell Index: linux-2.6-tip/drivers/misc/ics932s401.c =================================================================== --- linux-2.6-tip.orig/drivers/misc/ics932s401.c +++ linux-2.6-tip/drivers/misc/ics932s401.c @@ -374,7 +374,7 @@ static ssize_t show_value(struct device struct device_attribute *devattr, char *buf) { - int x; + int x = 0; if (devattr == &dev_attr_usb_clock) x = 48000; @@ -392,7 +392,7 @@ static ssize_t show_spread(struct device { struct ics932s401_data *data = ics932s401_update_device(dev); int reg; - unsigned long val; + unsigned long val = 0; if (!(data->regs[ICS932S401_REG_CFG2] & ICS932S401_CFG1_SPREAD)) return sprintf(buf, "0%%\n"); Index: linux-2.6-tip/drivers/misc/sgi-gru/grufault.c =================================================================== --- linux-2.6-tip.orig/drivers/misc/sgi-gru/grufault.c +++ linux-2.6-tip/drivers/misc/sgi-gru/grufault.c @@ -282,8 +282,8 @@ static int gru_try_dropin(struct gru_thr { struct mm_struct *mm = gts->ts_mm; struct vm_area_struct *vma; - int pageshift, asid, write, ret; - unsigned long paddr, gpa, vaddr; + int uninitialized_var(pageshift), asid, write, ret; + unsigned long uninitialized_var(paddr), gpa, vaddr; /* * NOTE: The GRU contains magic hardware that eliminates races between Index: linux-2.6-tip/drivers/misc/sgi-gru/grufile.c =================================================================== --- linux-2.6-tip.orig/drivers/misc/sgi-gru/grufile.c +++ linux-2.6-tip/drivers/misc/sgi-gru/grufile.c @@ -36,23 +36,11 @@ #include #include #include +#include #include "gru.h" #include "grulib.h" #include "grutables.h" -#if defined CONFIG_X86_64 -#include -#include -#define IS_UV() is_uv_system() -#elif defined CONFIG_IA64 -#include -#include -/* temp support for running on hardware simulator */ -#define IS_UV() IS_MEDUSA() || ia64_platform_is("uv") -#else -#define IS_UV() 0 -#endif - #include #include @@ -381,7 +369,7 @@ static int __init gru_init(void) char id[10]; void *gru_start_vaddr; - if (!IS_UV()) + if (!is_uv_system()) return 0; #if defined CONFIG_IA64 @@ -451,7 +439,7 @@ static void __exit gru_exit(void) int order = get_order(sizeof(struct gru_state) * GRU_CHIPLETS_PER_BLADE); - if (!IS_UV()) + if (!is_uv_system()) return; for (i = 0; i < GRU_CHIPLETS_PER_BLADE; i++) Index: linux-2.6-tip/drivers/misc/sgi-xp/xp.h =================================================================== --- linux-2.6-tip.orig/drivers/misc/sgi-xp/xp.h +++ linux-2.6-tip/drivers/misc/sgi-xp/xp.h @@ -15,19 +15,19 @@ #include -#ifdef CONFIG_IA64 +#if defined CONFIG_X86_UV || defined CONFIG_IA64_SGI_UV +#include +#define is_uv() is_uv_system() +#endif + +#ifndef is_uv +#define is_uv() 0 +#endif + +#if defined CONFIG_IA64 #include #include /* defines is_shub1() and is_shub2() */ #define is_shub() ia64_platform_is("sn2") -#ifdef CONFIG_IA64_SGI_UV -#define is_uv() ia64_platform_is("uv") -#else -#define is_uv() 0 -#endif -#endif -#ifdef CONFIG_X86_64 -#include -#define is_uv() is_uv_system() #endif #ifndef is_shub1 @@ -42,10 +42,6 @@ #define is_shub() 0 #endif -#ifndef is_uv -#define is_uv() 0 -#endif - #ifdef USE_DBUG_ON #define DBUG_ON(condition) BUG_ON(condition) #else Index: linux-2.6-tip/drivers/misc/sgi-xp/xpc_main.c =================================================================== --- linux-2.6-tip.orig/drivers/misc/sgi-xp/xpc_main.c +++ linux-2.6-tip/drivers/misc/sgi-xp/xpc_main.c @@ -318,7 +318,7 @@ xpc_hb_checker(void *ignore) /* this thread was marked active by xpc_hb_init() */ - set_cpus_allowed_ptr(current, &cpumask_of_cpu(XPC_HB_CHECK_CPU)); + set_cpus_allowed_ptr(current, cpumask_of(XPC_HB_CHECK_CPU)); /* set our heartbeating to other partitions into motion */ xpc_hb_check_timeout = jiffies + (xpc_hb_check_interval * HZ); Index: linux-2.6-tip/drivers/mtd/devices/mtd_dataflash.c =================================================================== --- linux-2.6-tip.orig/drivers/mtd/devices/mtd_dataflash.c +++ linux-2.6-tip/drivers/mtd/devices/mtd_dataflash.c @@ -679,7 +679,7 @@ add_dataflash_otp(struct spi_device *spi dev_set_drvdata(&spi->dev, priv); if (mtd_has_partitions()) { - struct mtd_partition *parts; + struct mtd_partition *uninitialized_var(parts); int nr_parts = 0; #ifdef CONFIG_MTD_CMDLINE_PARTS Index: linux-2.6-tip/drivers/mtd/devices/phram.c =================================================================== --- linux-2.6-tip.orig/drivers/mtd/devices/phram.c +++ linux-2.6-tip/drivers/mtd/devices/phram.c @@ -235,7 +235,7 @@ static int phram_setup(const char *val, { char buf[64+12+12], *str = buf; char *token[3]; - char *name; + char *uninitialized_var(name); uint32_t start; uint32_t len; int i, ret; Index: linux-2.6-tip/drivers/mtd/lpddr/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/mtd/lpddr/Kconfig +++ linux-2.6-tip/drivers/mtd/lpddr/Kconfig @@ -12,6 +12,7 @@ config MTD_LPDDR DDR memories, intended for battery-operated systems. config MTD_QINFO_PROBE + depends on MTD_LPDDR tristate "Detect flash chips by QINFO probe" help Device Information for LPDDR chips is offered through the Overlay Index: linux-2.6-tip/drivers/mtd/nand/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/mtd/nand/Kconfig +++ linux-2.6-tip/drivers/mtd/nand/Kconfig @@ -273,7 +273,7 @@ config MTD_NAND_CAFE config MTD_NAND_CS553X tristate "NAND support for CS5535/CS5536 (AMD Geode companion chip)" - depends on X86_32 && (X86_PC || X86_GENERICARCH) + depends on X86_32 help The CS553x companion chips for the AMD Geode processor include NAND flash controllers with built-in hardware ECC Index: linux-2.6-tip/drivers/net/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/net/Kconfig +++ linux-2.6-tip/drivers/net/Kconfig @@ -776,6 +776,8 @@ config NET_VENDOR_SMC config WD80x3 tristate "WD80*3 support" depends on NET_VENDOR_SMC && ISA + # broken build + depends on 0 select CRC32 help If you have a network (Ethernet) card of this type, say Y and read @@ -1151,6 +1153,8 @@ config EEXPRESS_PRO config HPLAN_PLUS tristate "HP PCLAN+ (27247B and 27252A) support" depends on NET_ISA + # broken build with config-Mon_Jul_21_20_21_08_CEST_2008.bad + depends on 0 select CRC32 help If you have a network (Ethernet) card of this type, say Y and read @@ -2537,6 +2541,8 @@ config MYRI10GE_DCA config NETXEN_NIC tristate "NetXen Multi port (1/10) Gigabit Ethernet NIC" + # build breakage + depends on 0 depends on PCI help This enables the support for NetXen's Gigabit Ethernet card. Index: linux-2.6-tip/drivers/net/Makefile =================================================================== --- linux-2.6-tip.orig/drivers/net/Makefile +++ linux-2.6-tip/drivers/net/Makefile @@ -109,7 +109,7 @@ ifeq ($(CONFIG_FEC_MPC52xx_MDIO),y) obj-$(CONFIG_FEC_MPC52xx) += fec_mpc52xx_phy.o endif obj-$(CONFIG_68360_ENET) += 68360enet.o -obj-$(CONFIG_WD80x3) += wd.o 8390.o +obj-$(CONFIG_WD80x3) += wd.o 8390p.o obj-$(CONFIG_EL2) += 3c503.o 8390p.o obj-$(CONFIG_NE2000) += ne.o 8390p.o obj-$(CONFIG_NE2_MCA) += ne2.o 8390p.o Index: linux-2.6-tip/drivers/net/ne3210.c =================================================================== --- linux-2.6-tip.orig/drivers/net/ne3210.c +++ linux-2.6-tip/drivers/net/ne3210.c @@ -150,7 +150,8 @@ static int __init ne3210_eisa_probe (str if (phys_mem < virt_to_phys(high_memory)) { printk(KERN_CRIT "ne3210.c: Card RAM overlaps with normal memory!!!\n"); printk(KERN_CRIT "ne3210.c: Use EISA SCU to set card memory below 1MB,\n"); - printk(KERN_CRIT "ne3210.c: or to an address above 0x%lx.\n", virt_to_phys(high_memory)); + printk(KERN_CRIT "ne3210.c: or to an address above 0x%llx.\n", + (u64)virt_to_phys(high_memory)); printk(KERN_CRIT "ne3210.c: Driver NOT installed.\n"); retval = -EINVAL; goto out3; Index: linux-2.6-tip/drivers/net/sfc/efx.c =================================================================== --- linux-2.6-tip.orig/drivers/net/sfc/efx.c +++ linux-2.6-tip/drivers/net/sfc/efx.c @@ -854,20 +854,27 @@ static void efx_fini_io(struct efx_nic * * interrupts across them. */ static int efx_wanted_rx_queues(void) { - cpumask_t core_mask; + cpumask_var_t core_mask; int count; int cpu; - cpus_clear(core_mask); + if (!alloc_cpumask_var(&core_mask, GFP_KERNEL)) { + printk(KERN_WARNING + "efx.c: allocation failure, irq balancing hobbled\n"); + return 1; + } + + cpumask_clear(core_mask); count = 0; for_each_online_cpu(cpu) { - if (!cpu_isset(cpu, core_mask)) { + if (!cpumask_test_cpu(cpu, core_mask)) { ++count; - cpus_or(core_mask, core_mask, - topology_core_siblings(cpu)); + cpumask_or(core_mask, core_mask, + topology_core_cpumask(cpu)); } } + free_cpumask_var(core_mask); return count; } Index: linux-2.6-tip/drivers/net/sfc/falcon.c =================================================================== --- linux-2.6-tip.orig/drivers/net/sfc/falcon.c +++ linux-2.6-tip/drivers/net/sfc/falcon.c @@ -338,10 +338,10 @@ static int falcon_alloc_special_buffer(s nic_data->next_buffer_table += buffer->entries; EFX_LOG(efx, "allocating special buffers %d-%d at %llx+%x " - "(virt %p phys %lx)\n", buffer->index, + "(virt %p phys %llx)\n", buffer->index, buffer->index + buffer->entries - 1, - (unsigned long long)buffer->dma_addr, len, - buffer->addr, virt_to_phys(buffer->addr)); + (u64)buffer->dma_addr, len, + buffer->addr, (u64)virt_to_phys(buffer->addr)); return 0; } @@ -353,10 +353,10 @@ static void falcon_free_special_buffer(s return; EFX_LOG(efx, "deallocating special buffers %d-%d at %llx+%x " - "(virt %p phys %lx)\n", buffer->index, + "(virt %p phys %llx)\n", buffer->index, buffer->index + buffer->entries - 1, - (unsigned long long)buffer->dma_addr, buffer->len, - buffer->addr, virt_to_phys(buffer->addr)); + (u64)buffer->dma_addr, buffer->len, + buffer->addr, (u64)virt_to_phys(buffer->addr)); pci_free_consistent(efx->pci_dev, buffer->len, buffer->addr, buffer->dma_addr); @@ -2343,10 +2343,10 @@ int falcon_probe_port(struct efx_nic *ef FALCON_MAC_STATS_SIZE); if (rc) return rc; - EFX_LOG(efx, "stats buffer at %llx (virt %p phys %lx)\n", - (unsigned long long)efx->stats_buffer.dma_addr, + EFX_LOG(efx, "stats buffer at %llx (virt %p phys %llx)\n", + (u64)efx->stats_buffer.dma_addr, efx->stats_buffer.addr, - virt_to_phys(efx->stats_buffer.addr)); + (u64)virt_to_phys(efx->stats_buffer.addr)); return 0; } @@ -2921,9 +2921,9 @@ int falcon_probe_nic(struct efx_nic *efx goto fail4; BUG_ON(efx->irq_status.dma_addr & 0x0f); - EFX_LOG(efx, "INT_KER at %llx (virt %p phys %lx)\n", - (unsigned long long)efx->irq_status.dma_addr, - efx->irq_status.addr, virt_to_phys(efx->irq_status.addr)); + EFX_LOG(efx, "INT_KER at %llx (virt %p phys %llx)\n", + (u64)efx->irq_status.dma_addr, + efx->irq_status.addr, (u64)virt_to_phys(efx->irq_status.addr)); falcon_probe_spi_devices(efx); Index: linux-2.6-tip/drivers/net/sky2.c =================================================================== --- linux-2.6-tip.orig/drivers/net/sky2.c +++ linux-2.6-tip/drivers/net/sky2.c @@ -2748,7 +2748,7 @@ static u32 sky2_mhz(const struct sky2_hw return 156; default: - BUG(); + panic("sky2_mhz: unknown chip id!"); } } Index: linux-2.6-tip/drivers/net/wimax/i2400m/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/net/wimax/i2400m/Kconfig +++ linux-2.6-tip/drivers/net/wimax/i2400m/Kconfig @@ -13,6 +13,8 @@ comment "Enable MMC support to see WiMAX config WIMAX_I2400M_USB tristate "Intel Wireless WiMAX Connection 2400 over USB (including 5x50)" depends on WIMAX && USB + # build failure: config-Thu_Jan__8_10_51_13_CET_2009.bad + depends on 0 select WIMAX_I2400M help Select if you have a device based on the Intel WiMAX Index: linux-2.6-tip/drivers/net/wireless/arlan-main.c =================================================================== --- linux-2.6-tip.orig/drivers/net/wireless/arlan-main.c +++ linux-2.6-tip/drivers/net/wireless/arlan-main.c @@ -1082,8 +1082,8 @@ static int __init arlan_probe_here(struc if (arlan_check_fingerprint(memaddr)) return -ENODEV; - printk(KERN_NOTICE "%s: Arlan found at %x, \n ", dev->name, - (int) virt_to_phys((void*)memaddr)); + printk(KERN_NOTICE "%s: Arlan found at %llx, \n ", dev->name, + (u64) virt_to_phys((void*)memaddr)); ap->card = (void *) memaddr; dev->mem_start = memaddr; Index: linux-2.6-tip/drivers/net/wireless/b43/b43.h =================================================================== --- linux-2.6-tip.orig/drivers/net/wireless/b43/b43.h +++ linux-2.6-tip/drivers/net/wireless/b43/b43.h @@ -852,7 +852,8 @@ void b43warn(struct b43_wl *wl, const ch void b43dbg(struct b43_wl *wl, const char *fmt, ...) __attribute__ ((format(printf, 2, 3))); #else /* DEBUG */ -# define b43dbg(wl, fmt...) do { /* nothing */ } while (0) +static inline void __attribute__ ((format(printf, 2, 3))) +b43dbg(struct b43_wl *wl, const char *fmt, ...) { } #endif /* DEBUG */ /* A WARN_ON variant that vanishes when b43 debugging is disabled. Index: linux-2.6-tip/drivers/net/wireless/ray_cs.c =================================================================== --- linux-2.6-tip.orig/drivers/net/wireless/ray_cs.c +++ linux-2.6-tip/drivers/net/wireless/ray_cs.c @@ -294,7 +294,9 @@ static char hop_pattern_length[] = { 1, JAPAN_TEST_HOP_MOD }; +#ifdef CONFIG_PROC_FS static char rcsid[] = "Raylink/WebGear wireless LAN - Corey "; +#endif /*============================================================================= ray_attach() creates an "instance" of the driver, allocating Index: linux-2.6-tip/drivers/net/wireless/rt2x00/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/net/wireless/rt2x00/Kconfig +++ linux-2.6-tip/drivers/net/wireless/rt2x00/Kconfig @@ -1,6 +1,7 @@ menuconfig RT2X00 tristate "Ralink driver support" depends on MAC80211 && WLAN_80211 && EXPERIMENTAL + depends on BROKEN ---help--- This will enable the experimental support for the Ralink drivers, developed in the rt2x00 project . Index: linux-2.6-tip/drivers/net/wireless/zd1201.c =================================================================== --- linux-2.6-tip.orig/drivers/net/wireless/zd1201.c +++ linux-2.6-tip/drivers/net/wireless/zd1201.c @@ -593,6 +593,9 @@ static inline int zd1201_getconfig16(str int err; __le16 zdval; + /* initialize */ + *val = 0; + err = zd1201_getconfig(zd, rid, &zdval, sizeof(__le16)); if (err) return err; Index: linux-2.6-tip/drivers/oprofile/buffer_sync.c =================================================================== --- linux-2.6-tip.orig/drivers/oprofile/buffer_sync.c +++ linux-2.6-tip/drivers/oprofile/buffer_sync.c @@ -38,7 +38,7 @@ static LIST_HEAD(dying_tasks); static LIST_HEAD(dead_tasks); -static cpumask_t marked_cpus = CPU_MASK_NONE; +static cpumask_var_t marked_cpus; static DEFINE_SPINLOCK(task_mortuary); static void process_task_mortuary(void); @@ -154,6 +154,10 @@ int sync_start(void) { int err; + if (!alloc_cpumask_var(&marked_cpus, GFP_KERNEL)) + return -ENOMEM; + cpumask_clear(marked_cpus); + start_cpu_work(); err = task_handoff_register(&task_free_nb); @@ -179,6 +183,7 @@ out2: task_handoff_unregister(&task_free_nb); out1: end_sync(); + free_cpumask_var(marked_cpus); goto out; } @@ -190,6 +195,7 @@ void sync_stop(void) profile_event_unregister(PROFILE_TASK_EXIT, &task_exit_nb); task_handoff_unregister(&task_free_nb); end_sync(); + free_cpumask_var(marked_cpus); } @@ -456,10 +462,10 @@ static void mark_done(int cpu) { int i; - cpu_set(cpu, marked_cpus); + cpumask_set_cpu(cpu, marked_cpus); for_each_online_cpu(i) { - if (!cpu_isset(i, marked_cpus)) + if (!cpumask_test_cpu(i, marked_cpus)) return; } @@ -468,7 +474,7 @@ static void mark_done(int cpu) */ process_task_mortuary(); - cpus_clear(marked_cpus); + cpumask_clear(marked_cpus); } Index: linux-2.6-tip/drivers/oprofile/cpu_buffer.c =================================================================== --- linux-2.6-tip.orig/drivers/oprofile/cpu_buffer.c +++ linux-2.6-tip/drivers/oprofile/cpu_buffer.c @@ -161,7 +161,7 @@ struct op_sample { entry->event = ring_buffer_lock_reserve (op_ring_buffer_write, sizeof(struct op_sample) + - size * sizeof(entry->sample->data[0]), &entry->irq_flags); + size * sizeof(entry->sample->data[0])); if (entry->event) entry->sample = ring_buffer_event_data(entry->event); else @@ -178,8 +178,7 @@ struct op_sample int op_cpu_buffer_write_commit(struct op_entry *entry) { - return ring_buffer_unlock_commit(op_ring_buffer_write, entry->event, - entry->irq_flags); + return ring_buffer_unlock_commit(op_ring_buffer_write, entry->event); } struct op_sample *op_cpu_buffer_read_entry(struct op_entry *entry, int cpu) Index: linux-2.6-tip/drivers/pci/dmar.c =================================================================== --- linux-2.6-tip.orig/drivers/pci/dmar.c +++ linux-2.6-tip/drivers/pci/dmar.c @@ -42,6 +42,7 @@ LIST_HEAD(dmar_drhd_units); static struct acpi_table_header * __initdata dmar_tbl; +static acpi_size dmar_tbl_size; static void __init dmar_register_drhd_unit(struct dmar_drhd_unit *drhd) { @@ -288,8 +289,9 @@ static int __init dmar_table_detect(void acpi_status status = AE_OK; /* if we could find DMAR table, then there are DMAR devices */ - status = acpi_get_table(ACPI_SIG_DMAR, 0, - (struct acpi_table_header **)&dmar_tbl); + status = acpi_get_table_with_size(ACPI_SIG_DMAR, 0, + (struct acpi_table_header **)&dmar_tbl, + &dmar_tbl_size); if (ACPI_SUCCESS(status) && !dmar_tbl) { printk (KERN_WARNING PREFIX "Unable to map DMAR\n"); @@ -481,6 +483,7 @@ void __init detect_intel_iommu(void) iommu_detected = 1; #endif } + early_acpi_os_unmap_memory(dmar_tbl, dmar_tbl_size); dmar_tbl = NULL; } Index: linux-2.6-tip/drivers/pci/hotplug/cpqphp.h =================================================================== --- linux-2.6-tip.orig/drivers/pci/hotplug/cpqphp.h +++ linux-2.6-tip/drivers/pci/hotplug/cpqphp.h @@ -449,7 +449,7 @@ extern u8 cpqhp_disk_irq; /* inline functions */ -static inline char *slot_name(struct slot *slot) +static inline const char *slot_name(struct slot *slot) { return hotplug_slot_name(slot->hotplug_slot); } Index: linux-2.6-tip/drivers/pci/hotplug/ibmphp_core.c =================================================================== --- linux-2.6-tip.orig/drivers/pci/hotplug/ibmphp_core.c +++ linux-2.6-tip/drivers/pci/hotplug/ibmphp_core.c @@ -1419,3 +1419,4 @@ static void __exit ibmphp_exit(void) } module_init(ibmphp_init); +module_exit(ibmphp_exit); Index: linux-2.6-tip/drivers/pci/intel-iommu.c =================================================================== --- linux-2.6-tip.orig/drivers/pci/intel-iommu.c +++ linux-2.6-tip/drivers/pci/intel-iommu.c @@ -2284,11 +2284,13 @@ error: return 0; } -dma_addr_t intel_map_single(struct device *hwdev, phys_addr_t paddr, - size_t size, int dir) +static dma_addr_t intel_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction dir, + struct dma_attrs *attrs) { - return __intel_map_single(hwdev, paddr, size, dir, - to_pci_dev(hwdev)->dma_mask); + return __intel_map_single(dev, page_to_phys(page) + offset, size, + dir, to_pci_dev(dev)->dma_mask); } static void flush_unmaps(void) @@ -2352,8 +2354,9 @@ static void add_unmap(struct dmar_domain spin_unlock_irqrestore(&async_umap_flush_lock, flags); } -void intel_unmap_single(struct device *dev, dma_addr_t dev_addr, size_t size, - int dir) +static void intel_unmap_page(struct device *dev, dma_addr_t dev_addr, + size_t size, enum dma_data_direction dir, + struct dma_attrs *attrs) { struct pci_dev *pdev = to_pci_dev(dev); struct dmar_domain *domain; @@ -2397,8 +2400,14 @@ void intel_unmap_single(struct device *d } } -void *intel_alloc_coherent(struct device *hwdev, size_t size, - dma_addr_t *dma_handle, gfp_t flags) +static void intel_unmap_single(struct device *dev, dma_addr_t dev_addr, size_t size, + int dir) +{ + intel_unmap_page(dev, dev_addr, size, dir, NULL); +} + +static void *intel_alloc_coherent(struct device *hwdev, size_t size, + dma_addr_t *dma_handle, gfp_t flags) { void *vaddr; int order; @@ -2421,8 +2430,8 @@ void *intel_alloc_coherent(struct device return NULL; } -void intel_free_coherent(struct device *hwdev, size_t size, void *vaddr, - dma_addr_t dma_handle) +static void intel_free_coherent(struct device *hwdev, size_t size, void *vaddr, + dma_addr_t dma_handle) { int order; @@ -2435,8 +2444,9 @@ void intel_free_coherent(struct device * #define SG_ENT_VIRT_ADDRESS(sg) (sg_virt((sg))) -void intel_unmap_sg(struct device *hwdev, struct scatterlist *sglist, - int nelems, int dir) +static void intel_unmap_sg(struct device *hwdev, struct scatterlist *sglist, + int nelems, enum dma_data_direction dir, + struct dma_attrs *attrs) { int i; struct pci_dev *pdev = to_pci_dev(hwdev); @@ -2493,8 +2503,8 @@ static int intel_nontranslate_map_sg(str return nelems; } -int intel_map_sg(struct device *hwdev, struct scatterlist *sglist, int nelems, - int dir) +static int intel_map_sg(struct device *hwdev, struct scatterlist *sglist, int nelems, + enum dma_data_direction dir, struct dma_attrs *attrs) { void *addr; int i; @@ -2574,13 +2584,19 @@ int intel_map_sg(struct device *hwdev, s return nelems; } -static struct dma_mapping_ops intel_dma_ops = { +static int intel_mapping_error(struct device *dev, dma_addr_t dma_addr) +{ + return !dma_addr; +} + +struct dma_map_ops intel_dma_ops = { .alloc_coherent = intel_alloc_coherent, .free_coherent = intel_free_coherent, - .map_single = intel_map_single, - .unmap_single = intel_unmap_single, .map_sg = intel_map_sg, .unmap_sg = intel_unmap_sg, + .map_page = intel_map_page, + .unmap_page = intel_unmap_page, + .mapping_error = intel_mapping_error, }; static inline int iommu_domain_cache_init(void) Index: linux-2.6-tip/drivers/pci/intr_remapping.c =================================================================== --- linux-2.6-tip.orig/drivers/pci/intr_remapping.c +++ linux-2.6-tip/drivers/pci/intr_remapping.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include "intr_remapping.h" @@ -20,7 +21,7 @@ struct irq_2_iommu { u8 irte_mask; }; -#ifdef CONFIG_SPARSE_IRQ +#ifdef CONFIG_GENERIC_HARDIRQS static struct irq_2_iommu *get_one_free_irq_2_iommu(int cpu) { struct irq_2_iommu *iommu; Index: linux-2.6-tip/drivers/pci/search.c =================================================================== --- linux-2.6-tip.orig/drivers/pci/search.c +++ linux-2.6-tip/drivers/pci/search.c @@ -277,8 +277,12 @@ static struct pci_dev *pci_get_dev_by_id match_pci_dev_by_id); if (dev) pdev = to_pci_dev(dev); + + /* + * FIXME: take the cast off, when pci_dev_put() is made const: + */ if (from) - pci_dev_put(from); + pci_dev_put((struct pci_dev *)from); return pdev; } Index: linux-2.6-tip/drivers/platform/x86/fujitsu-laptop.c =================================================================== --- linux-2.6-tip.orig/drivers/platform/x86/fujitsu-laptop.c +++ linux-2.6-tip/drivers/platform/x86/fujitsu-laptop.c @@ -1301,4 +1301,4 @@ static struct pnp_device_id pnp_ids[] = {.id = ""} }; -MODULE_DEVICE_TABLE(pnp, pnp_ids); +MODULE_STATIC_DEVICE_TABLE(pnp, pnp_ids); Index: linux-2.6-tip/drivers/platform/x86/toshiba_acpi.c =================================================================== --- linux-2.6-tip.orig/drivers/platform/x86/toshiba_acpi.c +++ linux-2.6-tip/drivers/platform/x86/toshiba_acpi.c @@ -729,8 +729,8 @@ static int __init toshiba_acpi_init(void { acpi_status status = AE_OK; u32 hci_result; - bool bt_present; - bool bt_on; + bool uninitialized_var(bt_present); + bool uninitialized_var(bt_on); bool radio_on; int ret = 0; Index: linux-2.6-tip/drivers/pnp/pnpbios/core.c =================================================================== --- linux-2.6-tip.orig/drivers/pnp/pnpbios/core.c +++ linux-2.6-tip/drivers/pnp/pnpbios/core.c @@ -573,6 +573,8 @@ static int __init pnpbios_init(void) fs_initcall(pnpbios_init); +#ifdef CONFIG_HOTPLUG + static int __init pnpbios_thread_init(void) { struct task_struct *task; @@ -583,16 +585,18 @@ static int __init pnpbios_thread_init(vo #endif if (pnpbios_disabled) return 0; -#ifdef CONFIG_HOTPLUG + init_completion(&unload_sem); task = kthread_run(pnp_dock_thread, NULL, "kpnpbiosd"); if (!IS_ERR(task)) unloading = 0; -#endif + return 0; } /* Start the kernel thread later: */ module_init(pnpbios_thread_init); +#endif + EXPORT_SYMBOL(pnpbios_protocol); Index: linux-2.6-tip/drivers/scsi/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/scsi/Kconfig +++ linux-2.6-tip/drivers/scsi/Kconfig @@ -608,6 +608,7 @@ config SCSI_FLASHPOINT config LIBFC tristate "LibFC module" select SCSI_FC_ATTRS + select CRC32 ---help--- Fibre Channel library module Index: linux-2.6-tip/drivers/scsi/advansys.c =================================================================== --- linux-2.6-tip.orig/drivers/scsi/advansys.c +++ linux-2.6-tip/drivers/scsi/advansys.c @@ -68,7 +68,9 @@ * 7. advansys_info is not safe against multiple simultaneous callers * 8. Add module_param to override ISA/VLB ioport array */ -#warning this driver is still not properly converted to the DMA API +#ifdef CONFIG_ALLOW_WARNINGS +# warning this driver is still not properly converted to the DMA API +#endif /* Enable driver /proc statistics. */ #define ADVANSYS_STATS @@ -10516,7 +10518,7 @@ AscSendScsiQueue(ASC_DVC_VAR *asc_dvc, A { PortAddr iop_base; uchar free_q_head; - uchar next_qp; + uchar uninitialized_var(next_qp); uchar tid_no; uchar target_ix; int sta; @@ -10945,7 +10947,7 @@ static int asc_execute_scsi_cmnd(struct err_code = asc_dvc->err_code; } else { ADV_DVC_VAR *adv_dvc = &boardp->dvc_var.adv_dvc_var; - ADV_SCSI_REQ_Q *adv_scsiqp; + ADV_SCSI_REQ_Q *uninitialized_var(adv_scsiqp); switch (adv_build_req(boardp, scp, &adv_scsiqp)) { case ASC_NOERROR: @@ -13877,7 +13879,9 @@ static int __devinit advansys_board_foun #endif err_free_proc: kfree(boardp->prtbuf); +#ifdef CONFIG_PROC_FS err_unmap: +#endif if (boardp->ioremap_addr) iounmap(boardp->ioremap_addr); err_shost: Index: linux-2.6-tip/drivers/scsi/dpt_i2o.c =================================================================== --- linux-2.6-tip.orig/drivers/scsi/dpt_i2o.c +++ linux-2.6-tip/drivers/scsi/dpt_i2o.c @@ -183,7 +183,7 @@ static struct pci_device_id dptids[] = { { PCI_DPT_VENDOR_ID, PCI_DPT_RAPTOR_DEVICE_ID, PCI_ANY_ID, PCI_ANY_ID,}, { 0, } }; -MODULE_DEVICE_TABLE(pci,dptids); +MODULE_STATIC_DEVICE_TABLE(pci,dptids); static int adpt_detect(struct scsi_host_template* sht) { Index: linux-2.6-tip/drivers/scsi/dtc.c =================================================================== --- linux-2.6-tip.orig/drivers/scsi/dtc.c +++ linux-2.6-tip/drivers/scsi/dtc.c @@ -165,36 +165,6 @@ static const struct signature { #define NO_SIGNATURES ARRAY_SIZE(signatures) -#ifndef MODULE -/* - * Function : dtc_setup(char *str, int *ints) - * - * Purpose : LILO command line initialization of the overrides array, - * - * Inputs : str - unused, ints - array of integer parameters with ints[0] - * equal to the number of ints. - * - */ - -static void __init dtc_setup(char *str, int *ints) -{ - static int commandline_current = 0; - int i; - if (ints[0] != 2) - printk("dtc_setup: usage dtc=address,irq\n"); - else if (commandline_current < NO_OVERRIDES) { - overrides[commandline_current].address = ints[1]; - overrides[commandline_current].irq = ints[2]; - for (i = 0; i < NO_BASES; ++i) - if (bases[i].address == ints[1]) { - bases[i].noauto = 1; - break; - } - ++commandline_current; - } -} -#endif - /* * Function : int dtc_detect(struct scsi_host_template * tpnt) * Index: linux-2.6-tip/drivers/scsi/fdomain.c =================================================================== --- linux-2.6-tip.orig/drivers/scsi/fdomain.c +++ linux-2.6-tip/drivers/scsi/fdomain.c @@ -1774,7 +1774,7 @@ static struct pci_device_id fdomain_pci_ PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0UL }, { } }; -MODULE_DEVICE_TABLE(pci, fdomain_pci_tbl); +MODULE_STATIC_DEVICE_TABLE(pci, fdomain_pci_tbl); #endif #define driver_template fdomain_driver_template #include "scsi_module.c" Index: linux-2.6-tip/drivers/scsi/g_NCR5380.c =================================================================== --- linux-2.6-tip.orig/drivers/scsi/g_NCR5380.c +++ linux-2.6-tip/drivers/scsi/g_NCR5380.c @@ -938,18 +938,6 @@ module_param(ncr_53c400a, int, 0); module_param(dtc_3181e, int, 0); MODULE_LICENSE("GPL"); - -static struct isapnp_device_id id_table[] __devinitdata = { - { - ISAPNP_ANY_ID, ISAPNP_ANY_ID, - ISAPNP_VENDOR('D', 'T', 'C'), ISAPNP_FUNCTION(0x436e), - 0}, - {0} -}; - -MODULE_DEVICE_TABLE(isapnp, id_table); - - __setup("ncr5380=", do_NCR5380_setup); __setup("ncr53c400=", do_NCR53C400_setup); __setup("ncr53c400a=", do_NCR53C400A_setup); Index: linux-2.6-tip/drivers/scsi/initio.c =================================================================== --- linux-2.6-tip.orig/drivers/scsi/initio.c +++ linux-2.6-tip/drivers/scsi/initio.c @@ -136,7 +136,7 @@ static struct pci_device_id i91u_pci_dev { PCI_VENDOR_ID_DOMEX, I920_DEVICE_ID, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0}, { } }; -MODULE_DEVICE_TABLE(pci, i91u_pci_devices); +MODULE_STATIC_DEVICE_TABLE(pci, i91u_pci_devices); #define DEBUG_INTERRUPT 0 #define DEBUG_QUEUE 0 Index: linux-2.6-tip/drivers/scsi/lpfc/lpfc_els.c =================================================================== --- linux-2.6-tip.orig/drivers/scsi/lpfc/lpfc_els.c +++ linux-2.6-tip/drivers/scsi/lpfc/lpfc_els.c @@ -3968,7 +3968,8 @@ lpfc_els_rcv_rscn(struct lpfc_vport *vpo struct lpfc_dmabuf *pcmd; uint32_t *lp, *datap; IOCB_t *icmd; - uint32_t payload_len, length, nportid, *cmd; + uint32_t payload_len, uninitialized_var(length), nportid, + *uninitialized_var(cmd); int rscn_cnt; int rscn_id = 0, hba_id = 0; int i; Index: linux-2.6-tip/drivers/scsi/megaraid/megaraid_mm.c =================================================================== --- linux-2.6-tip.orig/drivers/scsi/megaraid/megaraid_mm.c +++ linux-2.6-tip/drivers/scsi/megaraid/megaraid_mm.c @@ -117,7 +117,7 @@ mraid_mm_ioctl(struct inode *inode, stru int rval; mraid_mmadp_t *adp; uint8_t old_ioctl; - int drvrcmd_rval; + int uninitialized_var(drvrcmd_rval); void __user *argp = (void __user *)arg; /* Index: linux-2.6-tip/drivers/scsi/ncr53c8xx.c =================================================================== --- linux-2.6-tip.orig/drivers/scsi/ncr53c8xx.c +++ linux-2.6-tip/drivers/scsi/ncr53c8xx.c @@ -8295,7 +8295,7 @@ __setup("ncr53c8xx=", ncr53c8xx_setup); struct Scsi_Host * __init ncr_attach(struct scsi_host_template *tpnt, int unit, struct ncr_device *device) { - struct host_data *host_data; + struct host_data *uninitialized_var(host_data); struct ncb *np = NULL; struct Scsi_Host *instance = NULL; u_long flags = 0; Index: linux-2.6-tip/drivers/scsi/qla4xxx/ql4_mbx.c =================================================================== --- linux-2.6-tip.orig/drivers/scsi/qla4xxx/ql4_mbx.c +++ linux-2.6-tip/drivers/scsi/qla4xxx/ql4_mbx.c @@ -867,7 +867,7 @@ int qla4xxx_send_tgts(struct scsi_qla_ho { struct dev_db_entry *fw_ddb_entry; dma_addr_t fw_ddb_entry_dma; - uint32_t ddb_index; + uint32_t uninitialized_var(ddb_index); int ret_val = QLA_SUCCESS; Index: linux-2.6-tip/drivers/scsi/scsi_lib.c =================================================================== --- linux-2.6-tip.orig/drivers/scsi/scsi_lib.c +++ linux-2.6-tip/drivers/scsi/scsi_lib.c @@ -703,71 +703,6 @@ void scsi_run_host_queues(struct Scsi_Ho static void __scsi_release_buffers(struct scsi_cmnd *, int); -/* - * Function: scsi_end_request() - * - * Purpose: Post-processing of completed commands (usually invoked at end - * of upper level post-processing and scsi_io_completion). - * - * Arguments: cmd - command that is complete. - * error - 0 if I/O indicates success, < 0 for I/O error. - * bytes - number of bytes of completed I/O - * requeue - indicates whether we should requeue leftovers. - * - * Lock status: Assumed that lock is not held upon entry. - * - * Returns: cmd if requeue required, NULL otherwise. - * - * Notes: This is called for block device requests in order to - * mark some number of sectors as complete. - * - * We are guaranteeing that the request queue will be goosed - * at some point during this call. - * Notes: If cmd was requeued, upon return it will be a stale pointer. - */ -static struct scsi_cmnd *scsi_end_request(struct scsi_cmnd *cmd, int error, - int bytes, int requeue) -{ - struct request_queue *q = cmd->device->request_queue; - struct request *req = cmd->request; - - /* - * If there are blocks left over at the end, set up the command - * to queue the remainder of them. - */ - if (blk_end_request(req, error, bytes)) { - int leftover = (req->hard_nr_sectors << 9); - - if (blk_pc_request(req)) - leftover = req->data_len; - - /* kill remainder if no retrys */ - if (error && scsi_noretry_cmd(cmd)) - blk_end_request(req, error, leftover); - else { - if (requeue) { - /* - * Bleah. Leftovers again. Stick the - * leftovers in the front of the - * queue, and goose the queue again. - */ - scsi_release_buffers(cmd); - scsi_requeue_command(q, cmd); - cmd = NULL; - } - return cmd; - } - } - - /* - * This will goose the queue request function at the end, so we don't - * need to worry about launching another command. - */ - __scsi_release_buffers(cmd, 0); - scsi_next_command(cmd); - return NULL; -} - static inline unsigned int scsi_sgtable_index(unsigned short nents) { unsigned int index; @@ -929,7 +864,6 @@ static void scsi_end_bidi_request(struct void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes) { int result = cmd->result; - int this_count; struct request_queue *q = cmd->device->request_queue; struct request *req = cmd->request; int error = 0; @@ -980,18 +914,30 @@ void scsi_io_completion(struct scsi_cmnd SCSI_LOG_HLCOMPLETE(1, printk("%ld sectors total, " "%d bytes done.\n", req->nr_sectors, good_bytes)); - - /* A number of bytes were successfully read. If there - * are leftovers and there is some kind of error - * (result != 0), retry the rest. - */ - if (scsi_end_request(cmd, error, good_bytes, result == 0) == NULL) + if (blk_end_request(req, error, good_bytes) == 0) { + /* This request is completely finished; start the next one */ + scsi_next_command(cmd); return; - this_count = blk_rq_bytes(req); + } error = -EIO; - if (host_byte(result) == DID_RESET) { + /* The request isn't finished yet. Figure out what to do next. */ + if (result == 0) { + /* No error, so carry out the remainder of the request. + * Failure to make forward progress counts against the + * the number of retries. + */ + if (good_bytes > 0 || --req->retries >= 0) + action = ACTION_REPREP; + else { + action = ACTION_FAIL; + description = "Retries exhausted"; + } + } else if (error && scsi_noretry_cmd(cmd)) { + /* Retrys are disallowed, so kill the remainder. */ + action = ACTION_FAIL; + } else if (host_byte(result) == DID_RESET) { /* Third party bus reset or reset for error recovery * reasons. Just retry the command and see what * happens. Index: linux-2.6-tip/drivers/telephony/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/telephony/Kconfig +++ linux-2.6-tip/drivers/telephony/Kconfig @@ -20,6 +20,8 @@ if PHONE config PHONE_IXJ tristate "QuickNet Internet LineJack/PhoneJack support" depends on ISA || PCI + # build breakage, config-Sat_Jul_19_00_58_16_CEST_2008.bad + depends on 0 ---help--- Say M if you have a telephony card manufactured by Quicknet Technologies, Inc. These include the Internet PhoneJACK and Index: linux-2.6-tip/drivers/telephony/ixj.c =================================================================== --- linux-2.6-tip.orig/drivers/telephony/ixj.c +++ linux-2.6-tip/drivers/telephony/ixj.c @@ -288,7 +288,7 @@ static struct pci_device_id ixj_pci_tbl[ { } }; -MODULE_DEVICE_TABLE(pci, ixj_pci_tbl); +MODULE_STATIC_DEVICE_TABLE(pci, ixj_pci_tbl); /************************************************************************ * Index: linux-2.6-tip/drivers/usb/atm/ueagle-atm.c =================================================================== --- linux-2.6-tip.orig/drivers/usb/atm/ueagle-atm.c +++ linux-2.6-tip/drivers/usb/atm/ueagle-atm.c @@ -1427,7 +1427,7 @@ static int uea_stat_e1(struct uea_softc static int uea_stat_e4(struct uea_softc *sc) { u32 data; - u32 tmp_arr[2]; + u32 tmp_arr[2] = { 0, }; int ret; uea_enters(INS_TO_USBDEV(sc)); Index: linux-2.6-tip/drivers/usb/gadget/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/usb/gadget/Kconfig +++ linux-2.6-tip/drivers/usb/gadget/Kconfig @@ -15,6 +15,9 @@ menuconfig USB_GADGET tristate "USB Gadget Support" + # crashes on titan with: + # http://redhat.com/~mingo/misc/config-Tue_Jul_22_13_44_45_CEST_2008.bad + depends on 0 help USB is a master/slave protocol, organized with one master host (such as a PC) controlling up to 127 peripheral devices. Index: linux-2.6-tip/drivers/usb/host/Kconfig =================================================================== --- linux-2.6-tip.orig/drivers/usb/host/Kconfig +++ linux-2.6-tip/drivers/usb/host/Kconfig @@ -329,6 +329,8 @@ config USB_WHCI_HCD tristate "Wireless USB Host Controller Interface (WHCI) driver (EXPERIMENTAL)" depends on EXPERIMENTAL depends on PCI && USB + depends on 0 + select USB_WUSB select UWB_WHCI help Index: linux-2.6-tip/drivers/usb/serial/io_edgeport.c =================================================================== --- linux-2.6-tip.orig/drivers/usb/serial/io_edgeport.c +++ linux-2.6-tip/drivers/usb/serial/io_edgeport.c @@ -293,7 +293,7 @@ static void update_edgeport_E2PROM(struc __u16 BootBuildNumber; __u32 Bootaddr; const struct ihex_binrec *rec; - const struct firmware *fw; + const struct firmware *uninitialized_var(fw); const char *fw_name; int response; @@ -2457,7 +2457,7 @@ static int send_cmd_write_baud_rate(stru unsigned char *cmdBuffer; unsigned char *currCmd; int cmdLen = 0; - int divisor; + int uninitialized_var(divisor); int status; unsigned char number = edge_port->port->number - edge_port->port->serial->minor; Index: linux-2.6-tip/drivers/usb/serial/keyspan.c =================================================================== --- linux-2.6-tip.orig/drivers/usb/serial/keyspan.c +++ linux-2.6-tip/drivers/usb/serial/keyspan.c @@ -1345,7 +1345,7 @@ static int keyspan_fake_startup(struct u int response; const struct ihex_binrec *record; char *fw_name; - const struct firmware *fw; + const struct firmware *uninitialized_var(fw); dbg("Keyspan startup version %04x product %04x", le16_to_cpu(serial->dev->descriptor.bcdDevice), Index: linux-2.6-tip/drivers/usb/serial/keyspan_pda.c =================================================================== --- linux-2.6-tip.orig/drivers/usb/serial/keyspan_pda.c +++ linux-2.6-tip/drivers/usb/serial/keyspan_pda.c @@ -456,7 +456,7 @@ static int keyspan_pda_tiocmget(struct t struct usb_serial_port *port = tty->driver_data; struct usb_serial *serial = port->serial; int rc; - unsigned char status; + unsigned char uninitialized_var(status); int value; rc = keyspan_pda_get_modem_info(serial, &status); @@ -478,7 +478,7 @@ static int keyspan_pda_tiocmset(struct t struct usb_serial_port *port = tty->driver_data; struct usb_serial *serial = port->serial; int rc; - unsigned char status; + unsigned char uninitialized_var(status); rc = keyspan_pda_get_modem_info(serial, &status); if (rc < 0) @@ -726,7 +726,7 @@ static int keyspan_pda_fake_startup(stru int response; const char *fw_name; const struct ihex_binrec *record; - const struct firmware *fw; + const struct firmware *uninitialized_var(fw); /* download the firmware here ... */ response = ezusb_set_reset(serial, 1); Index: linux-2.6-tip/drivers/usb/serial/mos7720.c =================================================================== --- linux-2.6-tip.orig/drivers/usb/serial/mos7720.c +++ linux-2.6-tip/drivers/usb/serial/mos7720.c @@ -959,7 +959,7 @@ static int send_cmd_write_baud_rate(stru { struct usb_serial_port *port; struct usb_serial *serial; - int divisor; + int uninitialized_var(divisor); int status; unsigned char data; unsigned char number; Index: linux-2.6-tip/drivers/uwb/i1480/i1480-est.c =================================================================== --- linux-2.6-tip.orig/drivers/uwb/i1480/i1480-est.c +++ linux-2.6-tip/drivers/uwb/i1480/i1480-est.c @@ -96,4 +96,4 @@ static struct usb_device_id i1480_est_id { USB_DEVICE(0x8086, 0x0c3b), }, { }, }; -MODULE_DEVICE_TABLE(usb, i1480_est_id_table); +MODULE_STATIC_DEVICE_TABLE(usb, i1480_est_id_table); Index: linux-2.6-tip/drivers/uwb/whc-rc.c =================================================================== --- linux-2.6-tip.orig/drivers/uwb/whc-rc.c +++ linux-2.6-tip/drivers/uwb/whc-rc.c @@ -452,7 +452,7 @@ static struct pci_device_id whcrc_id_tab { PCI_DEVICE_CLASS(PCI_CLASS_WIRELESS_WHCI, ~0) }, { /* empty last entry */ } }; -MODULE_DEVICE_TABLE(pci, whcrc_id_table); +MODULE_STATIC_DEVICE_TABLE(pci, whcrc_id_table); static struct umc_driver whcrc_driver = { .name = "whc-rc", Index: linux-2.6-tip/drivers/uwb/wlp/messages.c =================================================================== --- linux-2.6-tip.orig/drivers/uwb/wlp/messages.c +++ linux-2.6-tip/drivers/uwb/wlp/messages.c @@ -903,7 +903,7 @@ int wlp_parse_f0(struct wlp *wlp, struct size_t len = skb->len; size_t used; ssize_t result; - struct wlp_nonce enonce, rnonce; + struct wlp_nonce uninitialized_var(enonce), uninitialized_var(rnonce); enum wlp_assc_error assc_err; char enonce_buf[WLP_WSS_NONCE_STRSIZE]; char rnonce_buf[WLP_WSS_NONCE_STRSIZE]; Index: linux-2.6-tip/drivers/video/aty/atyfb_base.c =================================================================== --- linux-2.6-tip.orig/drivers/video/aty/atyfb_base.c +++ linux-2.6-tip/drivers/video/aty/atyfb_base.c @@ -430,7 +430,7 @@ static int __devinit correct_chipset(str u16 type; u32 chip_id; const char *name; - int i; + long i; for (i = ARRAY_SIZE(aty_chips) - 1; i >= 0; i--) if (par->pci_id == aty_chips[i].pci_id) @@ -529,8 +529,10 @@ static int __devinit correct_chipset(str return 0; } +#if defined(CONFIG_FB_ATY_GX) || defined(CONFIG_FB_ATY_CT) static char ram_dram[] __devinitdata = "DRAM"; static char ram_resv[] __devinitdata = "RESV"; +#endif #ifdef CONFIG_FB_ATY_GX static char ram_vram[] __devinitdata = "VRAM"; #endif /* CONFIG_FB_ATY_GX */ @@ -3860,3 +3862,4 @@ MODULE_PARM_DESC(mode, "Specify resoluti module_param(nomtrr, bool, 0); MODULE_PARM_DESC(nomtrr, "bool: disable use of MTRR registers"); #endif + Index: linux-2.6-tip/drivers/video/matrox/matroxfb_crtc2.c =================================================================== --- linux-2.6-tip.orig/drivers/video/matrox/matroxfb_crtc2.c +++ linux-2.6-tip/drivers/video/matrox/matroxfb_crtc2.c @@ -262,7 +262,7 @@ static int matroxfb_dh_open(struct fb_in #define m2info (container_of(info, struct matroxfb_dh_fb_info, fbcon)) MINFO_FROM(m2info->primary_dev); - if (MINFO) { + if (MINFO != NULL) { int err; if (ACCESS_FBINFO(dead)) { @@ -282,7 +282,7 @@ static int matroxfb_dh_release(struct fb int err = 0; MINFO_FROM(m2info->primary_dev); - if (MINFO) { + if (MINFO != NULL) { err = ACCESS_FBINFO(fbops).fb_release(&ACCESS_FBINFO(fbcon), user); } return err; Index: linux-2.6-tip/drivers/video/mb862xx/mb862xxfb.c =================================================================== --- linux-2.6-tip.orig/drivers/video/mb862xx/mb862xxfb.c +++ linux-2.6-tip/drivers/video/mb862xx/mb862xxfb.c @@ -85,6 +85,8 @@ static inline unsigned int chan_to_field return chan << bf->offset; } +#if defined(CONFIG_FB_MB862XX_PCI_GDC) || defined(CONFIG_FB_MB862XX_LIME) + static int mb862xxfb_setcolreg(unsigned regno, unsigned red, unsigned green, unsigned blue, unsigned transp, struct fb_info *info) @@ -458,6 +460,8 @@ static ssize_t mb862xxfb_show_dispregs(s static DEVICE_ATTR(dispregs, 0444, mb862xxfb_show_dispregs, NULL); +#endif + irqreturn_t mb862xx_intr(int irq, void *dev_id) { struct mb862xxfb_par *par = (struct mb862xxfb_par *) dev_id; Index: linux-2.6-tip/drivers/video/sis/init301.c =================================================================== --- linux-2.6-tip.orig/drivers/video/sis/init301.c +++ linux-2.6-tip/drivers/video/sis/init301.c @@ -6691,7 +6691,7 @@ SiS_SetGroup2(struct SiS_Private *SiS_Pr bool newtvphase; const unsigned char *TimingPoint; #ifdef SIS315H - unsigned short resindex, CRT2Index; + unsigned short uninitialized_var(resindex), uninitialized_var(CRT2Index); const struct SiS_Part2PortTbl *CRT2Part2Ptr = NULL; if(SiS_Pr->SiS_VBInfo & SetCRT2ToLCDA) return; Index: linux-2.6-tip/drivers/video/sis/sis_main.c =================================================================== --- linux-2.6-tip.orig/drivers/video/sis/sis_main.c +++ linux-2.6-tip/drivers/video/sis/sis_main.c @@ -4175,6 +4175,7 @@ sisfb_find_rom(struct pci_dev *pdev) return myrombase; } +#if defined(CONFIG_FB_SIS_300) || defined(CONFIG_FB_SIS_315) static void __devinit sisfb_post_map_vram(struct sis_video_info *ivideo, unsigned int *mapsize, unsigned int min) @@ -4197,6 +4198,7 @@ sisfb_post_map_vram(struct sis_video_inf } } } +#endif #ifdef CONFIG_FB_SIS_300 static int __devinit Index: linux-2.6-tip/drivers/watchdog/alim1535_wdt.c =================================================================== --- linux-2.6-tip.orig/drivers/watchdog/alim1535_wdt.c +++ linux-2.6-tip/drivers/watchdog/alim1535_wdt.c @@ -306,7 +306,7 @@ static struct pci_device_id ali_pci_tbl[ { PCI_VENDOR_ID_AL, 0x1535, PCI_ANY_ID, PCI_ANY_ID,}, { 0, }, }; -MODULE_DEVICE_TABLE(pci, ali_pci_tbl); +MODULE_STATIC_DEVICE_TABLE(pci, ali_pci_tbl); /* * ali_find_watchdog - find a 1535 and 7101 Index: linux-2.6-tip/drivers/watchdog/alim7101_wdt.c =================================================================== --- linux-2.6-tip.orig/drivers/watchdog/alim7101_wdt.c +++ linux-2.6-tip/drivers/watchdog/alim7101_wdt.c @@ -427,7 +427,7 @@ static struct pci_device_id alim7101_pci { } }; -MODULE_DEVICE_TABLE(pci, alim7101_pci_tbl); +MODULE_STATIC_DEVICE_TABLE(pci, alim7101_pci_tbl); MODULE_AUTHOR("Steve Hill"); MODULE_DESCRIPTION("ALi M7101 PMU Computer Watchdog Timer driver"); Index: linux-2.6-tip/drivers/watchdog/i6300esb.c =================================================================== --- linux-2.6-tip.orig/drivers/watchdog/i6300esb.c +++ linux-2.6-tip/drivers/watchdog/i6300esb.c @@ -355,20 +355,6 @@ static struct notifier_block esb_notifie }; /* - * Data for PCI driver interface - * - * This data only exists for exporting the supported - * PCI ids via MODULE_DEVICE_TABLE. We do not actually - * register a pci_driver, because someone else might one day - * want to register another driver on the same PCI id. - */ -static struct pci_device_id esb_pci_tbl[] = { - { PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ESB_9), }, - { 0, }, /* End of list */ -}; -MODULE_DEVICE_TABLE(pci, esb_pci_tbl); - -/* * Init & exit routines */ Index: linux-2.6-tip/drivers/watchdog/rdc321x_wdt.c =================================================================== --- linux-2.6-tip.orig/drivers/watchdog/rdc321x_wdt.c +++ linux-2.6-tip/drivers/watchdog/rdc321x_wdt.c @@ -37,7 +37,7 @@ #include #include -#include +#include #define RDC_WDT_MASK 0x80000000 /* Mask */ #define RDC_WDT_EN 0x00800000 /* Enable bit */ Index: linux-2.6-tip/drivers/watchdog/w83697ug_wdt.c =================================================================== --- linux-2.6-tip.orig/drivers/watchdog/w83697ug_wdt.c +++ linux-2.6-tip/drivers/watchdog/w83697ug_wdt.c @@ -79,7 +79,7 @@ MODULE_PARM_DESC(nowayout, (same as EFER) */ #define WDT_EFDR (WDT_EFIR+1) /* Extended Function Data Register */ -static void w83697ug_select_wd_register(void) +static int w83697ug_select_wd_register(void) { unsigned char c; unsigned char version; @@ -102,7 +102,7 @@ static void w83697ug_select_wd_register( } else { printk(KERN_ERR PFX "No W83697UG/UF could be found\n"); - return; + return -EIO; } outb_p(0x07, WDT_EFER); /* point to logical device number reg */ @@ -110,6 +110,8 @@ static void w83697ug_select_wd_register( outb_p(0x30, WDT_EFER); /* select CR30 */ c = inb_p(WDT_EFDR); outb_p(c || 0x01, WDT_EFDR); /* set bit 0 to activate GPIO2 */ + + return 0; } static void w83697ug_unselect_wd_register(void) @@ -117,11 +119,12 @@ static void w83697ug_unselect_wd_registe outb_p(0xAA, WDT_EFER); /* Leave extended function mode */ } -static void w83697ug_init(void) +static int w83697ug_init(void) { unsigned char t; - w83697ug_select_wd_register(); + if (w83697ug_select_wd_register()) + return -EIO; outb_p(0xF6, WDT_EFER); /* Select CRF6 */ t = inb_p(WDT_EFDR); /* read CRF6 */ @@ -137,6 +140,8 @@ static void w83697ug_init(void) outb_p(t, WDT_EFDR); /* Write back to CRF5 */ w83697ug_unselect_wd_register(); + + return 0; } static void wdt_ctrl(int timeout) @@ -347,7 +352,11 @@ static int __init wdt_init(void) goto out; } - w83697ug_init(); + ret = w83697ug_init(); + if (ret) { + printk(KERN_ERR PFX "init failed\n"); + goto unreg_regions; + } ret = register_reboot_notifier(&wdt_notifier); if (ret != 0) { Index: linux-2.6-tip/drivers/xen/events.c =================================================================== --- linux-2.6-tip.orig/drivers/xen/events.c +++ linux-2.6-tip/drivers/xen/events.c @@ -26,9 +26,11 @@ #include #include #include +#include #include #include +#include #include #include #include @@ -50,36 +52,55 @@ static DEFINE_PER_CPU(int, virq_to_irq[N /* IRQ <-> IPI mapping */ static DEFINE_PER_CPU(int, ipi_to_irq[XEN_NR_IPIS]) = {[0 ... XEN_NR_IPIS-1] = -1}; -/* Packed IRQ information: binding type, sub-type index, and event channel. */ -struct packed_irq -{ - unsigned short evtchn; - unsigned char index; - unsigned char type; -}; - -static struct packed_irq irq_info[NR_IRQS]; - -/* Binding types. */ -enum { - IRQT_UNBOUND, +/* Interrupt types. */ +enum xen_irq_type { + IRQT_UNBOUND = 0, IRQT_PIRQ, IRQT_VIRQ, IRQT_IPI, IRQT_EVTCHN }; -/* Convenient shorthand for packed representation of an unbound IRQ. */ -#define IRQ_UNBOUND mk_irq_info(IRQT_UNBOUND, 0, 0) +/* + * Packed IRQ information: + * type - enum xen_irq_type + * event channel - irq->event channel mapping + * cpu - cpu this event channel is bound to + * index - type-specific information: + * PIRQ - vector, with MSB being "needs EIO" + * VIRQ - virq number + * IPI - IPI vector + * EVTCHN - + */ +struct irq_info +{ + enum xen_irq_type type; /* type */ + unsigned short evtchn; /* event channel */ + unsigned short cpu; /* cpu bound */ + + union { + unsigned short virq; + enum ipi_vector ipi; + struct { + unsigned short gsi; + unsigned short vector; + } pirq; + } u; +}; + +static struct irq_info irq_info[NR_IRQS]; static int evtchn_to_irq[NR_EVENT_CHANNELS] = { [0 ... NR_EVENT_CHANNELS-1] = -1 }; -static unsigned long cpu_evtchn_mask[NR_CPUS][NR_EVENT_CHANNELS/BITS_PER_LONG]; -static u8 cpu_evtchn[NR_EVENT_CHANNELS]; - -/* Reference counts for bindings to IRQs. */ -static int irq_bindcount[NR_IRQS]; +struct cpu_evtchn_s { + unsigned long bits[NR_EVENT_CHANNELS/BITS_PER_LONG]; +}; +static struct cpu_evtchn_s *cpu_evtchn_mask_p; +static inline unsigned long *cpu_evtchn_mask(int cpu) +{ + return cpu_evtchn_mask_p[cpu].bits; +} /* Xen will never allocate port zero for any purpose. */ #define VALID_EVTCHN(chn) ((chn) != 0) @@ -87,27 +108,108 @@ static int irq_bindcount[NR_IRQS]; static struct irq_chip xen_dynamic_chip; /* Constructor for packed IRQ information. */ -static inline struct packed_irq mk_irq_info(u32 type, u32 index, u32 evtchn) +static struct irq_info mk_unbound_info(void) +{ + return (struct irq_info) { .type = IRQT_UNBOUND }; +} + +static struct irq_info mk_evtchn_info(unsigned short evtchn) +{ + return (struct irq_info) { .type = IRQT_EVTCHN, .evtchn = evtchn, + .cpu = 0 }; +} + +static struct irq_info mk_ipi_info(unsigned short evtchn, enum ipi_vector ipi) { - return (struct packed_irq) { evtchn, index, type }; + return (struct irq_info) { .type = IRQT_IPI, .evtchn = evtchn, + .cpu = 0, .u.ipi = ipi }; +} + +static struct irq_info mk_virq_info(unsigned short evtchn, unsigned short virq) +{ + return (struct irq_info) { .type = IRQT_VIRQ, .evtchn = evtchn, + .cpu = 0, .u.virq = virq }; +} + +static struct irq_info mk_pirq_info(unsigned short evtchn, + unsigned short gsi, unsigned short vector) +{ + return (struct irq_info) { .type = IRQT_PIRQ, .evtchn = evtchn, + .cpu = 0, .u.pirq = { .gsi = gsi, .vector = vector } }; } /* * Accessors for packed IRQ information. */ -static inline unsigned int evtchn_from_irq(int irq) +static struct irq_info *info_for_irq(unsigned irq) +{ + return &irq_info[irq]; +} + +static unsigned int evtchn_from_irq(unsigned irq) +{ + return info_for_irq(irq)->evtchn; +} + +static enum ipi_vector ipi_from_irq(unsigned irq) +{ + struct irq_info *info = info_for_irq(irq); + + BUG_ON(info == NULL); + BUG_ON(info->type != IRQT_IPI); + + return info->u.ipi; +} + +static unsigned virq_from_irq(unsigned irq) { - return irq_info[irq].evtchn; + struct irq_info *info = info_for_irq(irq); + + BUG_ON(info == NULL); + BUG_ON(info->type != IRQT_VIRQ); + + return info->u.virq; +} + +static unsigned gsi_from_irq(unsigned irq) +{ + struct irq_info *info = info_for_irq(irq); + + BUG_ON(info == NULL); + BUG_ON(info->type != IRQT_PIRQ); + + return info->u.pirq.gsi; +} + +static unsigned vector_from_irq(unsigned irq) +{ + struct irq_info *info = info_for_irq(irq); + + BUG_ON(info == NULL); + BUG_ON(info->type != IRQT_PIRQ); + + return info->u.pirq.vector; } -static inline unsigned int index_from_irq(int irq) +static enum xen_irq_type type_from_irq(unsigned irq) { - return irq_info[irq].index; + return info_for_irq(irq)->type; } -static inline unsigned int type_from_irq(int irq) +static unsigned cpu_from_irq(unsigned irq) { - return irq_info[irq].type; + return info_for_irq(irq)->cpu; +} + +static unsigned int cpu_from_evtchn(unsigned int evtchn) +{ + int irq = evtchn_to_irq[evtchn]; + unsigned ret = 0; + + if (irq != -1) + ret = cpu_from_irq(irq); + + return ret; } static inline unsigned long active_evtchns(unsigned int cpu, @@ -115,7 +217,7 @@ static inline unsigned long active_evtch unsigned int idx) { return (sh->evtchn_pending[idx] & - cpu_evtchn_mask[cpu][idx] & + cpu_evtchn_mask(cpu)[idx] & ~sh->evtchn_mask[idx]); } @@ -125,13 +227,13 @@ static void bind_evtchn_to_cpu(unsigned BUG_ON(irq == -1); #ifdef CONFIG_SMP - irq_to_desc(irq)->affinity = cpumask_of_cpu(cpu); + cpumask_copy(irq_to_desc(irq)->affinity, cpumask_of(cpu)); #endif - __clear_bit(chn, cpu_evtchn_mask[cpu_evtchn[chn]]); - __set_bit(chn, cpu_evtchn_mask[cpu]); + __clear_bit(chn, cpu_evtchn_mask(cpu_from_irq(irq))); + __set_bit(chn, cpu_evtchn_mask(cpu)); - cpu_evtchn[chn] = cpu; + irq_info[irq].cpu = cpu; } static void init_evtchn_cpu_bindings(void) @@ -142,17 +244,11 @@ static void init_evtchn_cpu_bindings(voi /* By default all event channels notify CPU#0. */ for_each_irq_desc(i, desc) { - desc->affinity = cpumask_of_cpu(0); + cpumask_copy(desc->affinity, cpumask_of(0)); } #endif - memset(cpu_evtchn, 0, sizeof(cpu_evtchn)); - memset(cpu_evtchn_mask[0], ~0, sizeof(cpu_evtchn_mask[0])); -} - -static inline unsigned int cpu_from_evtchn(unsigned int evtchn) -{ - return cpu_evtchn[evtchn]; + memset(cpu_evtchn_mask(0), ~0, sizeof(cpu_evtchn_mask(0))); } static inline void clear_evtchn(int port) @@ -232,9 +328,8 @@ static int find_unbound_irq(void) int irq; struct irq_desc *desc; - /* Only allocate from dynirq range */ for (irq = 0; irq < nr_irqs; irq++) - if (irq_bindcount[irq] == 0) + if (irq_info[irq].type == IRQT_UNBOUND) break; if (irq == nr_irqs) @@ -244,6 +339,8 @@ static int find_unbound_irq(void) if (WARN_ON(desc == NULL)) return -1; + dynamic_irq_init(irq); + return irq; } @@ -258,16 +355,13 @@ int bind_evtchn_to_irq(unsigned int evtc if (irq == -1) { irq = find_unbound_irq(); - dynamic_irq_init(irq); set_irq_chip_and_handler_name(irq, &xen_dynamic_chip, handle_level_irq, "event"); evtchn_to_irq[evtchn] = irq; - irq_info[irq] = mk_irq_info(IRQT_EVTCHN, 0, evtchn); + irq_info[irq] = mk_evtchn_info(evtchn); } - irq_bindcount[irq]++; - spin_unlock(&irq_mapping_update_lock); return irq; @@ -282,12 +376,12 @@ static int bind_ipi_to_irq(unsigned int spin_lock(&irq_mapping_update_lock); irq = per_cpu(ipi_to_irq, cpu)[ipi]; + if (irq == -1) { irq = find_unbound_irq(); if (irq < 0) goto out; - dynamic_irq_init(irq); set_irq_chip_and_handler_name(irq, &xen_dynamic_chip, handle_level_irq, "ipi"); @@ -298,15 +392,12 @@ static int bind_ipi_to_irq(unsigned int evtchn = bind_ipi.port; evtchn_to_irq[evtchn] = irq; - irq_info[irq] = mk_irq_info(IRQT_IPI, ipi, evtchn); - + irq_info[irq] = mk_ipi_info(evtchn, ipi); per_cpu(ipi_to_irq, cpu)[ipi] = irq; bind_evtchn_to_cpu(evtchn, cpu); } - irq_bindcount[irq]++; - out: spin_unlock(&irq_mapping_update_lock); return irq; @@ -332,20 +423,17 @@ static int bind_virq_to_irq(unsigned int irq = find_unbound_irq(); - dynamic_irq_init(irq); set_irq_chip_and_handler_name(irq, &xen_dynamic_chip, handle_level_irq, "virq"); evtchn_to_irq[evtchn] = irq; - irq_info[irq] = mk_irq_info(IRQT_VIRQ, virq, evtchn); + irq_info[irq] = mk_virq_info(evtchn, virq); per_cpu(virq_to_irq, cpu)[virq] = irq; bind_evtchn_to_cpu(evtchn, cpu); } - irq_bindcount[irq]++; - spin_unlock(&irq_mapping_update_lock); return irq; @@ -358,7 +446,7 @@ static void unbind_from_irq(unsigned int spin_lock(&irq_mapping_update_lock); - if ((--irq_bindcount[irq] == 0) && VALID_EVTCHN(evtchn)) { + if (VALID_EVTCHN(evtchn)) { close.port = evtchn; if (HYPERVISOR_event_channel_op(EVTCHNOP_close, &close) != 0) BUG(); @@ -366,11 +454,11 @@ static void unbind_from_irq(unsigned int switch (type_from_irq(irq)) { case IRQT_VIRQ: per_cpu(virq_to_irq, cpu_from_evtchn(evtchn)) - [index_from_irq(irq)] = -1; + [virq_from_irq(irq)] = -1; break; case IRQT_IPI: per_cpu(ipi_to_irq, cpu_from_evtchn(evtchn)) - [index_from_irq(irq)] = -1; + [ipi_from_irq(irq)] = -1; break; default: break; @@ -380,7 +468,7 @@ static void unbind_from_irq(unsigned int bind_evtchn_to_cpu(evtchn, 0); evtchn_to_irq[evtchn] = -1; - irq_info[irq] = IRQ_UNBOUND; + irq_info[irq] = mk_unbound_info(); dynamic_irq_cleanup(irq); } @@ -498,8 +586,8 @@ irqreturn_t xen_debug_interrupt(int irq, for(i = 0; i < NR_EVENT_CHANNELS; i++) { if (sync_test_bit(i, sh->evtchn_pending)) { printk(" %d: event %d -> irq %d\n", - cpu_evtchn[i], i, - evtchn_to_irq[i]); + cpu_from_evtchn(i), i, + evtchn_to_irq[i]); } } @@ -508,7 +596,6 @@ irqreturn_t xen_debug_interrupt(int irq, return IRQ_HANDLED; } - /* * Search the CPUs pending events bitmasks. For each one found, map * the event number to an irq, and feed it into do_IRQ() for @@ -521,11 +608,15 @@ irqreturn_t xen_debug_interrupt(int irq, void xen_evtchn_do_upcall(struct pt_regs *regs) { int cpu = get_cpu(); + struct pt_regs *old_regs = set_irq_regs(regs); struct shared_info *s = HYPERVISOR_shared_info; struct vcpu_info *vcpu_info = __get_cpu_var(xen_vcpu); static DEFINE_PER_CPU(unsigned, nesting_count); unsigned count; + exit_idle(); + irq_enter(); + do { unsigned long pending_words; @@ -550,7 +641,7 @@ void xen_evtchn_do_upcall(struct pt_regs int irq = evtchn_to_irq[port]; if (irq != -1) - xen_do_IRQ(irq, regs); + handle_irq(irq, regs); } } @@ -561,12 +652,17 @@ void xen_evtchn_do_upcall(struct pt_regs } while(count != 1); out: + irq_exit(); + set_irq_regs(old_regs); + put_cpu(); } /* Rebind a new event channel to an existing irq. */ void rebind_evtchn_irq(int evtchn, int irq) { + struct irq_info *info = info_for_irq(irq); + /* Make sure the irq is masked, since the new event channel will also be masked. */ disable_irq(irq); @@ -576,11 +672,11 @@ void rebind_evtchn_irq(int evtchn, int i /* After resume the irq<->evtchn mappings are all cleared out */ BUG_ON(evtchn_to_irq[evtchn] != -1); /* Expect irq to have been bound before, - so the bindcount should be non-0 */ - BUG_ON(irq_bindcount[irq] == 0); + so there should be a proper type */ + BUG_ON(info->type == IRQT_UNBOUND); evtchn_to_irq[evtchn] = irq; - irq_info[irq] = mk_irq_info(IRQT_EVTCHN, 0, evtchn); + irq_info[irq] = mk_evtchn_info(evtchn); spin_unlock(&irq_mapping_update_lock); @@ -690,8 +786,7 @@ static void restore_cpu_virqs(unsigned i if ((irq = per_cpu(virq_to_irq, cpu)[virq]) == -1) continue; - BUG_ON(irq_info[irq].type != IRQT_VIRQ); - BUG_ON(irq_info[irq].index != virq); + BUG_ON(virq_from_irq(irq) != virq); /* Get a new binding from Xen. */ bind_virq.virq = virq; @@ -703,7 +798,7 @@ static void restore_cpu_virqs(unsigned i /* Record the new mapping. */ evtchn_to_irq[evtchn] = irq; - irq_info[irq] = mk_irq_info(IRQT_VIRQ, virq, evtchn); + irq_info[irq] = mk_virq_info(evtchn, virq); bind_evtchn_to_cpu(evtchn, cpu); /* Ready for use. */ @@ -720,8 +815,7 @@ static void restore_cpu_ipis(unsigned in if ((irq = per_cpu(ipi_to_irq, cpu)[ipi]) == -1) continue; - BUG_ON(irq_info[irq].type != IRQT_IPI); - BUG_ON(irq_info[irq].index != ipi); + BUG_ON(ipi_from_irq(irq) != ipi); /* Get a new binding from Xen. */ bind_ipi.vcpu = cpu; @@ -732,7 +826,7 @@ static void restore_cpu_ipis(unsigned in /* Record the new mapping. */ evtchn_to_irq[evtchn] = irq; - irq_info[irq] = mk_irq_info(IRQT_IPI, ipi, evtchn); + irq_info[irq] = mk_ipi_info(evtchn, ipi); bind_evtchn_to_cpu(evtchn, cpu); /* Ready for use. */ @@ -812,8 +906,11 @@ void xen_irq_resume(void) static struct irq_chip xen_dynamic_chip __read_mostly = { .name = "xen-dyn", + + .disable = disable_dynirq, .mask = disable_dynirq, .unmask = enable_dynirq, + .ack = ack_dynirq, .set_affinity = set_affinity_irq, .retrigger = retrigger_dynirq, @@ -822,6 +919,10 @@ static struct irq_chip xen_dynamic_chip void __init xen_init_IRQ(void) { int i; + size_t size = nr_cpu_ids * sizeof(struct cpu_evtchn_s); + + cpu_evtchn_mask_p = alloc_bootmem(size); + BUG_ON(cpu_evtchn_mask_p == NULL); init_evtchn_cpu_bindings(); @@ -829,9 +930,5 @@ void __init xen_init_IRQ(void) for (i = 0; i < NR_EVENT_CHANNELS; i++) mask_evtchn(i); - /* Dynamic IRQ space is currently unbound. Zero the refcnts. */ - for (i = 0; i < nr_irqs; i++) - irq_bindcount[i] = 0; - irq_ctx_init(smp_processor_id()); } Index: linux-2.6-tip/drivers/xen/manage.c =================================================================== --- linux-2.6-tip.orig/drivers/xen/manage.c +++ linux-2.6-tip/drivers/xen/manage.c @@ -108,7 +108,7 @@ static void do_suspend(void) /* XXX use normal device tree? */ xenbus_suspend(); - err = stop_machine(xen_suspend, &cancelled, &cpumask_of_cpu(0)); + err = stop_machine(xen_suspend, &cancelled, cpumask_of(0)); if (err) { printk(KERN_ERR "failed to start xen_suspend: %d\n", err); goto out; Index: linux-2.6-tip/fs/Kconfig =================================================================== --- linux-2.6-tip.orig/fs/Kconfig +++ linux-2.6-tip/fs/Kconfig @@ -40,7 +40,7 @@ config FS_POSIX_ACL default n config FILE_LOCKING - bool "Enable POSIX file locking API" if EMBEDDED + bool "Enable POSIX file locking API" if BROKEN default y help This option enables standard file locking support, required Index: linux-2.6-tip/fs/afs/dir.c =================================================================== --- linux-2.6-tip.orig/fs/afs/dir.c +++ linux-2.6-tip/fs/afs/dir.c @@ -564,7 +564,7 @@ static struct dentry *afs_lookup(struct static int afs_d_revalidate(struct dentry *dentry, struct nameidata *nd) { struct afs_vnode *vnode, *dir; - struct afs_fid fid; + struct afs_fid fid = { 0, }; struct dentry *parent; struct key *key; void *dir_version; Index: linux-2.6-tip/fs/befs/linuxvfs.c =================================================================== --- linux-2.6-tip.orig/fs/befs/linuxvfs.c +++ linux-2.6-tip/fs/befs/linuxvfs.c @@ -168,7 +168,7 @@ befs_lookup(struct inode *dir, struct de befs_off_t offset; int ret; int utfnamelen; - char *utfname; + char *uninitialized_var(utfname); const char *name = dentry->d_name.name; befs_debug(sb, "---> befs_lookup() " @@ -221,8 +221,8 @@ befs_readdir(struct file *filp, void *di size_t keysize; unsigned char d_type; char keybuf[BEFS_NAME_LEN + 1]; - char *nlsname; - int nlsnamelen; + char *uninitialized_var(nlsname); + int uninitialized_var(nlsnamelen); const char *dirname = filp->f_path.dentry->d_name.name; befs_debug(sb, "---> befs_readdir() " Index: linux-2.6-tip/fs/cifs/cifssmb.c =================================================================== --- linux-2.6-tip.orig/fs/cifs/cifssmb.c +++ linux-2.6-tip/fs/cifs/cifssmb.c @@ -3117,7 +3117,7 @@ CIFSSMBGetCIFSACL(const int xid, struct __u32 parm_len; __u32 acl_len; struct smb_com_ntransact_rsp *pSMBr; - char *pdata; + char *uninitialized_var(pdata); /* validate_nttransact */ rc = validate_ntransact(iov[0].iov_base, (char **)&parm, Index: linux-2.6-tip/fs/cifs/readdir.c =================================================================== --- linux-2.6-tip.orig/fs/cifs/readdir.c +++ linux-2.6-tip/fs/cifs/readdir.c @@ -906,7 +906,7 @@ static int cifs_filldir(char *pfindEntry __u64 inum; struct cifs_sb_info *cifs_sb; struct inode *tmp_inode; - struct dentry *tmp_dentry; + struct dentry *uninitialized_var(tmp_dentry); /* get filename and len into qstring */ /* get dentry */ @@ -990,7 +990,7 @@ int cifs_readdir(struct file *file, void struct cifs_sb_info *cifs_sb; struct cifsTconInfo *pTcon; struct cifsFileInfo *cifsFile = NULL; - char *current_entry; + char *uninitialized_var(current_entry); int num_to_fill = 0; char *tmp_buf = NULL; char *end_of_smb; Index: linux-2.6-tip/fs/coda/Makefile =================================================================== --- linux-2.6-tip.orig/fs/coda/Makefile +++ linux-2.6-tip/fs/coda/Makefile @@ -5,7 +5,9 @@ obj-$(CONFIG_CODA_FS) += coda.o coda-objs := psdev.o cache.o cnode.o inode.o dir.o file.o upcall.o \ - coda_linux.o symlink.o pioctl.o sysctl.o + coda_linux.o symlink.o pioctl.o + +coda-$(CONFIG_SYSCTL) += sysctl.o # If you want debugging output, please uncomment the following line. Index: linux-2.6-tip/fs/coda/coda_int.h =================================================================== --- linux-2.6-tip.orig/fs/coda/coda_int.h +++ linux-2.6-tip/fs/coda/coda_int.h @@ -12,8 +12,13 @@ void coda_destroy_inodecache(void); int coda_init_inodecache(void); int coda_fsync(struct file *coda_file, struct dentry *coda_dentry, int datasync); +#ifdef CONFIG_SYSCTL void coda_sysctl_init(void); void coda_sysctl_clean(void); +#else +static inline void coda_sysctl_init(void) { } +static inline void coda_sysctl_clean(void) { } +#endif #endif /* _CODA_INT_ */ Index: linux-2.6-tip/fs/coda/sysctl.c =================================================================== --- linux-2.6-tip.orig/fs/coda/sysctl.c +++ linux-2.6-tip/fs/coda/sysctl.c @@ -57,18 +57,14 @@ static ctl_table fs_table[] = { void coda_sysctl_init(void) { -#ifdef CONFIG_SYSCTL - if ( !fs_table_header ) + if (!fs_table_header) fs_table_header = register_sysctl_table(fs_table); -#endif } void coda_sysctl_clean(void) { -#ifdef CONFIG_SYSCTL - if ( fs_table_header ) { + if (fs_table_header) { unregister_sysctl_table(fs_table_header); fs_table_header = NULL; } -#endif } Index: linux-2.6-tip/fs/compat_binfmt_elf.c =================================================================== --- linux-2.6-tip.orig/fs/compat_binfmt_elf.c +++ linux-2.6-tip/fs/compat_binfmt_elf.c @@ -42,6 +42,7 @@ #define elf_prstatus compat_elf_prstatus #define elf_prpsinfo compat_elf_prpsinfo +#if defined(USE_ELF_CORE_DUMP) && defined(CONFIG_ELF_CORE) /* * Compat version of cputime_to_compat_timeval, perhaps this * should be an inline in . @@ -55,8 +56,9 @@ static void cputime_to_compat_timeval(co value->tv_usec = tv.tv_usec; } -#undef cputime_to_timeval -#define cputime_to_timeval cputime_to_compat_timeval +# undef cputime_to_timeval +# define cputime_to_timeval cputime_to_compat_timeval +#endif /* Index: linux-2.6-tip/fs/configfs/symlink.c =================================================================== --- linux-2.6-tip.orig/fs/configfs/symlink.c +++ linux-2.6-tip/fs/configfs/symlink.c @@ -135,7 +135,7 @@ int configfs_symlink(struct inode *dir, struct path path; struct configfs_dirent *sd; struct config_item *parent_item; - struct config_item *target_item; + struct config_item *uninitialized_var(target_item); struct config_item_type *type; ret = -EPERM; /* What lack-of-symlink returns */ Index: linux-2.6-tip/fs/dcache.c =================================================================== --- linux-2.6-tip.orig/fs/dcache.c +++ linux-2.6-tip/fs/dcache.c @@ -726,8 +726,9 @@ void shrink_dcache_for_umount(struct sup { struct dentry *dentry; - if (down_read_trylock(&sb->s_umount)) - BUG(); +// -rt: this might succeed there ... +// if (down_read_trylock(&sb->s_umount)) +// BUG(); dentry = sb->s_root; sb->s_root = NULL; @@ -1877,6 +1878,8 @@ out_nolock: shouldnt_be_hashed: spin_unlock(&dcache_lock); BUG(); + + return NULL; } static int prepend(char **buffer, int *buflen, const char *str, int namelen) Index: linux-2.6-tip/fs/ecryptfs/keystore.c =================================================================== --- linux-2.6-tip.orig/fs/ecryptfs/keystore.c +++ linux-2.6-tip/fs/ecryptfs/keystore.c @@ -1013,7 +1013,7 @@ decrypt_pki_encrypted_session_key(struct struct ecryptfs_message *msg = NULL; char *auth_tok_sig; char *payload; - size_t payload_len; + size_t uninitialized_var(payload_len); int rc; rc = ecryptfs_get_auth_tok_sig(&auth_tok_sig, auth_tok); @@ -1845,7 +1845,7 @@ pki_encrypt_session_key(struct ecryptfs_ { struct ecryptfs_msg_ctx *msg_ctx = NULL; char *payload = NULL; - size_t payload_len; + size_t uninitialized_var(payload_len); struct ecryptfs_message *msg; int rc; Index: linux-2.6-tip/fs/eventpoll.c =================================================================== --- linux-2.6-tip.orig/fs/eventpoll.c +++ linux-2.6-tip/fs/eventpoll.c @@ -1098,7 +1098,7 @@ retry: SYSCALL_DEFINE1(epoll_create1, int, flags) { int error, fd = -1; - struct eventpoll *ep; + struct eventpoll *uninitialized_var(ep); /* Check the EPOLL_* constant for consistency. */ BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC); Index: linux-2.6-tip/fs/exec.c =================================================================== --- linux-2.6-tip.orig/fs/exec.c +++ linux-2.6-tip/fs/exec.c @@ -33,6 +33,7 @@ #include #include #include +#include #include #include #include @@ -46,6 +47,7 @@ #include #include #include +#include #include #include #include @@ -738,10 +740,12 @@ static int exec_mmap(struct mm_struct *m } } task_lock(tsk); + local_irq_disable(); active_mm = tsk->active_mm; + activate_mm(active_mm, mm); tsk->mm = mm; tsk->active_mm = mm; - activate_mm(active_mm, mm); + local_irq_enable(); task_unlock(tsk); arch_pick_mmap_layout(mm); if (old_mm) { @@ -1010,6 +1014,13 @@ int flush_old_exec(struct linux_binprm * current->personality &= ~bprm->per_clear; + /* + * Flush performance counters when crossing a + * security domain: + */ + if (!get_dumpable(current->mm)) + perf_counter_exit_task(current); + /* An exec changes our domain. We are no longer part of the thread group */ Index: linux-2.6-tip/fs/ext4/extents.c =================================================================== --- linux-2.6-tip.orig/fs/ext4/extents.c +++ linux-2.6-tip/fs/ext4/extents.c @@ -1158,6 +1158,7 @@ ext4_ext_search_right(struct inode *inod return 0; } + ix = NULL; /* avoid gcc false positive warning */ /* go up and search for index to the right */ while (--depth >= 0) { ix = path[depth].p_idx; Index: linux-2.6-tip/fs/fat/namei_vfat.c =================================================================== --- linux-2.6-tip.orig/fs/fat/namei_vfat.c +++ linux-2.6-tip/fs/fat/namei_vfat.c @@ -595,12 +595,12 @@ static int vfat_build_slots(struct inode struct fat_mount_options *opts = &sbi->options; struct msdos_dir_slot *ps; struct msdos_dir_entry *de; - unsigned char cksum, lcase; + unsigned char cksum, uninitialized_var(lcase); unsigned char msdos_name[MSDOS_NAME]; wchar_t *uname; __le16 time, date; u8 time_cs; - int err, ulen, usize, i; + int err, uninitialized_var(ulen), usize, i; loff_t offset; *nr_slots = 0; Index: linux-2.6-tip/fs/jffs2/Kconfig =================================================================== --- linux-2.6-tip.orig/fs/jffs2/Kconfig +++ linux-2.6-tip/fs/jffs2/Kconfig @@ -2,6 +2,8 @@ config JFFS2_FS tristate "Journalling Flash File System v2 (JFFS2) support" select CRC32 depends on MTD + # build breakage + depends on 0 help JFFS2 is the second generation of the Journalling Flash File System for use on diskless embedded devices. It provides improved wear Index: linux-2.6-tip/fs/jfs/jfs_dmap.c =================================================================== --- linux-2.6-tip.orig/fs/jfs/jfs_dmap.c +++ linux-2.6-tip/fs/jfs/jfs_dmap.c @@ -1618,7 +1618,7 @@ static int dbAllocAny(struct bmap * bmp, */ static int dbFindCtl(struct bmap * bmp, int l2nb, int level, s64 * blkno) { - int rc, leafidx, lev; + int rc, uninitialized_var(leafidx), lev; s64 b, lblkno; struct dmapctl *dcp; int budmin; Index: linux-2.6-tip/fs/locks.c =================================================================== --- linux-2.6-tip.orig/fs/locks.c +++ linux-2.6-tip/fs/locks.c @@ -1567,7 +1567,7 @@ EXPORT_SYMBOL(flock_lock_file_wait); SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd) { struct file *filp; - struct file_lock *lock; + struct file_lock *uninitialized_var(lock); int can_sleep, unlock; int error; Index: linux-2.6-tip/fs/ocfs2/aops.c =================================================================== --- linux-2.6-tip.orig/fs/ocfs2/aops.c +++ linux-2.6-tip/fs/ocfs2/aops.c @@ -1643,7 +1643,7 @@ int ocfs2_write_begin_nolock(struct addr { int ret, credits = OCFS2_INODE_UPDATE_CREDITS; unsigned int clusters_to_alloc, extents_to_split; - struct ocfs2_write_ctxt *wc; + struct ocfs2_write_ctxt *uninitialized_var(wc); struct inode *inode = mapping->host; struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); struct ocfs2_dinode *di; Index: linux-2.6-tip/fs/ocfs2/cluster/heartbeat.c =================================================================== --- linux-2.6-tip.orig/fs/ocfs2/cluster/heartbeat.c +++ linux-2.6-tip/fs/ocfs2/cluster/heartbeat.c @@ -1026,8 +1026,8 @@ static ssize_t o2hb_region_block_bytes_w size_t count) { int status; - unsigned long block_bytes; - unsigned int block_bits; + unsigned long uninitialized_var(block_bytes); + unsigned int uninitialized_var(block_bits); if (reg->hr_bdev) return -EINVAL; Index: linux-2.6-tip/fs/ocfs2/ioctl.c =================================================================== --- linux-2.6-tip.orig/fs/ocfs2/ioctl.c +++ linux-2.6-tip/fs/ocfs2/ioctl.c @@ -111,7 +111,7 @@ bail: long ocfs2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { struct inode *inode = filp->f_path.dentry->d_inode; - unsigned int flags; + unsigned int uninitialized_var(flags); int new_clusters; int status; struct ocfs2_space_resv sr; Index: linux-2.6-tip/fs/ocfs2/slot_map.c =================================================================== --- linux-2.6-tip.orig/fs/ocfs2/slot_map.c +++ linux-2.6-tip/fs/ocfs2/slot_map.c @@ -357,7 +357,7 @@ static int ocfs2_map_slot_buffers(struct { int status = 0; u64 blkno; - unsigned long long blocks, bytes; + unsigned long long blocks, uninitialized_var(bytes); unsigned int i; struct buffer_head *bh; Index: linux-2.6-tip/fs/ocfs2/stack_user.c =================================================================== --- linux-2.6-tip.orig/fs/ocfs2/stack_user.c +++ linux-2.6-tip/fs/ocfs2/stack_user.c @@ -807,7 +807,7 @@ static int fs_protocol_compare(struct oc static int user_cluster_connect(struct ocfs2_cluster_connection *conn) { dlm_lockspace_t *fsdlm; - struct ocfs2_live_connection *control; + struct ocfs2_live_connection *uninitialized_var(control); int rc = 0; BUG_ON(conn == NULL); Index: linux-2.6-tip/fs/omfs/file.c =================================================================== --- linux-2.6-tip.orig/fs/omfs/file.c +++ linux-2.6-tip/fs/omfs/file.c @@ -237,14 +237,14 @@ static int omfs_get_block(struct inode * struct buffer_head *bh; sector_t next, offset; int ret; - u64 new_block; + u64 uninitialized_var(new_block); u32 max_extents; int extent_count; struct omfs_extent *oe; struct omfs_extent_entry *entry; struct omfs_sb_info *sbi = OMFS_SB(inode->i_sb); int max_blocks = bh_result->b_size >> inode->i_blkbits; - int remain; + int uninitialized_var(remain); ret = -EIO; bh = sb_bread(inode->i_sb, clus_to_blk(sbi, inode->i_ino)); Index: linux-2.6-tip/fs/partitions/check.c =================================================================== --- linux-2.6-tip.orig/fs/partitions/check.c +++ linux-2.6-tip/fs/partitions/check.c @@ -19,6 +19,7 @@ #include #include #include +#include #include "check.h" @@ -294,6 +295,9 @@ static struct attribute_group part_attr_ static struct attribute_group *part_attr_groups[] = { &part_attr_group, +#ifdef CONFIG_BLK_DEV_IO_TRACE + &blk_trace_attr_group, +#endif NULL }; Index: linux-2.6-tip/fs/reiserfs/do_balan.c =================================================================== --- linux-2.6-tip.orig/fs/reiserfs/do_balan.c +++ linux-2.6-tip/fs/reiserfs/do_balan.c @@ -1295,9 +1295,8 @@ static int balance_leaf(struct tree_bala RFALSE(ih, "PAP-12210: ih must be 0"); - if (is_direntry_le_ih - (aux_ih = - B_N_PITEM_HEAD(tbS0, item_pos))) { + aux_ih = B_N_PITEM_HEAD(tbS0, item_pos); + if (is_direntry_le_ih(aux_ih)) { /* we append to directory item */ int entry_count; Index: linux-2.6-tip/fs/reiserfs/lbalance.c =================================================================== --- linux-2.6-tip.orig/fs/reiserfs/lbalance.c +++ linux-2.6-tip/fs/reiserfs/lbalance.c @@ -389,7 +389,8 @@ static void leaf_item_bottle(struct buff if (last_first == FIRST_TO_LAST) { /* if ( if item in position item_num in buffer SOURCE is directory item ) */ - if (is_direntry_le_ih(ih = B_N_PITEM_HEAD(src, item_num))) + ih = B_N_PITEM_HEAD(src, item_num); + if (is_direntry_le_ih(ih)) leaf_copy_dir_entries(dest_bi, src, FIRST_TO_LAST, item_num, 0, cpy_bytes); else { @@ -417,7 +418,8 @@ static void leaf_item_bottle(struct buff } } else { /* if ( if item in position item_num in buffer SOURCE is directory item ) */ - if (is_direntry_le_ih(ih = B_N_PITEM_HEAD(src, item_num))) + ih = B_N_PITEM_HEAD(src, item_num); + if (is_direntry_le_ih(ih)) leaf_copy_dir_entries(dest_bi, src, LAST_TO_FIRST, item_num, I_ENTRY_COUNT(ih) - cpy_bytes, @@ -774,8 +776,8 @@ void leaf_delete_items(struct buffer_inf leaf_delete_items_entirely(cur_bi, first + 1, del_num - 1); - if (is_direntry_le_ih - (ih = B_N_PITEM_HEAD(bh, B_NR_ITEMS(bh) - 1))) + ih = B_N_PITEM_HEAD(bh, B_NR_ITEMS(bh) - 1); + if (is_direntry_le_ih(ih)) /* the last item is directory */ /* len = numbers of directory entries in this item */ len = ih_entry_count(ih); Index: linux-2.6-tip/fs/udf/truncate.c =================================================================== --- linux-2.6-tip.orig/fs/udf/truncate.c +++ linux-2.6-tip/fs/udf/truncate.c @@ -87,7 +87,7 @@ void udf_truncate_tail_extent(struct ino else if (iinfo->i_alloc_type == ICBTAG_FLAG_AD_LONG) adsize = sizeof(long_ad); else - BUG(); + panic("udf_truncate_tail_extent: unknown alloc type!"); /* Find the last extent in the file */ while ((netype = udf_next_aext(inode, &epos, &eloc, &elen, 1)) != -1) { @@ -214,7 +214,7 @@ void udf_truncate_extents(struct inode * else if (iinfo->i_alloc_type == ICBTAG_FLAG_AD_LONG) adsize = sizeof(long_ad); else - BUG(); + panic("udf_truncate_extents: unknown alloc type!"); etype = inode_bmap(inode, first_block, &epos, &eloc, &elen, &offset); byte_offset = (offset << sb->s_blocksize_bits) + Index: linux-2.6-tip/fs/xfs/linux-2.6/xfs_xattr.c =================================================================== --- linux-2.6-tip.orig/fs/xfs/linux-2.6/xfs_xattr.c +++ linux-2.6-tip/fs/xfs/linux-2.6/xfs_xattr.c @@ -30,20 +30,6 @@ /* - * ACL handling. Should eventually be moved into xfs_acl.c - */ - -static int -xfs_decode_acl(const char *name) -{ - if (strcmp(name, "posix_acl_access") == 0) - return _ACL_TYPE_ACCESS; - else if (strcmp(name, "posix_acl_default") == 0) - return _ACL_TYPE_DEFAULT; - return -EINVAL; -} - -/* * Get system extended attributes which at the moment only * includes Posix ACLs. */ Index: linux-2.6-tip/fs/xfs/xfs_acl.c =================================================================== --- linux-2.6-tip.orig/fs/xfs/xfs_acl.c +++ linux-2.6-tip/fs/xfs/xfs_acl.c @@ -51,6 +51,19 @@ kmem_zone_t *xfs_acl_zone; /* + * ACL handling. + */ +int +xfs_decode_acl(const char *name) +{ + if (strcmp(name, "posix_acl_access") == 0) + return _ACL_TYPE_ACCESS; + else if (strcmp(name, "posix_acl_default") == 0) + return _ACL_TYPE_DEFAULT; + return -EINVAL; +} + +/* * Test for existence of access ACL attribute as efficiently as possible. */ int Index: linux-2.6-tip/fs/xfs/xfs_acl.h =================================================================== --- linux-2.6-tip.orig/fs/xfs/xfs_acl.h +++ linux-2.6-tip/fs/xfs/xfs_acl.h @@ -58,6 +58,7 @@ extern struct kmem_zone *xfs_acl_zone; (zone) = kmem_zone_init(sizeof(xfs_acl_t), (name)) #define xfs_acl_zone_destroy(zone) kmem_zone_destroy(zone) +extern int xfs_decode_acl(const char *); extern int xfs_acl_inherit(struct inode *, mode_t mode, xfs_acl_t *); extern int xfs_acl_iaccess(struct xfs_inode *, mode_t, cred_t *); extern int xfs_acl_vtoacl(struct inode *, xfs_acl_t *, xfs_acl_t *); @@ -79,6 +80,7 @@ extern int xfs_acl_vremove(struct inode #define _ACL_FREE(a) ((a)? kmem_zone_free(xfs_acl_zone, (a)):(void)0) #else +#define xfs_decode_acl(name) (-EINVAL) #define xfs_acl_zone_init(zone,name) #define xfs_acl_zone_destroy(zone) #define xfs_acl_vset(v,p,sz,t) (-EOPNOTSUPP) Index: linux-2.6-tip/fs/xfs/xfs_mount.c =================================================================== --- linux-2.6-tip.orig/fs/xfs/xfs_mount.c +++ linux-2.6-tip/fs/xfs/xfs_mount.c @@ -1424,6 +1424,8 @@ xfs_mod_sb(xfs_trans_t *tp, __int64_t fi /* find modified range */ f = (xfs_sb_field_t)xfs_lowbit64((__uint64_t)fields); + if ((long)f < 0) /* work around gcc warning */ + return; ASSERT((1LL << f) & XFS_SB_MOD_BITS); first = xfs_sb_info[f].offset; Index: linux-2.6-tip/include/acpi/acpiosxf.h =================================================================== --- linux-2.6-tip.orig/include/acpi/acpiosxf.h +++ linux-2.6-tip/include/acpi/acpiosxf.h @@ -61,7 +61,7 @@ typedef enum { OSL_EC_BURST_HANDLER } acpi_execute_type; -#define ACPI_NO_UNIT_LIMIT ((u32) -1) +#define ACPI_NO_UNIT_LIMIT (INT_MAX/2) #define ACPI_MUTEX_SEM 1 /* Functions for acpi_os_signal */ @@ -144,6 +144,7 @@ void __iomem *acpi_os_map_memory(acpi_ph acpi_size length); void acpi_os_unmap_memory(void __iomem * logical_address, acpi_size size); +void early_acpi_os_unmap_memory(void __iomem * virt, acpi_size size); #ifdef ACPI_FUTURE_USAGE acpi_status Index: linux-2.6-tip/include/acpi/acpixf.h =================================================================== --- linux-2.6-tip.orig/include/acpi/acpixf.h +++ linux-2.6-tip/include/acpi/acpixf.h @@ -130,6 +130,10 @@ acpi_get_table_header(acpi_string signat struct acpi_table_header *out_table_header); acpi_status +acpi_get_table_with_size(acpi_string signature, + u32 instance, struct acpi_table_header **out_table, + acpi_size *tbl_size); +acpi_status acpi_get_table(acpi_string signature, u32 instance, struct acpi_table_header **out_table); Index: linux-2.6-tip/include/asm-frv/swab.h =================================================================== --- linux-2.6-tip.orig/include/asm-frv/swab.h +++ linux-2.6-tip/include/asm-frv/swab.h @@ -1,7 +1,7 @@ #ifndef _ASM_SWAB_H #define _ASM_SWAB_H -#include +#include #if defined(__GNUC__) && !defined(__STRICT_ANSI__) || defined(__KERNEL__) # define __SWAB_64_THRU_32__ Index: linux-2.6-tip/include/asm-generic/bug.h =================================================================== --- linux-2.6-tip.orig/include/asm-generic/bug.h +++ linux-2.6-tip/include/asm-generic/bug.h @@ -3,6 +3,10 @@ #include +#ifndef __ASSEMBLY__ +extern void __WARN_ON(const char *func, const char *file, const int line); +#endif /* __ASSEMBLY__ */ + #ifdef CONFIG_BUG #ifdef CONFIG_GENERIC_BUG @@ -103,10 +107,9 @@ extern void warn_slowpath(const char *fi #endif #ifndef WARN -#define WARN(condition, format...) ({ \ - int __ret_warn_on = !!(condition); \ - unlikely(__ret_warn_on); \ -}) +static inline int __attribute__ ((format(printf, 2, 3))) +__WARN(int condition, const char *fmt, ...) { return condition; } +#define WARN(condition, format...) __WARN(!!(condition), format) #endif #endif @@ -140,4 +143,16 @@ extern void warn_slowpath(const char *fi # define WARN_ON_SMP(x) do { } while (0) #endif +#ifdef CONFIG_PREEMPT_RT +# define BUG_ON_RT(c) BUG_ON(c) +# define BUG_ON_NONRT(c) do { } while (0) +# define WARN_ON_RT(condition) WARN_ON(condition) +# define WARN_ON_NONRT(condition) do { } while (0) +#else +# define BUG_ON_RT(c) do { } while (0) +# define BUG_ON_NONRT(c) BUG_ON(c) +# define WARN_ON_RT(condition) do { } while (0) +# define WARN_ON_NONRT(condition) WARN_ON(condition) +#endif + #endif Index: linux-2.6-tip/include/asm-generic/percpu.h =================================================================== --- linux-2.6-tip.orig/include/asm-generic/percpu.h +++ linux-2.6-tip/include/asm-generic/percpu.h @@ -9,6 +9,9 @@ */ #define per_cpu_var(var) per_cpu__##var +#define __per_cpu_var_lock(var) per_cpu__lock_##var##_locked +#define __per_cpu_var_lock_var(var) per_cpu__##var##_locked + #ifdef CONFIG_SMP /* @@ -60,6 +63,14 @@ extern unsigned long __per_cpu_offset[NR #define __raw_get_cpu_var(var) \ (*SHIFT_PERCPU_PTR(&per_cpu_var(var), __my_cpu_offset)) +#define per_cpu_lock(var, cpu) \ + (*SHIFT_PERCPU_PTR(&__per_cpu_var_lock(var), per_cpu_offset(cpu))) +#define per_cpu_var_locked(var, cpu) \ + (*SHIFT_PERCPU_PTR(&__per_cpu_var_lock_var(var), per_cpu_offset(cpu))) +#define __get_cpu_lock(var, cpu) \ + per_cpu_lock(var, cpu) +#define __get_cpu_var_locked(var, cpu) \ + per_cpu_var_locked(var, cpu) #ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA extern void setup_per_cpu_areas(void); @@ -68,9 +79,11 @@ extern void setup_per_cpu_areas(void); #else /* ! SMP */ #define per_cpu(var, cpu) (*((void)(cpu), &per_cpu_var(var))) +#define per_cpu_var_locked(var, cpu) (*((void)(cpu), &__per_cpu_var_lock_var(var))) #define __get_cpu_var(var) per_cpu_var(var) #define __raw_get_cpu_var(var) per_cpu_var(var) - +#define __get_cpu_lock(var, cpu) __per_cpu_var_lock(var) +#define __get_cpu_var_locked(var, cpu) __per_cpu_var_lock_var(var) #endif /* SMP */ #ifndef PER_CPU_ATTRIBUTES @@ -79,5 +92,60 @@ extern void setup_per_cpu_areas(void); #define DECLARE_PER_CPU(type, name) extern PER_CPU_ATTRIBUTES \ __typeof__(type) per_cpu_var(name) +#define DECLARE_PER_CPU_LOCKED(type, name) \ + extern PER_CPU_ATTRIBUTES spinlock_t __per_cpu_var_lock(name); \ + extern PER_CPU_ATTRIBUTES __typeof__(type) __per_cpu_var_lock_var(name) + +/* + * Optional methods for optimized non-lvalue per-cpu variable access. + * + * @var can be a percpu variable or a field of it and its size should + * equal char, int or long. percpu_read() evaluates to a lvalue and + * all others to void. + * + * These operations are guaranteed to be atomic w.r.t. preemption. + * The generic versions use plain get/put_cpu_var(). Archs are + * encouraged to implement single-instruction alternatives which don't + * require preemption protection. + */ +#ifndef percpu_read +# define percpu_read(var) \ + ({ \ + typeof(per_cpu_var(var)) __tmp_var__; \ + __tmp_var__ = get_cpu_var(var); \ + put_cpu_var(var); \ + __tmp_var__; \ + }) +#endif + +#define __percpu_generic_to_op(var, val, op) \ +do { \ + get_cpu_var(var) op val; \ + put_cpu_var(var); \ +} while (0) + +#ifndef percpu_write +# define percpu_write(var, val) __percpu_generic_to_op(var, (val), =) +#endif + +#ifndef percpu_add +# define percpu_add(var, val) __percpu_generic_to_op(var, (val), +=) +#endif + +#ifndef percpu_sub +# define percpu_sub(var, val) __percpu_generic_to_op(var, (val), -=) +#endif + +#ifndef percpu_and +# define percpu_and(var, val) __percpu_generic_to_op(var, (val), &=) +#endif + +#ifndef percpu_or +# define percpu_or(var, val) __percpu_generic_to_op(var, (val), |=) +#endif + +#ifndef percpu_xor +# define percpu_xor(var, val) __percpu_generic_to_op(var, (val), ^=) +#endif #endif /* _ASM_GENERIC_PERCPU_H_ */ Index: linux-2.6-tip/include/asm-generic/sections.h =================================================================== --- linux-2.6-tip.orig/include/asm-generic/sections.h +++ linux-2.6-tip/include/asm-generic/sections.h @@ -9,7 +9,7 @@ extern char __bss_start[], __bss_stop[]; extern char __init_begin[], __init_end[]; extern char _sinittext[], _einittext[]; extern char _end[]; -extern char __per_cpu_start[], __per_cpu_end[]; +extern char __per_cpu_load[], __per_cpu_start[], __per_cpu_end[]; extern char __kprobes_text_start[], __kprobes_text_end[]; extern char __initdata_begin[], __initdata_end[]; extern char __start_rodata[], __end_rodata[]; Index: linux-2.6-tip/include/asm-generic/vmlinux.lds.h =================================================================== --- linux-2.6-tip.orig/include/asm-generic/vmlinux.lds.h +++ linux-2.6-tip/include/asm-generic/vmlinux.lds.h @@ -430,12 +430,59 @@ *(.initcall7.init) \ *(.initcall7s.init) +/** + * PERCPU_VADDR - define output section for percpu area + * @vaddr: explicit base address (optional) + * @phdr: destination PHDR (optional) + * + * Macro which expands to output section for percpu area. If @vaddr + * is not blank, it specifies explicit base address and all percpu + * symbols will be offset from the given address. If blank, @vaddr + * always equals @laddr + LOAD_OFFSET. + * + * @phdr defines the output PHDR to use if not blank. Be warned that + * output PHDR is sticky. If @phdr is specified, the next output + * section in the linker script will go there too. @phdr should have + * a leading colon. + * + * Note that this macros defines __per_cpu_load as an absolute symbol. + * If there is no need to put the percpu section at a predetermined + * address, use PERCPU(). + */ +#define PERCPU_VADDR(vaddr, phdr) \ + VMLINUX_SYMBOL(__per_cpu_load) = .; \ + .data.percpu vaddr : AT(VMLINUX_SYMBOL(__per_cpu_load) \ + - LOAD_OFFSET) { \ + VMLINUX_SYMBOL(__per_cpu_start) = .; \ + *(.data.percpu.first) \ + *(.data.percpu.page_aligned) \ + *(.data.percpu) \ + *(.data.percpu.shared_aligned) \ + VMLINUX_SYMBOL(__per_cpu_end) = .; \ + } phdr \ + . = VMLINUX_SYMBOL(__per_cpu_load) + SIZEOF(.data.percpu); + +/** + * PERCPU - define output section for percpu area, simple version + * @align: required alignment + * + * Align to @align and outputs output section for percpu area. This + * macro doesn't maniuplate @vaddr or @phdr and __per_cpu_load and + * __per_cpu_start will be identical. + * + * This macro is equivalent to ALIGN(align); PERCPU_VADDR( , ) except + * that __per_cpu_load is defined as a relative symbol against + * .data.percpu which is required for relocatable x86_32 + * configuration. + */ #define PERCPU(align) \ . = ALIGN(align); \ - VMLINUX_SYMBOL(__per_cpu_start) = .; \ - .data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) { \ + .data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) { \ + VMLINUX_SYMBOL(__per_cpu_load) = .; \ + VMLINUX_SYMBOL(__per_cpu_start) = .; \ + *(.data.percpu.first) \ *(.data.percpu.page_aligned) \ *(.data.percpu) \ *(.data.percpu.shared_aligned) \ - } \ - VMLINUX_SYMBOL(__per_cpu_end) = .; + VMLINUX_SYMBOL(__per_cpu_end) = .; \ + } Index: linux-2.6-tip/include/asm-m32r/swab.h =================================================================== --- linux-2.6-tip.orig/include/asm-m32r/swab.h +++ linux-2.6-tip/include/asm-m32r/swab.h @@ -1,7 +1,7 @@ #ifndef _ASM_M32R_SWAB_H #define _ASM_M32R_SWAB_H -#include +#include #if !defined(__STRICT_ANSI__) || defined(__KERNEL__) # define __SWAB_64_THRU_32__ Index: linux-2.6-tip/include/asm-mn10300/swab.h =================================================================== --- linux-2.6-tip.orig/include/asm-mn10300/swab.h +++ linux-2.6-tip/include/asm-mn10300/swab.h @@ -11,7 +11,7 @@ #ifndef _ASM_SWAB_H #define _ASM_SWAB_H -#include +#include #ifdef __GNUC__ Index: linux-2.6-tip/include/linux/acpi.h =================================================================== --- linux-2.6-tip.orig/include/linux/acpi.h +++ linux-2.6-tip/include/linux/acpi.h @@ -79,6 +79,7 @@ typedef int (*acpi_table_handler) (struc typedef int (*acpi_table_entry_handler) (struct acpi_subtable_header *header, const unsigned long end); char * __acpi_map_table (unsigned long phys_addr, unsigned long size); +void __acpi_unmap_table(char *map, unsigned long size); int early_acpi_boot_init(void); int acpi_boot_init (void); int acpi_boot_table_init (void); Index: linux-2.6-tip/include/linux/audit.h =================================================================== --- linux-2.6-tip.orig/include/linux/audit.h +++ linux-2.6-tip/include/linux/audit.h @@ -606,7 +606,8 @@ extern int audit_enabled; #define audit_log(c,g,t,f,...) do { ; } while (0) #define audit_log_start(c,g,t) ({ NULL; }) #define audit_log_vformat(b,f,a) do { ; } while (0) -#define audit_log_format(b,f,...) do { ; } while (0) +static inline void __attribute__ ((format(printf, 2, 3))) +audit_log_format(struct audit_buffer *ab, const char *fmt, ...) { } #define audit_log_end(b) do { ; } while (0) #define audit_log_n_hex(a,b,l) do { ; } while (0) #define audit_log_n_string(a,c,l) do { ; } while (0) Index: linux-2.6-tip/include/linux/blktrace_api.h =================================================================== --- linux-2.6-tip.orig/include/linux/blktrace_api.h +++ linux-2.6-tip/include/linux/blktrace_api.h @@ -144,6 +144,9 @@ struct blk_user_trace_setup { #ifdef __KERNEL__ #if defined(CONFIG_BLK_DEV_IO_TRACE) + +#include + struct blk_trace { int trace_state; struct rchan *rchan; @@ -194,6 +197,8 @@ extern int blk_trace_setup(struct reques extern int blk_trace_startstop(struct request_queue *q, int start); extern int blk_trace_remove(struct request_queue *q); +extern struct attribute_group blk_trace_attr_group; + #else /* !CONFIG_BLK_DEV_IO_TRACE */ #define blk_trace_ioctl(bdev, cmd, arg) (-ENOTTY) #define blk_trace_shutdown(q) do { } while (0) Index: linux-2.6-tip/include/linux/coda_linux.h =================================================================== --- linux-2.6-tip.orig/include/linux/coda_linux.h +++ linux-2.6-tip/include/linux/coda_linux.h @@ -51,10 +51,6 @@ void coda_vattr_to_iattr(struct inode *, void coda_iattr_to_vattr(struct iattr *, struct coda_vattr *); unsigned short coda_flags_to_cflags(unsigned short); -/* sysctl.h */ -void coda_sysctl_init(void); -void coda_sysctl_clean(void); - #define CODA_ALLOC(ptr, cast, size) do { \ if (size < PAGE_SIZE) \ ptr = kmalloc((unsigned long) size, GFP_KERNEL); \ Index: linux-2.6-tip/include/linux/coda_psdev.h =================================================================== --- linux-2.6-tip.orig/include/linux/coda_psdev.h +++ linux-2.6-tip/include/linux/coda_psdev.h @@ -6,6 +6,7 @@ #define CODA_PSDEV_MAJOR 67 #define MAX_CODADEVS 5 /* how many do we allow */ +#ifdef __KERNEL__ struct kstatfs; /* communication pending/processing queues */ @@ -24,7 +25,6 @@ static inline struct venus_comm *coda_vc return (struct venus_comm *)((sb)->s_fs_info); } - /* upcalls */ int venus_rootfid(struct super_block *sb, struct CodaFid *fidp); int venus_getattr(struct super_block *sb, struct CodaFid *fid, @@ -64,6 +64,12 @@ int coda_downcall(int opcode, union outp int venus_fsync(struct super_block *sb, struct CodaFid *fid); int venus_statfs(struct dentry *dentry, struct kstatfs *sfs); +/* + * Statistics + */ + +extern struct venus_comm coda_comms[]; +#endif /* __KERNEL__ */ /* messages between coda filesystem in kernel and Venus */ struct upc_req { @@ -82,11 +88,4 @@ struct upc_req { #define REQ_WRITE 0x4 #define REQ_ABORT 0x8 - -/* - * Statistics - */ - -extern struct venus_comm coda_comms[]; - #endif Index: linux-2.6-tip/include/linux/dma-mapping.h =================================================================== --- linux-2.6-tip.orig/include/linux/dma-mapping.h +++ linux-2.6-tip/include/linux/dma-mapping.h @@ -3,6 +3,8 @@ #include #include +#include +#include /* These definitions mirror those in pci.h, so they can be used * interchangeably with their PCI_ counterparts */ @@ -13,6 +15,52 @@ enum dma_data_direction { DMA_NONE = 3, }; +struct dma_map_ops { + void* (*alloc_coherent)(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t gfp); + void (*free_coherent)(struct device *dev, size_t size, + void *vaddr, dma_addr_t dma_handle); + dma_addr_t (*map_page)(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction dir, + struct dma_attrs *attrs); + void (*unmap_page)(struct device *dev, dma_addr_t dma_handle, + size_t size, enum dma_data_direction dir, + struct dma_attrs *attrs); + int (*map_sg)(struct device *dev, struct scatterlist *sg, + int nents, enum dma_data_direction dir, + struct dma_attrs *attrs); + void (*unmap_sg)(struct device *dev, + struct scatterlist *sg, int nents, + enum dma_data_direction dir, + struct dma_attrs *attrs); + void (*sync_single_for_cpu)(struct device *dev, + dma_addr_t dma_handle, size_t size, + enum dma_data_direction dir); + void (*sync_single_for_device)(struct device *dev, + dma_addr_t dma_handle, size_t size, + enum dma_data_direction dir); + void (*sync_single_range_for_cpu)(struct device *dev, + dma_addr_t dma_handle, + unsigned long offset, + size_t size, + enum dma_data_direction dir); + void (*sync_single_range_for_device)(struct device *dev, + dma_addr_t dma_handle, + unsigned long offset, + size_t size, + enum dma_data_direction dir); + void (*sync_sg_for_cpu)(struct device *dev, + struct scatterlist *sg, int nents, + enum dma_data_direction dir); + void (*sync_sg_for_device)(struct device *dev, + struct scatterlist *sg, int nents, + enum dma_data_direction dir); + int (*mapping_error)(struct device *dev, dma_addr_t dma_addr); + int (*dma_supported)(struct device *dev, u64 mask); + int is_phys; +}; + #define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1)) /* Index: linux-2.6-tip/include/linux/elfcore.h =================================================================== --- linux-2.6-tip.orig/include/linux/elfcore.h +++ linux-2.6-tip/include/linux/elfcore.h @@ -111,6 +111,15 @@ static inline void elf_core_copy_regs(el #endif } +static inline void elf_core_copy_kernel_regs(elf_gregset_t *elfregs, struct pt_regs *regs) +{ +#ifdef ELF_CORE_COPY_KERNEL_REGS + ELF_CORE_COPY_KERNEL_REGS((*elfregs), regs); +#else + elf_core_copy_regs(elfregs, regs); +#endif +} + static inline int elf_core_copy_task_regs(struct task_struct *t, elf_gregset_t* elfregs) { #ifdef ELF_CORE_COPY_TASK_REGS Index: linux-2.6-tip/include/linux/fs.h =================================================================== --- linux-2.6-tip.orig/include/linux/fs.h +++ linux-2.6-tip/include/linux/fs.h @@ -671,7 +671,7 @@ struct inode { umode_t i_mode; spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ struct mutex i_mutex; - struct rw_semaphore i_alloc_sem; + struct compat_rw_semaphore i_alloc_sem; const struct inode_operations *i_op; const struct file_operations *i_fop; /* former ->i_op->default_file_ops */ struct super_block *i_sb; @@ -1081,13 +1081,25 @@ extern int lock_may_write(struct inode * #define posix_lock_file_wait(a, b) ({ -ENOLCK; }) #define posix_unblock_lock(a, b) (-ENOENT) #define vfs_test_lock(a, b) ({ 0; }) -#define vfs_lock_file(a, b, c, d) (-ENOLCK) +static inline int +vfs_lock_file(struct file *filp, unsigned int cmd, + struct file_lock *fl, struct file_lock *conf) +{ + return -ENOLCK; +} #define vfs_cancel_lock(a, b) ({ 0; }) #define flock_lock_file_wait(a, b) ({ -ENOLCK; }) #define __break_lease(a, b) ({ 0; }) -#define lease_get_mtime(a, b) ({ }) +static inline void lease_get_mtime(struct inode *inode, struct timespec *time) +{ + *time = (struct timespec) { 0, }; +} #define generic_setlease(a, b, c) ({ -EINVAL; }) -#define vfs_setlease(a, b, c) ({ -EINVAL; }) +static inline int +vfs_setlease(struct file *filp, long arg, struct file_lock **lease) +{ + return -EINVAL; +} #define lease_modify(a, b) ({ -EINVAL; }) #define lock_may_read(a, b, c) ({ 1; }) #define lock_may_write(a, b, c) ({ 1; }) @@ -1611,9 +1623,9 @@ int __put_super_and_need_restart(struct /* Alas, no aliases. Too much hassle with bringing module.h everywhere */ #define fops_get(fops) \ - (((fops) && try_module_get((fops)->owner) ? (fops) : NULL)) + (((fops != NULL) && try_module_get((fops)->owner) ? (fops) : NULL)) #define fops_put(fops) \ - do { if (fops) module_put((fops)->owner); } while(0) + do { if (fops != NULL) module_put((fops)->owner); } while(0) extern int register_filesystem(struct file_system_type *); extern int unregister_filesystem(struct file_system_type *); @@ -1689,7 +1701,7 @@ static inline int break_lease(struct ino #else /* !CONFIG_FILE_LOCKING */ #define locks_mandatory_locked(a) ({ 0; }) #define locks_mandatory_area(a, b, c, d, e) ({ 0; }) -#define __mandatory_lock(a) ({ 0; }) +static inline int __mandatory_lock(struct inode *ino) { return 0; } #define mandatory_lock(a) ({ 0; }) #define locks_verify_locked(a) ({ 0; }) #define locks_verify_truncate(a, b, c) ({ 0; }) Index: linux-2.6-tip/include/linux/ftrace.h =================================================================== --- linux-2.6-tip.orig/include/linux/ftrace.h +++ linux-2.6-tip/include/linux/ftrace.h @@ -95,10 +95,44 @@ stack_trace_sysctl(struct ctl_table *tab loff_t *ppos); #endif +struct ftrace_func_command { + struct list_head list; + char *name; + int (*func)(char *func, char *cmd, + char *params, int enable); +}; + #ifdef CONFIG_DYNAMIC_FTRACE /* asm/ftrace.h must be defined for archs supporting dynamic ftrace */ #include +int ftrace_arch_code_modify_prepare(void); +int ftrace_arch_code_modify_post_process(void); + +struct seq_file; + +struct ftrace_probe_ops { + void (*func)(unsigned long ip, + unsigned long parent_ip, + void **data); + int (*callback)(unsigned long ip, void **data); + void (*free)(void **data); + int (*print)(struct seq_file *m, + unsigned long ip, + struct ftrace_probe_ops *ops, + void *data); +}; + +extern int +register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops, + void *data); +extern void +unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops, + void *data); +extern void +unregister_ftrace_function_probe_func(char *glob, struct ftrace_probe_ops *ops); +extern void unregister_ftrace_function_probe_all(char *glob); + enum { FTRACE_FL_FREE = (1 << 0), FTRACE_FL_FAILED = (1 << 1), @@ -119,6 +153,9 @@ struct dyn_ftrace { int ftrace_force_update(void); void ftrace_set_filter(unsigned char *buf, int len, int reset); +int register_ftrace_command(struct ftrace_func_command *cmd); +int unregister_ftrace_command(struct ftrace_func_command *cmd); + /* defined in arch */ extern int ftrace_ip_converted(unsigned long ip); extern int ftrace_dyn_arch_init(void *data); @@ -126,6 +163,10 @@ extern int ftrace_update_ftrace_func(ftr extern void ftrace_caller(void); extern void ftrace_call(void); extern void mcount_call(void); + +#ifndef FTRACE_ADDR +#define FTRACE_ADDR ((unsigned long)ftrace_caller) +#endif #ifdef CONFIG_FUNCTION_GRAPH_TRACER extern void ftrace_graph_caller(void); extern int ftrace_enable_ftrace_graph_caller(void); @@ -136,7 +177,7 @@ static inline int ftrace_disable_ftrace_ #endif /** - * ftrace_make_nop - convert code into top + * ftrace_make_nop - convert code into nop * @mod: module structure if called by module load initialization * @rec: the mcount call site record * @addr: the address that the call site should be calling @@ -198,6 +239,14 @@ extern void ftrace_enable_daemon(void); # define ftrace_disable_daemon() do { } while (0) # define ftrace_enable_daemon() do { } while (0) static inline void ftrace_release(void *start, unsigned long size) { } +static inline int register_ftrace_command(struct ftrace_func_command *cmd) +{ + return -EINVAL; +} +static inline int unregister_ftrace_command(char *cmd_name) +{ + return -EINVAL; +} #endif /* CONFIG_DYNAMIC_FTRACE */ /* totally disable ftrace - can not re-enable after this */ @@ -298,6 +347,9 @@ ftrace_special(unsigned long arg1, unsig extern int __ftrace_printk(unsigned long ip, const char *fmt, ...) __attribute__ ((format (printf, 2, 3))); +# define ftrace_vprintk(fmt, ap) __ftrace_printk(_THIS_IP_, fmt, ap) +extern int +__ftrace_vprintk(unsigned long ip, const char *fmt, va_list ap); extern void ftrace_dump(void); #else static inline void @@ -313,6 +365,11 @@ ftrace_printk(const char *fmt, ...) { return 0; } +static inline int +ftrace_vprintk(const char *fmt, va_list ap) +{ + return 0; +} static inline void ftrace_dump(void) { } #endif @@ -327,36 +384,6 @@ ftrace_init_module(struct module *mod, unsigned long *start, unsigned long *end) { } #endif -enum { - POWER_NONE = 0, - POWER_CSTATE = 1, - POWER_PSTATE = 2, -}; - -struct power_trace { -#ifdef CONFIG_POWER_TRACER - ktime_t stamp; - ktime_t end; - int type; - int state; -#endif -}; - -#ifdef CONFIG_POWER_TRACER -extern void trace_power_start(struct power_trace *it, unsigned int type, - unsigned int state); -extern void trace_power_mark(struct power_trace *it, unsigned int type, - unsigned int state); -extern void trace_power_end(struct power_trace *it); -#else -static inline void trace_power_start(struct power_trace *it, unsigned int type, - unsigned int state) { } -static inline void trace_power_mark(struct power_trace *it, unsigned int type, - unsigned int state) { } -static inline void trace_power_end(struct power_trace *it) { } -#endif - - /* * Structure that defines an entry function trace. */ @@ -380,6 +407,30 @@ struct ftrace_graph_ret { #ifdef CONFIG_FUNCTION_GRAPH_TRACER /* + * Stack of return addresses for functions + * of a thread. + * Used in struct thread_info + */ +struct ftrace_ret_stack { + unsigned long ret; + unsigned long func; + unsigned long long calltime; +}; + +/* + * Primary handler of a function return. + * It relays on ftrace_return_to_handler. + * Defined in entry_32/64.S + */ +extern void return_to_handler(void); + +extern int +ftrace_push_return_trace(unsigned long ret, unsigned long long time, + unsigned long func, int *depth); +extern void +ftrace_pop_return_trace(struct ftrace_graph_ret *trace, unsigned long *ret); + +/* * Sometimes we don't want to trace a function with the function * graph tracer but we want them to keep traced by the usual function * tracer if the function graph tracer is not configured. @@ -492,4 +543,17 @@ static inline int test_tsk_trace_graph(s #endif /* CONFIG_TRACING */ + +#ifdef CONFIG_HW_BRANCH_TRACER + +void trace_hw_branch(u64 from, u64 to); +void trace_hw_branch_oops(void); + +#else /* CONFIG_HW_BRANCH_TRACER */ + +static inline void trace_hw_branch(u64 from, u64 to) {} +static inline void trace_hw_branch_oops(void) {} + +#endif /* CONFIG_HW_BRANCH_TRACER */ + #endif /* _LINUX_FTRACE_H */ Index: linux-2.6-tip/include/linux/ftrace_irq.h =================================================================== --- linux-2.6-tip.orig/include/linux/ftrace_irq.h +++ linux-2.6-tip/include/linux/ftrace_irq.h @@ -2,7 +2,7 @@ #define _LINUX_FTRACE_IRQ_H -#if defined(CONFIG_DYNAMIC_FTRACE) || defined(CONFIG_FUNCTION_GRAPH_TRACER) +#ifdef CONFIG_FTRACE_NMI_ENTER extern void ftrace_nmi_enter(void); extern void ftrace_nmi_exit(void); #else Index: linux-2.6-tip/include/linux/gfp.h =================================================================== --- linux-2.6-tip.orig/include/linux/gfp.h +++ linux-2.6-tip/include/linux/gfp.h @@ -51,7 +51,13 @@ struct vm_area_struct; #define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */ #define __GFP_MOVABLE ((__force gfp_t)0x100000u) /* Page is movable */ -#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */ +#ifdef CONFIG_KMEMCHECK +#define __GFP_NOTRACK ((__force gfp_t)0x200000u) /* Don't track with kmemcheck */ +#else +#define __GFP_NOTRACK ((__force gfp_t)0) +#endif + +#define __GFP_BITS_SHIFT 22 /* Room for 22 __GFP_FOO bits */ #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /* This equals 0, but use constants in case they ever change */ Index: linux-2.6-tip/include/linux/hardirq.h =================================================================== --- linux-2.6-tip.orig/include/linux/hardirq.h +++ linux-2.6-tip/include/linux/hardirq.h @@ -15,71 +15,74 @@ * - bits 0-7 are the preemption count (max preemption depth: 256) * - bits 8-15 are the softirq count (max # of softirqs: 256) * - * The hardirq count can be overridden per architecture, the default is: + * The hardirq count can in theory reach the same as NR_IRQS. + * In reality, the number of nested IRQS is limited to the stack + * size as well. For archs with over 1000 IRQS it is not practical + * to expect that they will all nest. We give a max of 10 bits for + * hardirq nesting. An arch may choose to give less than 10 bits. + * m68k expects it to be 8. * - * - bits 16-27 are the hardirq count (max # of hardirqs: 4096) - * - ( bit 28 is the PREEMPT_ACTIVE flag. ) + * - bits 16-25 are the hardirq count (max # of nested hardirqs: 1024) + * - bit 26 is the NMI_MASK + * - bit 28 is the PREEMPT_ACTIVE flag * * PREEMPT_MASK: 0x000000ff * SOFTIRQ_MASK: 0x0000ff00 - * HARDIRQ_MASK: 0x0fff0000 + * HARDIRQ_MASK: 0x03ff0000 + * NMI_MASK: 0x04000000 */ #define PREEMPT_BITS 8 #define SOFTIRQ_BITS 8 +#define NMI_BITS 1 -#ifndef HARDIRQ_BITS -#define HARDIRQ_BITS 12 +#define MAX_HARDIRQ_BITS 10 -#ifndef MAX_HARDIRQS_PER_CPU -#define MAX_HARDIRQS_PER_CPU NR_IRQS +#ifndef HARDIRQ_BITS +# define HARDIRQ_BITS MAX_HARDIRQ_BITS #endif -/* - * The hardirq mask has to be large enough to have space for potentially - * all IRQ sources in the system nesting on a single CPU. - */ -#if (1 << HARDIRQ_BITS) < MAX_HARDIRQS_PER_CPU -# error HARDIRQ_BITS is too low! -#endif +#if HARDIRQ_BITS > MAX_HARDIRQ_BITS +#error HARDIRQ_BITS too high! #endif #define PREEMPT_SHIFT 0 #define SOFTIRQ_SHIFT (PREEMPT_SHIFT + PREEMPT_BITS) #define HARDIRQ_SHIFT (SOFTIRQ_SHIFT + SOFTIRQ_BITS) +#define NMI_SHIFT (HARDIRQ_SHIFT + HARDIRQ_BITS) #define __IRQ_MASK(x) ((1UL << (x))-1) #define PREEMPT_MASK (__IRQ_MASK(PREEMPT_BITS) << PREEMPT_SHIFT) #define SOFTIRQ_MASK (__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT) #define HARDIRQ_MASK (__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT) +#define NMI_MASK (__IRQ_MASK(NMI_BITS) << NMI_SHIFT) #define PREEMPT_OFFSET (1UL << PREEMPT_SHIFT) #define SOFTIRQ_OFFSET (1UL << SOFTIRQ_SHIFT) #define HARDIRQ_OFFSET (1UL << HARDIRQ_SHIFT) +#define NMI_OFFSET (1UL << NMI_SHIFT) -#if PREEMPT_ACTIVE < (1 << (HARDIRQ_SHIFT + HARDIRQ_BITS)) +#if PREEMPT_ACTIVE < (1 << (NMI_SHIFT + NMI_BITS)) #error PREEMPT_ACTIVE is too low! #endif #define hardirq_count() (preempt_count() & HARDIRQ_MASK) #define softirq_count() (preempt_count() & SOFTIRQ_MASK) -#define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK)) +#define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK \ + | NMI_MASK)) /* * Are we doing bottom half or hardware interrupt processing? * Are we in a softirq context? Interrupt context? */ -#define in_irq() (hardirq_count()) -#define in_softirq() (softirq_count()) -#define in_interrupt() (irq_count()) - -#if defined(CONFIG_PREEMPT) -# define PREEMPT_INATOMIC_BASE kernel_locked() -# define PREEMPT_CHECK_OFFSET 1 -#else -# define PREEMPT_INATOMIC_BASE 0 -# define PREEMPT_CHECK_OFFSET 0 -#endif +#define in_irq() (hardirq_count() || (current->flags & PF_HARDIRQ)) +#define in_softirq() (softirq_count() || (current->flags & PF_SOFTIRQ)) +#define in_interrupt() (irq_count()) + +/* + * Are we in NMI context? + */ +#define in_nmi() (preempt_count() & NMI_MASK) /* * Are we running in atomic context? WARNING: this macro cannot @@ -88,14 +91,7 @@ * used in the general case to determine whether sleeping is possible. * Do not use in_atomic() in driver code. */ -#define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_INATOMIC_BASE) - -/* - * Check whether we were atomic before we did preempt_disable(): - * (used by the scheduler, *after* releasing the kernel lock) - */ -#define in_atomic_preempt_off() \ - ((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET) +#define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != 0) #ifdef CONFIG_PREEMPT # define preemptible() (preempt_count() == 0 && !irqs_disabled()) @@ -164,20 +160,24 @@ extern void irq_enter(void); */ extern void irq_exit(void); -#define nmi_enter() \ - do { \ - ftrace_nmi_enter(); \ - lockdep_off(); \ - rcu_nmi_enter(); \ - __irq_enter(); \ +#define nmi_enter() \ + do { \ + ftrace_nmi_enter(); \ + BUG_ON(in_nmi()); \ + add_preempt_count(NMI_OFFSET + HARDIRQ_OFFSET); \ + lockdep_off(); \ + rcu_nmi_enter(); \ + trace_hardirq_enter(); \ } while (0) -#define nmi_exit() \ - do { \ - __irq_exit(); \ - rcu_nmi_exit(); \ - lockdep_on(); \ - ftrace_nmi_exit(); \ +#define nmi_exit() \ + do { \ + trace_hardirq_exit(); \ + rcu_nmi_exit(); \ + lockdep_on(); \ + BUG_ON(!in_nmi()); \ + sub_preempt_count(NMI_OFFSET + HARDIRQ_OFFSET); \ + ftrace_nmi_exit(); \ } while (0) #endif /* LINUX_HARDIRQ_H */ Index: linux-2.6-tip/include/linux/in6.h =================================================================== --- linux-2.6-tip.orig/include/linux/in6.h +++ linux-2.6-tip/include/linux/in6.h @@ -44,11 +44,11 @@ struct in6_addr * NOTE: Be aware the IN6ADDR_* constants and in6addr_* externals are defined * in network byte order, not in host byte order as are the IPv4 equivalents */ +#ifdef __KERNEL__ extern const struct in6_addr in6addr_any; #define IN6ADDR_ANY_INIT { { { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 } } } extern const struct in6_addr in6addr_loopback; #define IN6ADDR_LOOPBACK_INIT { { { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 } } } -#ifdef __KERNEL__ extern const struct in6_addr in6addr_linklocal_allnodes; #define IN6ADDR_LINKLOCAL_ALLNODES_INIT \ { { { 0xff,2,0,0,0,0,0,0,0,0,0,0,0,0,0,1 } } } Index: linux-2.6-tip/include/linux/init.h =================================================================== --- linux-2.6-tip.orig/include/linux/init.h +++ linux-2.6-tip/include/linux/init.h @@ -313,16 +313,20 @@ void __init parse_early_param(void); #define __initdata_or_module __initdata #endif /*CONFIG_MODULES*/ -/* Functions marked as __devexit may be discarded at kernel link time, depending - on config options. Newer versions of binutils detect references from - retained sections to discarded sections and flag an error. Pointers to - __devexit functions must use __devexit_p(function_name), the wrapper will - insert either the function_name or NULL, depending on the config options. +/* + * Functions marked as __devexit may be discarded at kernel link time, + * depending on config options. Newer versions of binutils detect + * references from retained sections to discarded sections and flag an + * error. + * + * Pointers to __devexit functions must use __devexit_p(function_name), + * the wrapper will insert either the function_name or NULL, depending on + * the config options. */ #if defined(MODULE) || defined(CONFIG_HOTPLUG) -#define __devexit_p(x) x +# define __devexit_p(x) x #else -#define __devexit_p(x) NULL +# define __devexit_p(x) ((void *)((long)(x) & 0) /* NULL */) #endif #ifdef MODULE Index: linux-2.6-tip/include/linux/init_task.h =================================================================== --- linux-2.6-tip.orig/include/linux/init_task.h +++ linux-2.6-tip/include/linux/init_task.h @@ -10,6 +10,7 @@ #include #include #include +#include extern struct files_struct init_files; extern struct fs_struct init_fs; @@ -51,7 +52,7 @@ extern struct fs_struct init_fs; .cputimer = { \ .cputime = INIT_CPUTIME, \ .running = 0, \ - .lock = __SPIN_LOCK_UNLOCKED(sig.cputimer.lock), \ + .lock = RAW_SPIN_LOCK_UNLOCKED(sig.cputimer.lock), \ }, \ } @@ -120,6 +121,16 @@ extern struct group_info init_groups; extern struct cred init_cred; +#ifdef CONFIG_PERF_COUNTERS +# define INIT_PERF_COUNTERS(tsk) \ + .perf_counter_ctx.counter_list = \ + LIST_HEAD_INIT(tsk.perf_counter_ctx.counter_list), \ + .perf_counter_ctx.lock = \ + RAW_SPIN_LOCK_UNLOCKED(tsk.perf_counter_ctx.lock), +#else +# define INIT_PERF_COUNTERS(tsk) +#endif + /* * INIT_TASK is used to set up the first task table, touch at * your own risk!. Base=0, limit=0x1fffff (=2MB) @@ -147,6 +158,7 @@ extern struct cred init_cred; .nr_cpus_allowed = NR_CPUS, \ }, \ .tasks = LIST_HEAD_INIT(tsk.tasks), \ + .pushable_tasks = PLIST_NODE_INIT(tsk.pushable_tasks, MAX_PRIO), \ .ptraced = LIST_HEAD_INIT(tsk.ptraced), \ .ptrace_entry = LIST_HEAD_INIT(tsk.ptrace_entry), \ .real_parent = &tsk, \ @@ -173,8 +185,9 @@ extern struct cred init_cred; .journal_info = NULL, \ .cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \ .fs_excl = ATOMIC_INIT(0), \ - .pi_lock = __SPIN_LOCK_UNLOCKED(tsk.pi_lock), \ .timer_slack_ns = 50000, /* 50 usec default slack */ \ + .posix_timer_list = NULL, \ + .pi_lock = RAW_SPIN_LOCK_UNLOCKED(tsk.pi_lock), \ .pids = { \ [PIDTYPE_PID] = INIT_PID_LINK(PIDTYPE_PID), \ [PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID), \ @@ -184,6 +197,7 @@ extern struct cred init_cred; INIT_IDS \ INIT_TRACE_IRQFLAGS \ INIT_LOCKDEP \ + INIT_PERF_COUNTERS(tsk) \ } Index: linux-2.6-tip/include/linux/intel-iommu.h =================================================================== --- linux-2.6-tip.orig/include/linux/intel-iommu.h +++ linux-2.6-tip/include/linux/intel-iommu.h @@ -330,11 +330,4 @@ extern int qi_flush_iotlb(struct intel_i extern void qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu); -extern void *intel_alloc_coherent(struct device *, size_t, dma_addr_t *, gfp_t); -extern void intel_free_coherent(struct device *, size_t, void *, dma_addr_t); -extern dma_addr_t intel_map_single(struct device *, phys_addr_t, size_t, int); -extern void intel_unmap_single(struct device *, dma_addr_t, size_t, int); -extern int intel_map_sg(struct device *, struct scatterlist *, int, int); -extern void intel_unmap_sg(struct device *, struct scatterlist *, int, int); - #endif Index: linux-2.6-tip/include/linux/interrupt.h =================================================================== --- linux-2.6-tip.orig/include/linux/interrupt.h +++ linux-2.6-tip/include/linux/interrupt.h @@ -54,10 +54,12 @@ #define IRQF_SAMPLE_RANDOM 0x00000040 #define IRQF_SHARED 0x00000080 #define IRQF_PROBE_SHARED 0x00000100 -#define IRQF_TIMER 0x00000200 +#define __IRQF_TIMER 0x00000200 #define IRQF_PERCPU 0x00000400 #define IRQF_NOBALANCING 0x00000800 #define IRQF_IRQPOLL 0x00001000 +#define IRQF_NODELAY 0x00002000 +#define IRQF_TIMER (__IRQF_TIMER | IRQF_NODELAY) typedef irqreturn_t (*irq_handler_t)(int, void *); @@ -69,7 +71,7 @@ struct irqaction { void *dev_id; struct irqaction *next; int irq; - struct proc_dir_entry *dir; + struct proc_dir_entry *dir, *threaded; }; extern irqreturn_t no_action(int cpl, void *dev_id); @@ -99,7 +101,7 @@ extern void devm_free_irq(struct device #ifdef CONFIG_LOCKDEP # define local_irq_enable_in_hardirq() do { } while (0) #else -# define local_irq_enable_in_hardirq() local_irq_enable() +# define local_irq_enable_in_hardirq() local_irq_enable_nort() #endif extern void disable_irq_nosync(unsigned int irq); @@ -224,6 +226,7 @@ static inline int disable_irq_wake(unsig #ifndef __ARCH_SET_SOFTIRQ_PENDING #define set_softirq_pending(x) (local_softirq_pending() = (x)) +// FIXME: PREEMPT_RT: set_bit()? #define or_softirq_pending(x) (local_softirq_pending() |= (x)) #endif @@ -254,6 +257,8 @@ enum SCHED_SOFTIRQ, HRTIMER_SOFTIRQ, RCU_SOFTIRQ, /* Preferable RCU should always be the last softirq */ + /* Entries after this are ignored in split softirq mode */ + MAX_SOFTIRQ, NR_SOFTIRQS }; @@ -267,13 +272,21 @@ struct softirq_action void (*action)(struct softirq_action *); }; +#ifdef CONFIG_PREEMPT_HARDIRQS +# define __raise_softirq_irqoff(nr) raise_softirq_irqoff(nr) +# define __do_raise_softirq_irqoff(nr) do { or_softirq_pending(1UL << (nr)); } while (0) +#else +# define __raise_softirq_irqoff(nr) do { or_softirq_pending(1UL << (nr)); } while (0) +# define __do_raise_softirq_irqoff(nr) __raise_softirq_irqoff(nr) +#endif + asmlinkage void do_softirq(void); asmlinkage void __do_softirq(void); extern void open_softirq(int nr, void (*action)(struct softirq_action *)); extern void softirq_init(void); -#define __raise_softirq_irqoff(nr) do { or_softirq_pending(1UL << (nr)); } while (0) extern void raise_softirq_irqoff(unsigned int nr); extern void raise_softirq(unsigned int nr); +extern void wakeup_irqd(void); /* This is the worklist that queues up per-cpu softirq work. * @@ -283,6 +296,11 @@ extern void raise_softirq(unsigned int n * only be accessed by the local cpu that they are for. */ DECLARE_PER_CPU(struct list_head [NR_SOFTIRQS], softirq_work_list); +#ifdef CONFIG_PREEMPT_SOFTIRQS +extern void wait_for_softirq(int softirq); +#else +# define wait_for_softirq(x) do {} while(0) +#endif /* Try to send a softirq to a remote cpu. If this cannot be done, the * work will be queued to the local cpu. @@ -308,8 +326,9 @@ extern void __send_remote_softirq(struct to be executed on some cpu at least once after this. * If the tasklet is already scheduled, but its excecution is still not started, it will be executed only once. - * If this tasklet is already running on another CPU (or schedule is called - from tasklet itself), it is rescheduled for later. + * If this tasklet is already running on another CPU, it is rescheduled + for later. + * Schedule must not be called from the tasklet itself (a lockup occurs) * Tasklet is strictly serialized wrt itself, but not wrt another tasklets. If client needs some intertask synchronization, he makes it with spinlocks. @@ -334,27 +353,36 @@ struct tasklet_struct name = { NULL, 0, enum { TASKLET_STATE_SCHED, /* Tasklet is scheduled for execution */ - TASKLET_STATE_RUN /* Tasklet is running (SMP only) */ + TASKLET_STATE_RUN, /* Tasklet is running (SMP only) */ + TASKLET_STATE_PENDING /* Tasklet is pending */ }; -#ifdef CONFIG_SMP +#define TASKLET_STATEF_SCHED (1 << TASKLET_STATE_SCHED) +#define TASKLET_STATEF_RUN (1 << TASKLET_STATE_RUN) +#define TASKLET_STATEF_PENDING (1 << TASKLET_STATE_PENDING) + +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT) static inline int tasklet_trylock(struct tasklet_struct *t) { return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state); } +static inline int tasklet_tryunlock(struct tasklet_struct *t) +{ + return cmpxchg(&t->state, TASKLET_STATEF_RUN, 0) == TASKLET_STATEF_RUN; +} + static inline void tasklet_unlock(struct tasklet_struct *t) { smp_mb__before_clear_bit(); clear_bit(TASKLET_STATE_RUN, &(t)->state); } -static inline void tasklet_unlock_wait(struct tasklet_struct *t) -{ - while (test_bit(TASKLET_STATE_RUN, &(t)->state)) { barrier(); } -} +extern void tasklet_unlock_wait(struct tasklet_struct *t); + #else #define tasklet_trylock(t) 1 +#define tasklet_tryunlock(t) 1 #define tasklet_unlock_wait(t) do { } while (0) #define tasklet_unlock(t) do { } while (0) #endif @@ -375,6 +403,20 @@ static inline void tasklet_hi_schedule(s __tasklet_hi_schedule(t); } +extern void __tasklet_hi_schedule_first(struct tasklet_struct *t); + +/* + * This version avoids touching any other tasklets. Needed for kmemcheck + * in order not to take any page faults while enqueueing this tasklet; + * consider VERY carefully whether you really need this or + * tasklet_hi_schedule()... + */ +static inline void tasklet_hi_schedule_first(struct tasklet_struct *t) +{ + if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state)) + __tasklet_hi_schedule_first(t); +} + static inline void tasklet_disable_nosync(struct tasklet_struct *t) { @@ -389,22 +431,14 @@ static inline void tasklet_disable(struc smp_mb(); } -static inline void tasklet_enable(struct tasklet_struct *t) -{ - smp_mb__before_atomic_dec(); - atomic_dec(&t->count); -} - -static inline void tasklet_hi_enable(struct tasklet_struct *t) -{ - smp_mb__before_atomic_dec(); - atomic_dec(&t->count); -} +extern void tasklet_enable(struct tasklet_struct *t); +extern void tasklet_hi_enable(struct tasklet_struct *t); extern void tasklet_kill(struct tasklet_struct *t); extern void tasklet_kill_immediate(struct tasklet_struct *t, unsigned int cpu); extern void tasklet_init(struct tasklet_struct *t, void (*func)(unsigned long), unsigned long data); +void takeover_tasklets(unsigned int cpu); /* * Autoprobing for irqs: @@ -462,12 +496,52 @@ static inline void init_irq_proc(void) } #endif +#if defined(CONFIG_GENERIC_HARDIRQS) && defined(CONFIG_DEBUG_SHIRQ) +extern void debug_poll_all_shared_irqs(void); +#else +static inline void debug_poll_all_shared_irqs(void) { } +#endif + int show_interrupts(struct seq_file *p, void *v); struct irq_desc; extern int early_irq_init(void); +extern int arch_probe_nr_irqs(void); extern int arch_early_irq_init(void); extern int arch_init_chip_data(struct irq_desc *desc, int cpu); +#ifdef CONFIG_PREEMPT_RT +# define local_irq_disable_nort() do { } while (0) +# define local_irq_enable_nort() do { } while (0) +# define local_irq_enable_rt() local_irq_enable() +# define local_irq_save_nort(flags) do { local_save_flags(flags); } while (0) +# define local_irq_restore_nort(flags) do { (void)(flags); } while (0) +# define spin_lock_nort(lock) do { } while (0) +# define spin_unlock_nort(lock) do { } while (0) +# define spin_lock_bh_nort(lock) do { } while (0) +# define spin_unlock_bh_nort(lock) do { } while (0) +# define spin_lock_rt(lock) spin_lock(lock) +# define spin_unlock_rt(lock) spin_unlock(lock) +# define smp_processor_id_rt(cpu) (cpu) +# define in_atomic_rt() (!oops_in_progress && \ + (in_atomic() || irqs_disabled())) +# define read_trylock_rt(lock) ({read_lock(lock); 1; }) +#else +# define local_irq_disable_nort() local_irq_disable() +# define local_irq_enable_nort() local_irq_enable() +# define local_irq_enable_rt() do { } while (0) +# define local_irq_save_nort(flags) local_irq_save(flags) +# define local_irq_restore_nort(flags) local_irq_restore(flags) +# define spin_lock_rt(lock) do { } while (0) +# define spin_unlock_rt(lock) do { } while (0) +# define spin_lock_nort(lock) spin_lock(lock) +# define spin_unlock_nort(lock) spin_unlock(lock) +# define spin_lock_bh_nort(lock) spin_lock_bh(lock) +# define spin_unlock_bh_nort(lock) spin_unlock_bh(lock) +# define smp_processor_id_rt(cpu) smp_processor_id() +# define in_atomic_rt() 0 +# define read_trylock_rt(lock) read_trylock(lock) +#endif + #endif Index: linux-2.6-tip/include/linux/irq.h =================================================================== --- linux-2.6-tip.orig/include/linux/irq.h +++ linux-2.6-tip/include/linux/irq.h @@ -20,10 +20,12 @@ #include #include #include +#include #include #include #include +#include struct irq_desc; typedef void (*irq_flow_handler_t)(unsigned int irq, @@ -65,6 +67,7 @@ typedef void (*irq_flow_handler_t)(unsig #define IRQ_SPURIOUS_DISABLED 0x00800000 /* IRQ was disabled by the spurious trap */ #define IRQ_MOVE_PCNTXT 0x01000000 /* IRQ migration from process context */ #define IRQ_AFFINITY_SET 0x02000000 /* IRQ affinity was set from userspace*/ +#define IRQ_NODELAY 0x40000000 /* IRQ must run immediately */ #ifdef CONFIG_IRQ_PER_CPU # define CHECK_IRQ_PER_CPU(var) ((var) & IRQ_PER_CPU) @@ -151,6 +154,8 @@ struct irq_2_iommu; * @irq_count: stats field to detect stalled irqs * @last_unhandled: aging timer for unhandled count * @irqs_unhandled: stats field for spurious unhandled interrupts + * @thread: Thread pointer for threaded preemptible irq handling + * @wait_for_handler: Waitqueue to wait for a running preemptible handler * @lock: locking for SMP * @affinity: IRQ affinity on SMP * @cpu: cpu index useful for balancing @@ -160,12 +165,10 @@ struct irq_2_iommu; */ struct irq_desc { unsigned int irq; -#ifdef CONFIG_SPARSE_IRQ struct timer_rand_state *timer_rand_state; unsigned int *kstat_irqs; -# ifdef CONFIG_INTR_REMAP +#ifdef CONFIG_INTR_REMAP struct irq_2_iommu *irq_2_iommu; -# endif #endif irq_flow_handler_t handle_irq; struct irq_chip *chip; @@ -180,13 +183,16 @@ struct irq_desc { unsigned int irq_count; /* For detecting broken IRQs */ unsigned long last_unhandled; /* Aging timer for unhandled count */ unsigned int irqs_unhandled; - spinlock_t lock; + struct task_struct *thread; + wait_queue_head_t wait_for_handler; + cycles_t timestamp; + raw_spinlock_t lock; #ifdef CONFIG_SMP - cpumask_t affinity; + cpumask_var_t affinity; unsigned int cpu; -#endif #ifdef CONFIG_GENERIC_PENDING_IRQ - cpumask_t pending_mask; + cpumask_var_t pending_mask; +#endif #endif #ifdef CONFIG_PROC_FS struct proc_dir_entry *dir; @@ -202,12 +208,6 @@ extern void arch_free_chip_data(struct i extern struct irq_desc irq_desc[NR_IRQS]; #else /* CONFIG_SPARSE_IRQ */ extern struct irq_desc *move_irq_desc(struct irq_desc *old_desc, int cpu); - -#define kstat_irqs_this_cpu(DESC) \ - ((DESC)->kstat_irqs[smp_processor_id()]) -#define kstat_incr_irqs_this_cpu(irqno, DESC) \ - ((DESC)->kstat_irqs[smp_processor_id()]++) - #endif /* CONFIG_SPARSE_IRQ */ extern struct irq_desc *irq_to_desc_alloc_cpu(unsigned int irq, int cpu); @@ -418,8 +418,102 @@ extern int set_irq_msi(unsigned int irq, #define get_irq_desc_data(desc) ((desc)->handler_data) #define get_irq_desc_msi(desc) ((desc)->msi_desc) -#endif /* CONFIG_GENERIC_HARDIRQS */ +/* Early initialization of irqs */ +extern void early_init_hardirqs(void); + +#if defined(CONFIG_PREEMPT_HARDIRQS) +extern void init_hardirqs(void); +#else +static inline void init_hardirqs(void) { } +#endif + +#else /* end GENERIC HARDIRQS */ + +static inline void early_init_hardirqs(void) { } +static inline void init_hardirqs(void) { } + +#endif /* !CONFIG_GENERIC_HARDIRQS */ #endif /* !CONFIG_S390 */ +#ifdef CONFIG_SMP +/** + * init_alloc_desc_masks - allocate cpumasks for irq_desc + * @desc: pointer to irq_desc struct + * @cpu: cpu which will be handling the cpumasks + * @boot: true if need bootmem + * + * Allocates affinity and pending_mask cpumask if required. + * Returns true if successful (or not required). + * Side effect: affinity has all bits set, pending_mask has all bits clear. + */ +static inline bool init_alloc_desc_masks(struct irq_desc *desc, int cpu, + bool boot) +{ + int node; + + if (boot) { + alloc_bootmem_cpumask_var(&desc->affinity); + cpumask_setall(desc->affinity); + +#ifdef CONFIG_GENERIC_PENDING_IRQ + alloc_bootmem_cpumask_var(&desc->pending_mask); + cpumask_clear(desc->pending_mask); +#endif + return true; + } + + node = cpu_to_node(cpu); + + if (!alloc_cpumask_var_node(&desc->affinity, GFP_ATOMIC, node)) + return false; + cpumask_setall(desc->affinity); + +#ifdef CONFIG_GENERIC_PENDING_IRQ + if (!alloc_cpumask_var_node(&desc->pending_mask, GFP_ATOMIC, node)) { + free_cpumask_var(desc->affinity); + return false; + } + cpumask_clear(desc->pending_mask); +#endif + return true; +} + +/** + * init_copy_desc_masks - copy cpumasks for irq_desc + * @old_desc: pointer to old irq_desc struct + * @new_desc: pointer to new irq_desc struct + * + * Insures affinity and pending_masks are copied to new irq_desc. + * If !CONFIG_CPUMASKS_OFFSTACK the cpumasks are embedded in the + * irq_desc struct so the copy is redundant. + */ + +static inline void init_copy_desc_masks(struct irq_desc *old_desc, + struct irq_desc *new_desc) +{ +#ifdef CONFIG_CPUMASKS_OFFSTACK + cpumask_copy(new_desc->affinity, old_desc->affinity); + +#ifdef CONFIG_GENERIC_PENDING_IRQ + cpumask_copy(new_desc->pending_mask, old_desc->pending_mask); +#endif +#endif +} + +#else /* !CONFIG_SMP */ + +static inline bool init_alloc_desc_masks(struct irq_desc *desc, int cpu, + bool boot) +{ + return true; +} + +static inline void init_copy_desc_masks(struct irq_desc *old_desc, + struct irq_desc *new_desc) +{ +} + +#endif /* CONFIG_SMP */ + #endif /* _LINUX_IRQ_H */ Index: linux-2.6-tip/include/linux/irqnr.h =================================================================== --- linux-2.6-tip.orig/include/linux/irqnr.h +++ linux-2.6-tip/include/linux/irqnr.h @@ -20,6 +20,7 @@ # define for_each_irq_desc_reverse(irq, desc) \ for (irq = nr_irqs - 1; irq >= 0; irq--) + #else /* CONFIG_GENERIC_HARDIRQS */ extern int nr_irqs; @@ -28,13 +29,17 @@ extern struct irq_desc *irq_to_desc(unsi # define for_each_irq_desc(irq, desc) \ for (irq = 0, desc = irq_to_desc(irq); irq < nr_irqs; \ irq++, desc = irq_to_desc(irq)) \ - if (desc) + if (!desc) \ + ; \ + else # define for_each_irq_desc_reverse(irq, desc) \ for (irq = nr_irqs - 1, desc = irq_to_desc(irq); irq >= 0; \ irq--, desc = irq_to_desc(irq)) \ - if (desc) + if (!desc) \ + ; \ + else #endif /* CONFIG_GENERIC_HARDIRQS */ Index: linux-2.6-tip/include/linux/kernel.h =================================================================== --- linux-2.6-tip.orig/include/linux/kernel.h +++ linux-2.6-tip/include/linux/kernel.h @@ -122,7 +122,7 @@ extern int _cond_resched(void); # define might_resched() do { } while (0) #endif -#ifdef CONFIG_DEBUG_SPINLOCK_SLEEP +#if defined(CONFIG_DEBUG_SPINLOCK_SLEEP) || defined(CONFIG_DEBUG_PREEMPT) void __might_sleep(char *file, int line); /** * might_sleep - annotation for functions that can sleep @@ -242,6 +242,19 @@ extern struct ratelimit_state printk_rat extern int printk_ratelimit(void); extern bool printk_timed_ratelimit(unsigned long *caller_jiffies, unsigned int interval_msec); + +/* + * Print a one-time message (analogous to WARN_ONCE() et al): + */ +#define printk_once(x...) ({ \ + static int __print_once = 1; \ + \ + if (__print_once) { \ + __print_once = 0; \ + printk(x); \ + } \ +}) + #else static inline int vprintk(const char *s, va_list args) __attribute__ ((format (printf, 1, 0))); @@ -253,6 +266,10 @@ static inline int printk_ratelimit(void) static inline bool printk_timed_ratelimit(unsigned long *caller_jiffies, \ unsigned int interval_msec) \ { return false; } + +/* No effect, but we still get type checking even in the !PRINTK case: */ +#define printk_once(x...) printk(x) + #endif extern int printk_needs_cpu(int cpu); @@ -261,6 +278,12 @@ extern void printk_tick(void); extern void asmlinkage __attribute__((format(printf, 1, 2))) early_printk(const char *fmt, ...); +#ifdef CONFIG_PREEMPT_RT +extern void zap_rt_locks(void); +#else +# define zap_rt_locks() do { } while (0) +#endif + unsigned long int_sqrt(unsigned long); static inline void console_silent(void) @@ -289,6 +312,7 @@ extern int root_mountflags; /* Values used for system_state */ extern enum system_states { SYSTEM_BOOTING, + SYSTEM_BOOTING_SCHEDULER_OK, SYSTEM_RUNNING, SYSTEM_HALT, SYSTEM_POWER_OFF, Index: linux-2.6-tip/include/linux/kernel_stat.h =================================================================== --- linux-2.6-tip.orig/include/linux/kernel_stat.h +++ linux-2.6-tip/include/linux/kernel_stat.h @@ -23,12 +23,14 @@ struct cpu_usage_stat { cputime64_t idle; cputime64_t iowait; cputime64_t steal; + cputime64_t user_rt; + cputime64_t system_rt; cputime64_t guest; }; struct kernel_stat { struct cpu_usage_stat cpustat; -#ifndef CONFIG_SPARSE_IRQ +#ifndef CONFIG_GENERIC_HARDIRQS unsigned int irqs[NR_IRQS]; #endif }; @@ -41,7 +43,7 @@ DECLARE_PER_CPU(struct kernel_stat, ksta extern unsigned long long nr_context_switches(void); -#ifndef CONFIG_SPARSE_IRQ +#ifndef CONFIG_GENERIC_HARDIRQS #define kstat_irqs_this_cpu(irq) \ (kstat_this_cpu.irqs[irq]) @@ -52,16 +54,19 @@ static inline void kstat_incr_irqs_this_ { kstat_this_cpu.irqs[irq]++; } -#endif - -#ifndef CONFIG_SPARSE_IRQ static inline unsigned int kstat_irqs_cpu(unsigned int irq, int cpu) { return kstat_cpu(cpu).irqs[irq]; } #else +#include extern unsigned int kstat_irqs_cpu(unsigned int irq, int cpu); +#define kstat_irqs_this_cpu(DESC) \ + ((DESC)->kstat_irqs[smp_processor_id()]) +#define kstat_incr_irqs_this_cpu(irqno, DESC) \ + ((DESC)->kstat_irqs[smp_processor_id()]++) + #endif /* @@ -78,7 +83,15 @@ static inline unsigned int kstat_irqs(un return sum; } + +/* + * Lock/unlock the current runqueue - to extract task statistics: + */ +extern void curr_rq_lock_irq_save(unsigned long *flags); +extern void curr_rq_unlock_irq_restore(unsigned long *flags); +extern unsigned long long __task_delta_exec(struct task_struct *tsk, int update); extern unsigned long long task_delta_exec(struct task_struct *); + extern void account_user_time(struct task_struct *, cputime_t, cputime_t); extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t); extern void account_steal_time(cputime_t); Index: linux-2.6-tip/include/linux/kmemcheck.h =================================================================== --- /dev/null +++ linux-2.6-tip/include/linux/kmemcheck.h @@ -0,0 +1,147 @@ +#ifndef LINUX_KMEMCHECK_H +#define LINUX_KMEMCHECK_H + +#include +#include + +/* + * How to use: If you have a struct using bitfields, for example + * + * struct a { + * int x:8, y:8; + * }; + * + * then this should be rewritten as + * + * struct a { + * kmemcheck_define_bitfield(flags, { + * int x:8, y:8; + * }); + * }; + * + * Now the "flags" member may be used to refer to the bitfield (and things + * like &x.flags is allowed). As soon as the struct is allocated, the bit- + * fields should be annotated: + * + * struct a *a = kmalloc(sizeof(struct a), GFP_KERNEL); + * if (a) + * kmemcheck_annotate_bitfield(a->flags); + * + * Note: We provide the same definitions for both kmemcheck and non- + * kmemcheck kernels. This makes it harder to introduce accidental errors. + */ +#define kmemcheck_define_bitfield(name, fields...) \ + union { \ + struct fields name; \ + struct fields; \ + }; + +#ifdef CONFIG_KMEMCHECK +extern int kmemcheck_enabled; + +void kmemcheck_init(void); + +/* The slab-related functions. */ +void kmemcheck_alloc_shadow(struct page *page, int order, gfp_t flags, int node); +void kmemcheck_free_shadow(struct page *page, int order); +void kmemcheck_slab_alloc(struct kmem_cache *s, gfp_t gfpflags, void *object, + size_t size); +void kmemcheck_slab_free(struct kmem_cache *s, void *object, size_t size); + +void kmemcheck_pagealloc_alloc(struct page *p, unsigned int order, + gfp_t gfpflags); + +void kmemcheck_show_pages(struct page *p, unsigned int n); +void kmemcheck_hide_pages(struct page *p, unsigned int n); + +bool kmemcheck_page_is_tracked(struct page *p); + +void kmemcheck_mark_unallocated(void *address, unsigned int n); +void kmemcheck_mark_uninitialized(void *address, unsigned int n); +void kmemcheck_mark_initialized(void *address, unsigned int n); +void kmemcheck_mark_freed(void *address, unsigned int n); + +void kmemcheck_mark_unallocated_pages(struct page *p, unsigned int n); +void kmemcheck_mark_uninitialized_pages(struct page *p, unsigned int n); +void kmemcheck_mark_initialized_pages(struct page *p, unsigned int n); + +int kmemcheck_show_addr(unsigned long address); +int kmemcheck_hide_addr(unsigned long address); + +#define kmemcheck_annotate_bitfield(field) \ + do { \ + memset(&(field), 0, sizeof(field)); \ + } while (0) +#else +#define kmemcheck_enabled 0 + +static inline void kmemcheck_init(void) +{ +} + +static inline void +kmemcheck_alloc_shadow(struct page *page, int order, gfp_t flags, int node) +{ +} + +static inline void +kmemcheck_free_shadow(struct page *page, int order) +{ +} + +static inline void +kmemcheck_slab_alloc(struct kmem_cache *s, gfp_t gfpflags, void *object, + size_t size) +{ +} + +static inline void kmemcheck_slab_free(struct kmem_cache *s, void *object, + size_t size) +{ +} + +static inline void kmemcheck_pagealloc_alloc(struct page *p, + unsigned int order, gfp_t gfpflags) +{ +} + +static inline bool kmemcheck_page_is_tracked(struct page *p) +{ + return false; +} + +static inline void kmemcheck_mark_unallocated(void *address, unsigned int n) +{ +} + +static inline void kmemcheck_mark_uninitialized(void *address, unsigned int n) +{ +} + +static inline void kmemcheck_mark_initialized(void *address, unsigned int n) +{ +} + +static inline void kmemcheck_mark_freed(void *address, unsigned int n) +{ +} + +static inline void kmemcheck_mark_unallocated_pages(struct page *p, + unsigned int n) +{ +} + +static inline void kmemcheck_mark_uninitialized_pages(struct page *p, + unsigned int n) +{ +} + +static inline void kmemcheck_mark_initialized_pages(struct page *p, + unsigned int n) +{ +} + +#define kmemcheck_annotate_bitfield(field) do { } while (0) +#endif /* CONFIG_KMEMCHECK */ + +#endif /* LINUX_KMEMCHECK_H */ Index: linux-2.6-tip/include/linux/kprobes.h =================================================================== --- linux-2.6-tip.orig/include/linux/kprobes.h +++ linux-2.6-tip/include/linux/kprobes.h @@ -156,7 +156,7 @@ struct kretprobe { int nmissed; size_t data_size; struct hlist_head free_instances; - spinlock_t lock; + raw_spinlock_t lock; }; struct kretprobe_instance { @@ -182,6 +182,14 @@ struct kprobe_blackpoint { DECLARE_PER_CPU(struct kprobe *, current_kprobe); DECLARE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk); +/* + * For #ifdef avoidance: + */ +static inline int kprobes_built_in(void) +{ + return 1; +} + #ifdef CONFIG_KRETPROBES extern void arch_prepare_kretprobe(struct kretprobe_instance *ri, struct pt_regs *regs); @@ -271,8 +279,16 @@ void unregister_kretprobes(struct kretpr void kprobe_flush_task(struct task_struct *tk); void recycle_rp_inst(struct kretprobe_instance *ri, struct hlist_head *head); -#else /* CONFIG_KPROBES */ +#else /* !CONFIG_KPROBES: */ +static inline int kprobes_built_in(void) +{ + return 0; +} +static inline int kprobe_fault_handler(struct pt_regs *regs, int trapnr) +{ + return 0; +} static inline struct kprobe *get_kprobe(void *addr) { return NULL; @@ -329,5 +345,5 @@ static inline void unregister_kretprobes static inline void kprobe_flush_task(struct task_struct *tk) { } -#endif /* CONFIG_KPROBES */ -#endif /* _LINUX_KPROBES_H */ +#endif /* CONFIG_KPROBES */ +#endif /* _LINUX_KPROBES_H */ Index: linux-2.6-tip/include/linux/latencytop.h =================================================================== --- linux-2.6-tip.orig/include/linux/latencytop.h +++ linux-2.6-tip/include/linux/latencytop.h @@ -9,6 +9,7 @@ #ifndef _INCLUDE_GUARD_LATENCYTOP_H_ #define _INCLUDE_GUARD_LATENCYTOP_H_ +#include #ifdef CONFIG_LATENCYTOP #define LT_SAVECOUNT 32 @@ -24,7 +25,14 @@ struct latency_record { struct task_struct; -void account_scheduler_latency(struct task_struct *task, int usecs, int inter); +extern int latencytop_enabled; +void __account_scheduler_latency(struct task_struct *task, int usecs, int inter); +static inline void +account_scheduler_latency(struct task_struct *task, int usecs, int inter) +{ + if (unlikely(latencytop_enabled)) + __account_scheduler_latency(task, usecs, inter); +} void clear_all_latency_tracing(struct task_struct *p); Index: linux-2.6-tip/include/linux/lockdep.h =================================================================== --- linux-2.6-tip.orig/include/linux/lockdep.h +++ linux-2.6-tip/include/linux/lockdep.h @@ -20,43 +20,10 @@ struct lockdep_map; #include /* - * Lock-class usage-state bits: + * We'd rather not expose kernel/lockdep_states.h this wide, but we do need + * the total number of states... :-( */ -enum lock_usage_bit -{ - LOCK_USED = 0, - LOCK_USED_IN_HARDIRQ, - LOCK_USED_IN_SOFTIRQ, - LOCK_ENABLED_SOFTIRQS, - LOCK_ENABLED_HARDIRQS, - LOCK_USED_IN_HARDIRQ_READ, - LOCK_USED_IN_SOFTIRQ_READ, - LOCK_ENABLED_SOFTIRQS_READ, - LOCK_ENABLED_HARDIRQS_READ, - LOCK_USAGE_STATES -}; - -/* - * Usage-state bitmasks: - */ -#define LOCKF_USED (1 << LOCK_USED) -#define LOCKF_USED_IN_HARDIRQ (1 << LOCK_USED_IN_HARDIRQ) -#define LOCKF_USED_IN_SOFTIRQ (1 << LOCK_USED_IN_SOFTIRQ) -#define LOCKF_ENABLED_HARDIRQS (1 << LOCK_ENABLED_HARDIRQS) -#define LOCKF_ENABLED_SOFTIRQS (1 << LOCK_ENABLED_SOFTIRQS) - -#define LOCKF_ENABLED_IRQS (LOCKF_ENABLED_HARDIRQS | LOCKF_ENABLED_SOFTIRQS) -#define LOCKF_USED_IN_IRQ (LOCKF_USED_IN_HARDIRQ | LOCKF_USED_IN_SOFTIRQ) - -#define LOCKF_USED_IN_HARDIRQ_READ (1 << LOCK_USED_IN_HARDIRQ_READ) -#define LOCKF_USED_IN_SOFTIRQ_READ (1 << LOCK_USED_IN_SOFTIRQ_READ) -#define LOCKF_ENABLED_HARDIRQS_READ (1 << LOCK_ENABLED_HARDIRQS_READ) -#define LOCKF_ENABLED_SOFTIRQS_READ (1 << LOCK_ENABLED_SOFTIRQS_READ) - -#define LOCKF_ENABLED_IRQS_READ \ - (LOCKF_ENABLED_HARDIRQS_READ | LOCKF_ENABLED_SOFTIRQS_READ) -#define LOCKF_USED_IN_IRQ_READ \ - (LOCKF_USED_IN_HARDIRQ_READ | LOCKF_USED_IN_SOFTIRQ_READ) +#define XXX_LOCK_USAGE_STATES (1+3*4) #define MAX_LOCKDEP_SUBCLASSES 8UL @@ -97,7 +64,7 @@ struct lock_class { * IRQ/softirq usage tracking bits: */ unsigned long usage_mask; - struct stack_trace usage_traces[LOCK_USAGE_STATES]; + struct stack_trace usage_traces[XXX_LOCK_USAGE_STATES]; /* * These fields represent a directed graph of lock dependencies, @@ -324,7 +291,11 @@ static inline void lock_set_subclass(str lock_set_class(lock, lock->name, lock->key, subclass, ip); } -# define INIT_LOCKDEP .lockdep_recursion = 0, +extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask); +extern void lockdep_clear_current_reclaim_state(void); +extern void lockdep_trace_alloc(gfp_t mask); + +# define INIT_LOCKDEP .lockdep_recursion = 0, .lockdep_reclaim_gfp = 0, #define lockdep_depth(tsk) (debug_locks ? (tsk)->lockdep_depth : 0) @@ -342,6 +313,9 @@ static inline void lockdep_on(void) # define lock_release(l, n, i) do { } while (0) # define lock_set_class(l, n, k, s, i) do { } while (0) # define lock_set_subclass(l, s, i) do { } while (0) +# define lockdep_set_current_reclaim_state(g) do { } while (0) +# define lockdep_clear_current_reclaim_state() do { } while (0) +# define lockdep_trace_alloc(g) do { } while (0) # define lockdep_init() do { } while (0) # define lockdep_info() do { } while (0) # define lockdep_init_map(lock, name, key, sub) \ Index: linux-2.6-tip/include/linux/magic.h =================================================================== --- linux-2.6-tip.orig/include/linux/magic.h +++ linux-2.6-tip/include/linux/magic.h @@ -49,4 +49,5 @@ #define FUTEXFS_SUPER_MAGIC 0xBAD1DEA #define INOTIFYFS_SUPER_MAGIC 0x2BAD1DEA +#define STACK_END_MAGIC 0x57AC6E9D #endif /* __LINUX_MAGIC_H__ */ Index: linux-2.6-tip/include/linux/mca-legacy.h =================================================================== --- linux-2.6-tip.orig/include/linux/mca-legacy.h +++ linux-2.6-tip/include/linux/mca-legacy.h @@ -9,7 +9,7 @@ #include -#warning "MCA legacy - please move your driver to the new sysfs api" +/* #warning "MCA legacy - please move your driver to the new sysfs api" */ /* MCA_NOTFOUND is an error condition. The other two indicate * motherboard POS registers contain the adapter. They might be Index: linux-2.6-tip/include/linux/mm_types.h =================================================================== --- linux-2.6-tip.orig/include/linux/mm_types.h +++ linux-2.6-tip/include/linux/mm_types.h @@ -94,6 +94,14 @@ struct page { void *virtual; /* Kernel virtual address (NULL if not kmapped, ie. highmem) */ #endif /* WANT_PAGE_VIRTUAL */ + +#ifdef CONFIG_KMEMCHECK + /* + * kmemcheck wants to track the status of each byte in a page; this + * is a pointer to such a status block. NULL if not tracked. + */ + void *shadow; +#endif }; /* @@ -233,6 +241,9 @@ struct mm_struct { /* Architecture-specific MM context */ mm_context_t context; + /* realtime bits */ + struct list_head delayed_drop; + /* Swap token stuff */ /* * Last value of global fault stamp as seen by this process. Index: linux-2.6-tip/include/linux/mmiotrace.h =================================================================== --- linux-2.6-tip.orig/include/linux/mmiotrace.h +++ linux-2.6-tip/include/linux/mmiotrace.h @@ -1,5 +1,5 @@ -#ifndef MMIOTRACE_H -#define MMIOTRACE_H +#ifndef _LINUX_MMIOTRACE_H +#define _LINUX_MMIOTRACE_H #include #include @@ -13,28 +13,34 @@ typedef void (*kmmio_post_handler_t)(str unsigned long condition, struct pt_regs *); struct kmmio_probe { - struct list_head list; /* kmmio internal list */ - unsigned long addr; /* start location of the probe point */ - unsigned long len; /* length of the probe region */ - kmmio_pre_handler_t pre_handler; /* Called before addr is executed. */ - kmmio_post_handler_t post_handler; /* Called after addr is executed */ - void *private; + /* kmmio internal list: */ + struct list_head list; + /* start location of the probe point: */ + unsigned long addr; + /* length of the probe region: */ + unsigned long len; + /* Called before addr is executed: */ + kmmio_pre_handler_t pre_handler; + /* Called after addr is executed: */ + kmmio_post_handler_t post_handler; + void *private; }; +extern unsigned int kmmio_count; + +extern int register_kmmio_probe(struct kmmio_probe *p); +extern void unregister_kmmio_probe(struct kmmio_probe *p); + +#ifdef CONFIG_MMIOTRACE /* kmmio is active by some kmmio_probes? */ static inline int is_kmmio_active(void) { - extern unsigned int kmmio_count; return kmmio_count; } -extern int register_kmmio_probe(struct kmmio_probe *p); -extern void unregister_kmmio_probe(struct kmmio_probe *p); - /* Called from page fault handler. */ extern int kmmio_handler(struct pt_regs *regs, unsigned long addr); -#ifdef CONFIG_MMIOTRACE /* Called from ioremap.c */ extern void mmiotrace_ioremap(resource_size_t offset, unsigned long size, void __iomem *addr); @@ -43,7 +49,17 @@ extern void mmiotrace_iounmap(volatile v /* For anyone to insert markers. Remember trailing newline. */ extern int mmiotrace_printk(const char *fmt, ...) __attribute__ ((format (printf, 1, 2))); -#else +#else /* !CONFIG_MMIOTRACE: */ +static inline int is_kmmio_active(void) +{ + return 0; +} + +static inline int kmmio_handler(struct pt_regs *regs, unsigned long addr) +{ + return 0; +} + static inline void mmiotrace_ioremap(resource_size_t offset, unsigned long size, void __iomem *addr) { @@ -63,28 +79,28 @@ static inline int mmiotrace_printk(const #endif /* CONFIG_MMIOTRACE */ enum mm_io_opcode { - MMIO_READ = 0x1, /* struct mmiotrace_rw */ - MMIO_WRITE = 0x2, /* struct mmiotrace_rw */ - MMIO_PROBE = 0x3, /* struct mmiotrace_map */ - MMIO_UNPROBE = 0x4, /* struct mmiotrace_map */ - MMIO_UNKNOWN_OP = 0x5, /* struct mmiotrace_rw */ + MMIO_READ = 0x1, /* struct mmiotrace_rw */ + MMIO_WRITE = 0x2, /* struct mmiotrace_rw */ + MMIO_PROBE = 0x3, /* struct mmiotrace_map */ + MMIO_UNPROBE = 0x4, /* struct mmiotrace_map */ + MMIO_UNKNOWN_OP = 0x5, /* struct mmiotrace_rw */ }; struct mmiotrace_rw { - resource_size_t phys; /* PCI address of register */ - unsigned long value; - unsigned long pc; /* optional program counter */ - int map_id; - unsigned char opcode; /* one of MMIO_{READ,WRITE,UNKNOWN_OP} */ - unsigned char width; /* size of register access in bytes */ + resource_size_t phys; /* PCI address of register */ + unsigned long value; + unsigned long pc; /* optional program counter */ + int map_id; + unsigned char opcode; /* one of MMIO_{READ,WRITE,UNKNOWN_OP} */ + unsigned char width; /* size of register access in bytes */ }; struct mmiotrace_map { - resource_size_t phys; /* base address in PCI space */ - unsigned long virt; /* base virtual address */ - unsigned long len; /* mapping size */ - int map_id; - unsigned char opcode; /* MMIO_PROBE or MMIO_UNPROBE */ + resource_size_t phys; /* base address in PCI space */ + unsigned long virt; /* base virtual address */ + unsigned long len; /* mapping size */ + int map_id; + unsigned char opcode; /* MMIO_PROBE or MMIO_UNPROBE */ }; /* in kernel/trace/trace_mmiotrace.c */ @@ -94,4 +110,4 @@ extern void mmio_trace_rw(struct mmiotra extern void mmio_trace_mapping(struct mmiotrace_map *map); extern int mmio_trace_printk(const char *fmt, va_list args); -#endif /* MMIOTRACE_H */ +#endif /* _LINUX_MMIOTRACE_H */ Index: linux-2.6-tip/include/linux/module.h =================================================================== --- linux-2.6-tip.orig/include/linux/module.h +++ linux-2.6-tip/include/linux/module.h @@ -78,18 +78,34 @@ void sort_extable(struct exception_table struct exception_table_entry *finish); void sort_main_extable(void); +/* + * Return a pointer to the current module, but only if within a module + */ #ifdef MODULE -#define MODULE_GENERIC_TABLE(gtype,name) \ -extern const struct gtype##_id __mod_##gtype##_table \ - __attribute__ ((unused, alias(__stringify(name)))) - extern struct module __this_module; #define THIS_MODULE (&__this_module) #else /* !MODULE */ -#define MODULE_GENERIC_TABLE(gtype,name) #define THIS_MODULE ((struct module *)0) #endif +/* + * Declare a module table + * - this suppresses "'name' defined but not used" warnings from the compiler + * as the table may not actually be used by the code within the module + */ +#ifdef MODULE +#define MODULE_GENERIC_TABLE(gtype,name) \ +extern const struct gtype##_id __mod_##gtype##_table \ + __attribute__ ((unused, alias(__stringify(name)))) +#define MODULE_STATIC_GENERIC_TABLE(gtype,name) \ +extern const struct gtype##_id __mod_##gtype##_table \ + __attribute__ ((unused, alias(__stringify(name)))) +#else +#define MODULE_GENERIC_TABLE(gtype,name) +#define MODULE_STATIC_GENERIC_TABLE(gtype,name) \ +static __typeof__((name)) name __attribute__((unused)); +#endif + /* Generic info of form tag = "info" */ #define MODULE_INFO(tag, info) __MODULE_INFO(tag, tag, info) @@ -139,6 +155,8 @@ extern struct module __this_module; #define MODULE_DEVICE_TABLE(type,name) \ MODULE_GENERIC_TABLE(type##_device,name) +#define MODULE_STATIC_DEVICE_TABLE(type,name) \ + MODULE_STATIC_GENERIC_TABLE(type##_device,name) /* Version of form [:][-]. Or for CVS/RCS ID version, everything but the number is stripped. Index: linux-2.6-tip/include/linux/mutex.h =================================================================== --- linux-2.6-tip.orig/include/linux/mutex.h +++ linux-2.6-tip/include/linux/mutex.h @@ -12,11 +12,82 @@ #include #include +#include #include #include #include +#ifdef CONFIG_DEBUG_LOCK_ALLOC +# define __DEP_MAP_MUTEX_INITIALIZER(lockname) \ + , .dep_map = { .name = #lockname } +#else +# define __DEP_MAP_MUTEX_INITIALIZER(lockname) +#endif + +#ifdef CONFIG_PREEMPT_RT + +#include + +struct mutex { + struct rt_mutex lock; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map dep_map; +#endif +}; + + +#define __MUTEX_INITIALIZER(mutexname) \ + { \ + .lock = __RT_MUTEX_INITIALIZER(mutexname.lock) \ + __DEP_MAP_MUTEX_INITIALIZER(mutexname) \ + } + +#define DEFINE_MUTEX(mutexname) \ + struct mutex mutexname = __MUTEX_INITIALIZER(mutexname) + +extern void +__mutex_init(struct mutex *lock, char *name, struct lock_class_key *key); + +extern void __lockfunc _mutex_lock(struct mutex *lock); +extern int __lockfunc _mutex_lock_interruptible(struct mutex *lock); +extern int __lockfunc _mutex_lock_killable(struct mutex *lock); +extern void __lockfunc _mutex_lock_nested(struct mutex *lock, int subclass); +extern int __lockfunc _mutex_lock_interruptible_nested(struct mutex *lock, int subclass); +extern int __lockfunc _mutex_lock_killable_nested(struct mutex *lock, int subclass); +extern int __lockfunc _mutex_trylock(struct mutex *lock); +extern void __lockfunc _mutex_unlock(struct mutex *lock); + +#define mutex_is_locked(l) rt_mutex_is_locked(&(l)->lock) +#define mutex_lock(l) _mutex_lock(l) +#define mutex_lock_interruptible(l) _mutex_lock_interruptible(l) +#define mutex_lock_killable(l) _mutex_lock_killable(l) +#define mutex_trylock(l) _mutex_trylock(l) +#define mutex_unlock(l) _mutex_unlock(l) +#define mutex_destroy(l) rt_mutex_destroy(&(l)->lock) + +#ifdef CONFIG_DEBUG_LOCK_ALLOC +# define mutex_lock_nested(l, s) _mutex_lock_nested(l, s) +# define mutex_lock_interruptible_nested(l, s) \ + _mutex_lock_interruptible_nested(l, s) +# define mutex_lock_killable_nested(l, s) \ + _mutex_lock_killable_nested(l, s) +#else +# define mutex_lock_nested(l, s) _mutex_lock(l) +# define mutex_lock_interruptible_nested(l, s) \ + _mutex_lock_interruptible(l) +# define mutex_lock_killable_nested(l, s) \ + _mutex_lock_killable(l) +#endif + +# define mutex_init(mutex) \ +do { \ + static struct lock_class_key __key; \ + \ + __mutex_init((mutex), #mutex, &__key); \ +} while (0) + +#else /* * Simple, straightforward mutexes with strict semantics: * @@ -50,8 +121,10 @@ struct mutex { atomic_t count; spinlock_t wait_lock; struct list_head wait_list; -#ifdef CONFIG_DEBUG_MUTEXES +#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP) struct thread_info *owner; +#endif +#ifdef CONFIG_DEBUG_MUTEXES const char *name; void *magic; #endif @@ -68,7 +141,6 @@ struct mutex_waiter { struct list_head list; struct task_struct *task; #ifdef CONFIG_DEBUG_MUTEXES - struct mutex *lock; void *magic; #endif }; @@ -86,13 +158,6 @@ do { \ # define mutex_destroy(mutex) do { } while (0) #endif -#ifdef CONFIG_DEBUG_LOCK_ALLOC -# define __DEP_MAP_MUTEX_INITIALIZER(lockname) \ - , .dep_map = { .name = #lockname } -#else -# define __DEP_MAP_MUTEX_INITIALIZER(lockname) -#endif - #define __MUTEX_INITIALIZER(lockname) \ { .count = ATOMIC_INIT(1) \ , .wait_lock = __SPIN_LOCK_UNLOCKED(lockname.wait_lock) \ @@ -151,3 +216,4 @@ extern int mutex_trylock(struct mutex *l extern void mutex_unlock(struct mutex *lock); #endif +#endif Index: linux-2.6-tip/include/linux/nubus.h =================================================================== --- linux-2.6-tip.orig/include/linux/nubus.h +++ linux-2.6-tip/include/linux/nubus.h @@ -237,6 +237,7 @@ struct nubus_dirent int mask; }; +#ifdef __KERNEL__ struct nubus_board { struct nubus_board* next; struct nubus_dev* first_dev; @@ -351,6 +352,7 @@ void nubus_get_rsrc_mem(void* dest, void nubus_get_rsrc_str(void* dest, const struct nubus_dirent *dirent, int maxlen); +#endif /* __KERNEL__ */ /* We'd like to get rid of this eventually. Only daynaport.c uses it now. */ static inline void *nubus_slot_addr(int slot) Index: linux-2.6-tip/include/linux/pci.h =================================================================== --- linux-2.6-tip.orig/include/linux/pci.h +++ linux-2.6-tip/include/linux/pci.h @@ -923,7 +923,10 @@ static inline struct pci_dev *pci_get_cl return NULL; } -#define pci_dev_present(ids) (0) +static inline int pci_dev_present(const struct pci_device_id *ids) +{ + return 0; +} #define no_pci_devices() (1) #define pci_dev_put(dev) do { } while (0) Index: linux-2.6-tip/include/linux/percpu.h =================================================================== --- linux-2.6-tip.orig/include/linux/percpu.h +++ linux-2.6-tip/include/linux/percpu.h @@ -8,38 +8,59 @@ #include +#ifndef PER_CPU_BASE_SECTION +#ifdef CONFIG_SMP +#define PER_CPU_BASE_SECTION ".data.percpu" +#else +#define PER_CPU_BASE_SECTION ".data" +#endif +#endif + #ifdef CONFIG_SMP -#define DEFINE_PER_CPU(type, name) \ - __attribute__((__section__(".data.percpu"))) \ - PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name #ifdef MODULE -#define SHARED_ALIGNED_SECTION ".data.percpu" +#define PER_CPU_SHARED_ALIGNED_SECTION "" #else -#define SHARED_ALIGNED_SECTION ".data.percpu.shared_aligned" +#define PER_CPU_SHARED_ALIGNED_SECTION ".shared_aligned" #endif +#define PER_CPU_FIRST_SECTION ".first" -#define DEFINE_PER_CPU_SHARED_ALIGNED(type, name) \ - __attribute__((__section__(SHARED_ALIGNED_SECTION))) \ - PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name \ - ____cacheline_aligned_in_smp +#else + +#define PER_CPU_SHARED_ALIGNED_SECTION "" +#define PER_CPU_FIRST_SECTION "" + +#endif -#define DEFINE_PER_CPU_PAGE_ALIGNED(type, name) \ - __attribute__((__section__(".data.percpu.page_aligned"))) \ +#define DEFINE_PER_CPU_SECTION(type, name, section) \ + __attribute__((__section__(PER_CPU_BASE_SECTION section))) \ PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name -#else + +#define DEFINE_PER_CPU_SPINLOCK(name, section) \ + __attribute__((__section__(PER_CPU_BASE_SECTION section))) \ + PER_CPU_ATTRIBUTES __DEFINE_SPINLOCK(per_cpu__lock_##name##_locked); + #define DEFINE_PER_CPU(type, name) \ - PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name + DEFINE_PER_CPU_SECTION(type, name, "") -#define DEFINE_PER_CPU_SHARED_ALIGNED(type, name) \ - DEFINE_PER_CPU(type, name) +#define DEFINE_PER_CPU_LOCKED(type, name) \ + DEFINE_PER_CPU_SPINLOCK(name, "") \ + DEFINE_PER_CPU_SECTION(type, name##_locked, "") -#define DEFINE_PER_CPU_PAGE_ALIGNED(type, name) \ - DEFINE_PER_CPU(type, name) -#endif +#define DEFINE_PER_CPU_SHARED_ALIGNED(type, name) \ + DEFINE_PER_CPU_SECTION(type, name, PER_CPU_SHARED_ALIGNED_SECTION) \ + ____cacheline_aligned_in_smp + +#define DEFINE_PER_CPU_PAGE_ALIGNED(type, name) \ + DEFINE_PER_CPU_SECTION(type, name, ".page_aligned") + +#define DEFINE_PER_CPU_FIRST(type, name) \ + DEFINE_PER_CPU_SECTION(type, name, PER_CPU_FIRST_SECTION) #define EXPORT_PER_CPU_SYMBOL(var) EXPORT_SYMBOL(per_cpu__##var) +#define EXPORT_PER_CPU_LOCKED_SYMBOL(var) EXPORT_SYMBOL(per_cpu__##var##_locked) #define EXPORT_PER_CPU_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu__##var) +#define EXPORT_PER_CPU_LOCKED_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu__##var##_locked) /* Enough to cover all DEFINE_PER_CPUs in kernel, including modules. */ #ifndef PERCPU_ENOUGH_ROOM @@ -63,6 +84,29 @@ &__get_cpu_var(var); })) #define put_cpu_var(var) preempt_enable() +/* + * Per-CPU data structures with an additional lock - useful for + * PREEMPT_RT code that wants to reschedule but also wants + * per-CPU data structures. + * + * 'cpu' gets updated with the CPU the task is currently executing on. + * + * NOTE: on normal !PREEMPT_RT kernels these per-CPU variables + * are the same as the normal per-CPU variables, so there no + * runtime overhead. + */ +#define get_cpu_var_locked(var, cpuptr) \ +(*({ \ + int __cpu = raw_smp_processor_id(); \ + \ + *(cpuptr) = __cpu; \ + spin_lock(&__get_cpu_lock(var, __cpu)); \ + &__get_cpu_var_locked(var, __cpu); \ +})) + +#define put_cpu_var_locked(var, cpu) \ + do { (void)cpu; spin_unlock(&__get_cpu_lock(var, cpu)); } while (0) + #ifdef CONFIG_SMP struct percpu_data { Index: linux-2.6-tip/include/linux/perf_counter.h =================================================================== --- /dev/null +++ linux-2.6-tip/include/linux/perf_counter.h @@ -0,0 +1,296 @@ +/* + * Performance counters: + * + * Copyright(C) 2008, Thomas Gleixner + * Copyright(C) 2008, Red Hat, Inc., Ingo Molnar + * + * Data type definitions, declarations, prototypes. + * + * Started by: Thomas Gleixner and Ingo Molnar + * + * For licencing details see kernel-base/COPYING + */ +#ifndef _LINUX_PERF_COUNTER_H +#define _LINUX_PERF_COUNTER_H + +#include +#include + +#ifdef CONFIG_PERF_COUNTERS +# include +#endif + +#include +#include +#include +#include +#include + +struct task_struct; + +/* + * User-space ABI bits: + */ + +/* + * Generalized performance counter event types, used by the hw_event.type + * parameter of the sys_perf_counter_open() syscall: + */ +enum hw_event_types { + /* + * Common hardware events, generalized by the kernel: + */ + PERF_COUNT_CPU_CYCLES = 0, + PERF_COUNT_INSTRUCTIONS = 1, + PERF_COUNT_CACHE_REFERENCES = 2, + PERF_COUNT_CACHE_MISSES = 3, + PERF_COUNT_BRANCH_INSTRUCTIONS = 4, + PERF_COUNT_BRANCH_MISSES = 5, + PERF_COUNT_BUS_CYCLES = 6, + + PERF_HW_EVENTS_MAX = 7, + + /* + * Special "software" counters provided by the kernel, even if + * the hardware does not support performance counters. These + * counters measure various physical and sw events of the + * kernel (and allow the profiling of them as well): + */ + PERF_COUNT_CPU_CLOCK = -1, + PERF_COUNT_TASK_CLOCK = -2, + PERF_COUNT_PAGE_FAULTS = -3, + PERF_COUNT_CONTEXT_SWITCHES = -4, + PERF_COUNT_CPU_MIGRATIONS = -5, + + PERF_SW_EVENTS_MIN = -6, +}; + +/* + * IRQ-notification data record type: + */ +enum perf_counter_record_type { + PERF_RECORD_SIMPLE = 0, + PERF_RECORD_IRQ = 1, + PERF_RECORD_GROUP = 2, +}; + +/* + * Hardware event to monitor via a performance monitoring counter: + */ +struct perf_counter_hw_event { + s64 type; + + u64 irq_period; + u32 record_type; + + u32 disabled : 1, /* off by default */ + nmi : 1, /* NMI sampling */ + raw : 1, /* raw event type */ + inherit : 1, /* children inherit it */ + pinned : 1, /* must always be on PMU */ + exclusive : 1, /* only group on PMU */ + exclude_user : 1, /* don't count user */ + exclude_kernel : 1, /* ditto kernel */ + exclude_hv : 1, /* ditto hypervisor */ + + __reserved_1 : 23; + + u64 __reserved_2; +}; + +/* + * Ioctls that can be done on a perf counter fd: + */ +#define PERF_COUNTER_IOC_ENABLE _IO('$', 0) +#define PERF_COUNTER_IOC_DISABLE _IO('$', 1) + +/* + * Kernel-internal data types: + */ + +/** + * struct hw_perf_counter - performance counter hardware details: + */ +struct hw_perf_counter { +#ifdef CONFIG_PERF_COUNTERS + u64 config; + unsigned long config_base; + unsigned long counter_base; + int nmi; + unsigned int idx; + atomic64_t prev_count; + u64 irq_period; + atomic64_t period_left; +#endif +}; + +/* + * Hardcoded buffer length limit for now, for IRQ-fed events: + */ +#define PERF_DATA_BUFLEN 2048 + +/** + * struct perf_data - performance counter IRQ data sampling ... + */ +struct perf_data { + int len; + int rd_idx; + int overrun; + u8 data[PERF_DATA_BUFLEN]; +}; + +struct perf_counter; + +/** + * struct hw_perf_counter_ops - performance counter hw ops + */ +struct hw_perf_counter_ops { + int (*enable) (struct perf_counter *counter); + void (*disable) (struct perf_counter *counter); + void (*read) (struct perf_counter *counter); +}; + +/** + * enum perf_counter_active_state - the states of a counter + */ +enum perf_counter_active_state { + PERF_COUNTER_STATE_ERROR = -2, + PERF_COUNTER_STATE_OFF = -1, + PERF_COUNTER_STATE_INACTIVE = 0, + PERF_COUNTER_STATE_ACTIVE = 1, +}; + +struct file; + +/** + * struct perf_counter - performance counter kernel representation: + */ +struct perf_counter { +#ifdef CONFIG_PERF_COUNTERS + struct list_head list_entry; + struct list_head sibling_list; + struct perf_counter *group_leader; + const struct hw_perf_counter_ops *hw_ops; + + enum perf_counter_active_state state; + enum perf_counter_active_state prev_state; + atomic64_t count; + + struct perf_counter_hw_event hw_event; + struct hw_perf_counter hw; + + struct perf_counter_context *ctx; + struct task_struct *task; + struct file *filp; + + struct perf_counter *parent; + struct list_head child_list; + + /* + * Protect attach/detach and child_list: + */ + struct mutex mutex; + + int oncpu; + int cpu; + + /* read() / irq related data */ + wait_queue_head_t waitq; + /* optional: for NMIs */ + int wakeup_pending; + struct perf_data *irqdata; + struct perf_data *usrdata; + struct perf_data data[2]; +#endif +}; + +/** + * struct perf_counter_context - counter context structure + * + * Used as a container for task counters and CPU counters as well: + */ +struct perf_counter_context { +#ifdef CONFIG_PERF_COUNTERS + /* + * Protect the states of the counters in the list, + * nr_active, and the list: + */ + raw_spinlock_t lock; + /* + * Protect the list of counters. Locking either mutex or lock + * is sufficient to ensure the list doesn't change; to change + * the list you need to lock both the mutex and the spinlock. + */ + struct mutex mutex; + + struct list_head counter_list; + int nr_counters; + int nr_active; + int is_active; + struct task_struct *task; +#endif +}; + +/** + * struct perf_counter_cpu_context - per cpu counter context structure + */ +struct perf_cpu_context { + struct perf_counter_context ctx; + struct perf_counter_context *task_ctx; + int active_oncpu; + int max_pertask; + int exclusive; +}; + +/* + * Set by architecture code: + */ +extern int perf_max_counters; + +#ifdef CONFIG_PERF_COUNTERS +extern const struct hw_perf_counter_ops * +hw_perf_counter_init(struct perf_counter *counter); + +extern void perf_counter_task_sched_in(struct task_struct *task, int cpu); +extern void perf_counter_task_sched_out(struct task_struct *task, int cpu); +extern void perf_counter_task_tick(struct task_struct *task, int cpu); +extern void perf_counter_init_task(struct task_struct *child); +extern void perf_counter_exit_task(struct task_struct *child); +extern void perf_counter_notify(struct pt_regs *regs); +extern void perf_counter_print_debug(void); +extern void perf_counter_unthrottle(void); +extern u64 hw_perf_save_disable(void); +extern void hw_perf_restore(u64 ctrl); +extern int perf_counter_task_disable(void); +extern int perf_counter_task_enable(void); +extern int hw_perf_group_sched_in(struct perf_counter *group_leader, + struct perf_cpu_context *cpuctx, + struct perf_counter_context *ctx, int cpu); + +/* + * Return 1 for a software counter, 0 for a hardware counter + */ +static inline int is_software_counter(struct perf_counter *counter) +{ + return !counter->hw_event.raw && counter->hw_event.type < 0; +} + +#else +static inline void +perf_counter_task_sched_in(struct task_struct *task, int cpu) { } +static inline void +perf_counter_task_sched_out(struct task_struct *task, int cpu) { } +static inline void +perf_counter_task_tick(struct task_struct *task, int cpu) { } +static inline void perf_counter_init_task(struct task_struct *child) { } +static inline void perf_counter_exit_task(struct task_struct *child) { } +static inline void perf_counter_notify(struct pt_regs *regs) { } +static inline void perf_counter_print_debug(void) { } +static inline void perf_counter_unthrottle(void) { } +static inline void hw_perf_restore(u64 ctrl) { } +static inline u64 hw_perf_save_disable(void) { return 0; } +static inline int perf_counter_task_disable(void) { return -EINVAL; } +static inline int perf_counter_task_enable(void) { return -EINVAL; } +#endif + +#endif /* _LINUX_PERF_COUNTER_H */ Index: linux-2.6-tip/include/linux/pipe_fs_i.h =================================================================== --- linux-2.6-tip.orig/include/linux/pipe_fs_i.h +++ linux-2.6-tip/include/linux/pipe_fs_i.h @@ -1,9 +1,9 @@ #ifndef _LINUX_PIPE_FS_I_H #define _LINUX_PIPE_FS_I_H -#define PIPEFS_MAGIC 0x50495045 +#define PIPEFS_MAGIC 0x50495045 -#define PIPE_BUFFERS (16) +#define PIPE_BUFFERS 64 #define PIPE_BUF_FLAG_LRU 0x01 /* page is on the LRU */ #define PIPE_BUF_FLAG_ATOMIC 0x02 /* was atomically mapped */ Index: linux-2.6-tip/include/linux/plist.h =================================================================== --- linux-2.6-tip.orig/include/linux/plist.h +++ linux-2.6-tip/include/linux/plist.h @@ -81,7 +81,7 @@ struct plist_head { struct list_head prio_list; struct list_head node_list; #ifdef CONFIG_DEBUG_PI_LIST - spinlock_t *lock; + raw_spinlock_t *lock; #endif }; @@ -96,16 +96,19 @@ struct plist_node { # define PLIST_HEAD_LOCK_INIT(_lock) #endif +#define _PLIST_HEAD_INIT(head) \ + .prio_list = LIST_HEAD_INIT((head).prio_list), \ + .node_list = LIST_HEAD_INIT((head).node_list) + /** * PLIST_HEAD_INIT - static struct plist_head initializer * @head: struct plist_head variable name - * @_lock: lock to initialize for this list + * @_lock: lock * to initialize for this list */ #define PLIST_HEAD_INIT(head, _lock) \ { \ - .prio_list = LIST_HEAD_INIT((head).prio_list), \ - .node_list = LIST_HEAD_INIT((head).node_list), \ - PLIST_HEAD_LOCK_INIT(&(_lock)) \ + _PLIST_HEAD_INIT(head), \ + PLIST_HEAD_LOCK_INIT(_lock) \ } /** @@ -116,7 +119,7 @@ struct plist_node { #define PLIST_NODE_INIT(node, __prio) \ { \ .prio = (__prio), \ - .plist = PLIST_HEAD_INIT((node).plist, NULL), \ + .plist = { _PLIST_HEAD_INIT((node).plist) }, \ } /** @@ -125,7 +128,7 @@ struct plist_node { * @lock: list spinlock, remembered for debugging */ static inline void -plist_head_init(struct plist_head *head, spinlock_t *lock) +plist_head_init(struct plist_head *head, raw_spinlock_t *lock) { INIT_LIST_HEAD(&head->prio_list); INIT_LIST_HEAD(&head->node_list); Index: linux-2.6-tip/include/linux/poison.h =================================================================== --- linux-2.6-tip.orig/include/linux/poison.h +++ linux-2.6-tip/include/linux/poison.h @@ -2,13 +2,25 @@ #define _LINUX_POISON_H /********** include/linux/list.h **********/ + +/* + * Architectures might want to move the poison pointer offset + * into some well-recognized area such as 0xdead000000000000, + * that is also not mappable by user-space exploits: + */ +#ifdef CONFIG_ILLEGAL_POINTER_VALUE +# define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL) +#else +# define POISON_POINTER_DELTA 0 +#endif + /* * These are non-NULL pointers that will result in page faults * under normal circumstances, used to verify that nobody uses * non-initialized list entries. */ -#define LIST_POISON1 ((void *) 0x00100100) -#define LIST_POISON2 ((void *) 0x00200200) +#define LIST_POISON1 ((void *) 0x00100100 + POISON_POINTER_DELTA) +#define LIST_POISON2 ((void *) 0x00200200 + POISON_POINTER_DELTA) /********** include/linux/timer.h **********/ /* Index: linux-2.6-tip/include/linux/prctl.h =================================================================== --- linux-2.6-tip.orig/include/linux/prctl.h +++ linux-2.6-tip/include/linux/prctl.h @@ -85,4 +85,7 @@ #define PR_SET_TIMERSLACK 29 #define PR_GET_TIMERSLACK 30 +#define PR_TASK_PERF_COUNTERS_DISABLE 31 +#define PR_TASK_PERF_COUNTERS_ENABLE 32 + #endif /* _LINUX_PRCTL_H */ Index: linux-2.6-tip/include/linux/reiserfs_fs.h =================================================================== --- linux-2.6-tip.orig/include/linux/reiserfs_fs.h +++ linux-2.6-tip/include/linux/reiserfs_fs.h @@ -28,8 +28,6 @@ #include #endif -struct fid; - /* * include/linux/reiser_fs.h * @@ -37,6 +35,33 @@ struct fid; * */ +/* ioctl's command */ +#define REISERFS_IOC_UNPACK _IOW(0xCD,1,long) +/* define following flags to be the same as in ext2, so that chattr(1), + lsattr(1) will work with us. */ +#define REISERFS_IOC_GETFLAGS FS_IOC_GETFLAGS +#define REISERFS_IOC_SETFLAGS FS_IOC_SETFLAGS +#define REISERFS_IOC_GETVERSION FS_IOC_GETVERSION +#define REISERFS_IOC_SETVERSION FS_IOC_SETVERSION + +#ifdef __KERNEL__ +/* the 32 bit compat definitions with int argument */ +#define REISERFS_IOC32_UNPACK _IOW(0xCD, 1, int) +#define REISERFS_IOC32_GETFLAGS FS_IOC32_GETFLAGS +#define REISERFS_IOC32_SETFLAGS FS_IOC32_SETFLAGS +#define REISERFS_IOC32_GETVERSION FS_IOC32_GETVERSION +#define REISERFS_IOC32_SETVERSION FS_IOC32_SETVERSION + +/* Locking primitives */ +/* Right now we are still falling back to (un)lock_kernel, but eventually that + would evolve into real per-fs locks */ +#define reiserfs_write_lock( sb ) lock_kernel() +#define reiserfs_write_unlock( sb ) unlock_kernel() + +/* xattr stuff */ +#define REISERFS_XATTR_DIR_SEM(s) (REISERFS_SB(s)->xattr_dir_sem) +struct fid; + /* in reading the #defines, it may help to understand that they employ the following abbreviations: @@ -698,6 +723,7 @@ static inline void cpu_key_k_offset_dec( /* object identifier for root dir */ #define REISERFS_ROOT_OBJECTID 2 #define REISERFS_ROOT_PARENT_OBJECTID 1 + extern struct reiserfs_key root_key; /* @@ -1540,7 +1566,6 @@ struct reiserfs_iget_args { /* FUNCTION DECLARATIONS */ /***************************************************************************/ -/*#ifdef __KERNEL__*/ #define get_journal_desc_magic(bh) (bh->b_data + bh->b_size - 12) #define journal_trans_half(blocksize) \ @@ -2178,29 +2203,6 @@ long reiserfs_compat_ioctl(struct file * unsigned int cmd, unsigned long arg); int reiserfs_unpack(struct inode *inode, struct file *filp); -/* ioctl's command */ -#define REISERFS_IOC_UNPACK _IOW(0xCD,1,long) -/* define following flags to be the same as in ext2, so that chattr(1), - lsattr(1) will work with us. */ -#define REISERFS_IOC_GETFLAGS FS_IOC_GETFLAGS -#define REISERFS_IOC_SETFLAGS FS_IOC_SETFLAGS -#define REISERFS_IOC_GETVERSION FS_IOC_GETVERSION -#define REISERFS_IOC_SETVERSION FS_IOC_SETVERSION - -/* the 32 bit compat definitions with int argument */ -#define REISERFS_IOC32_UNPACK _IOW(0xCD, 1, int) -#define REISERFS_IOC32_GETFLAGS FS_IOC32_GETFLAGS -#define REISERFS_IOC32_SETFLAGS FS_IOC32_SETFLAGS -#define REISERFS_IOC32_GETVERSION FS_IOC32_GETVERSION -#define REISERFS_IOC32_SETVERSION FS_IOC32_SETVERSION - -/* Locking primitives */ -/* Right now we are still falling back to (un)lock_kernel, but eventually that - would evolve into real per-fs locks */ -#define reiserfs_write_lock( sb ) lock_kernel() -#define reiserfs_write_unlock( sb ) unlock_kernel() - -/* xattr stuff */ -#define REISERFS_XATTR_DIR_SEM(s) (REISERFS_SB(s)->xattr_dir_sem) +#endif /* __KERNEL__ */ #endif /* _LINUX_REISER_FS_H */ Index: linux-2.6-tip/include/linux/ring_buffer.h =================================================================== --- linux-2.6-tip.orig/include/linux/ring_buffer.h +++ linux-2.6-tip/include/linux/ring_buffer.h @@ -8,7 +8,7 @@ struct ring_buffer; struct ring_buffer_iter; /* - * Don't reference this struct directly, use functions below. + * Don't refer to this struct directly, use functions below. */ struct ring_buffer_event { u32 type:2, len:3, time_delta:27; @@ -74,13 +74,10 @@ void ring_buffer_free(struct ring_buffer int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size); -struct ring_buffer_event * -ring_buffer_lock_reserve(struct ring_buffer *buffer, - unsigned long length, - unsigned long *flags); +struct ring_buffer_event *ring_buffer_lock_reserve(struct ring_buffer *buffer, + unsigned long length); int ring_buffer_unlock_commit(struct ring_buffer *buffer, - struct ring_buffer_event *event, - unsigned long flags); + struct ring_buffer_event *event); int ring_buffer_write(struct ring_buffer *buffer, unsigned long length, void *data); @@ -124,9 +121,20 @@ unsigned long ring_buffer_overrun_cpu(st u64 ring_buffer_time_stamp(int cpu); void ring_buffer_normalize_time_stamp(int cpu, u64 *ts); +/* + * The below functions are fine to use outside the tracing facility. + */ +#ifdef CONFIG_RING_BUFFER void tracing_on(void); void tracing_off(void); void tracing_off_permanent(void); +int tracing_is_on(void); +#else +static inline void tracing_on(void) { } +static inline void tracing_off(void) { } +static inline void tracing_off_permanent(void) { } +static inline int tracing_is_on(void) { return 0; } +#endif void *ring_buffer_alloc_read_page(struct ring_buffer *buffer); void ring_buffer_free_read_page(struct ring_buffer *buffer, void *data); Index: linux-2.6-tip/include/linux/sched.h =================================================================== --- linux-2.6-tip.orig/include/linux/sched.h +++ linux-2.6-tip/include/linux/sched.h @@ -71,6 +71,7 @@ struct sched_param { #include #include #include +#include #include #include #include @@ -91,6 +92,28 @@ struct sched_param { #include +#ifdef CONFIG_PREEMPT +extern int kernel_preemption; +#else +# define kernel_preemption 0 +#endif +#ifdef CONFIG_PREEMPT_VOLUNTARY +extern int voluntary_preemption; +#else +# define voluntary_preemption 0 +#endif +#ifdef CONFIG_PREEMPT_SOFTIRQS +extern int softirq_preemption; +#else +# define softirq_preemption 0 +#endif + +#ifdef CONFIG_PREEMPT_HARDIRQS +extern int hardirq_preemption; +#else +# define hardirq_preemption 0 +#endif + struct mem_cgroup; struct exec_domain; struct futex_pi_state; @@ -136,6 +159,10 @@ extern unsigned long nr_running(void); extern unsigned long nr_uninterruptible(void); extern unsigned long nr_active(void); extern unsigned long nr_iowait(void); +extern u64 cpu_nr_switches(int cpu); +extern u64 cpu_nr_migrations(int cpu); + +extern unsigned long get_parent_ip(unsigned long addr); struct seq_file; struct cfs_rq; @@ -160,6 +187,7 @@ print_cfs_rq(struct seq_file *m, int cpu #endif extern unsigned long long time_sync_thresh; +extern struct semaphore kernel_sem; /* * Task state bitmask. NOTE! These bits are also @@ -172,16 +200,17 @@ extern unsigned long long time_sync_thre * mistake. */ #define TASK_RUNNING 0 -#define TASK_INTERRUPTIBLE 1 -#define TASK_UNINTERRUPTIBLE 2 -#define __TASK_STOPPED 4 -#define __TASK_TRACED 8 +#define TASK_RUNNING_MUTEX 1 +#define TASK_INTERRUPTIBLE 2 +#define TASK_UNINTERRUPTIBLE 4 +#define __TASK_STOPPED 8 +#define __TASK_TRACED 16 /* in tsk->exit_state */ -#define EXIT_ZOMBIE 16 -#define EXIT_DEAD 32 +#define EXIT_ZOMBIE 32 +#define EXIT_DEAD 64 /* in tsk->state again */ -#define TASK_DEAD 64 -#define TASK_WAKEKILL 128 +#define TASK_DEAD 128 +#define TASK_WAKEKILL 256 /* Convenience macros for the sake of set_task_state */ #define TASK_KILLABLE (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE) @@ -193,7 +222,8 @@ extern unsigned long long time_sync_thre #define TASK_ALL (TASK_NORMAL | __TASK_STOPPED | __TASK_TRACED) /* get_task_state() */ -#define TASK_REPORT (TASK_RUNNING | TASK_INTERRUPTIBLE | \ +#define TASK_REPORT (TASK_RUNNING | TASK_RUNNING_MUTEX | \ + TASK_INTERRUPTIBLE | \ TASK_UNINTERRUPTIBLE | __TASK_STOPPED | \ __TASK_TRACED) @@ -209,6 +239,28 @@ extern unsigned long long time_sync_thre #define set_task_state(tsk, state_value) \ set_mb((tsk)->state, (state_value)) +// #define PREEMPT_DIRECT + +#ifdef CONFIG_X86_LOCAL_APIC +extern void nmi_show_all_regs(void); +#else +# define nmi_show_all_regs() do { } while (0) +#endif + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct exec_domain; + /* * set_current_state() includes a barrier so that the write of current->state * is correctly serialised wrt the caller's subsequent test of whether to @@ -289,6 +341,12 @@ extern void scheduler_tick(void); extern void sched_show_task(struct task_struct *p); +#ifdef CONFIG_GENERIC_HARDIRQS +extern int debug_direct_keyboard; +#else +# define debug_direct_keyboard 0 +#endif + #ifdef CONFIG_DETECT_SOFTLOCKUP extern void softlockup_tick(void); extern void touch_softlockup_watchdog(void); @@ -297,17 +355,11 @@ extern int proc_dosoftlockup_thresh(stru struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); extern unsigned int softlockup_panic; -extern unsigned long sysctl_hung_task_check_count; -extern unsigned long sysctl_hung_task_timeout_secs; -extern unsigned long sysctl_hung_task_warnings; extern int softlockup_thresh; #else static inline void softlockup_tick(void) { } -static inline void spawn_softlockup_task(void) -{ -} static inline void touch_softlockup_watchdog(void) { } @@ -316,6 +368,15 @@ static inline void touch_all_softlockup_ } #endif +#ifdef CONFIG_DETECT_HUNG_TASK +extern unsigned int sysctl_hung_task_panic; +extern unsigned long sysctl_hung_task_check_count; +extern unsigned long sysctl_hung_task_timeout_secs; +extern unsigned long sysctl_hung_task_warnings; +extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write, + struct file *filp, void __user *buffer, + size_t *lenp, loff_t *ppos); +#endif /* Attach to any functions which should be ignored in wchan output. */ #define __sched __attribute__((__section__(".sched.text"))) @@ -331,7 +392,14 @@ extern signed long schedule_timeout(sign extern signed long schedule_timeout_interruptible(signed long timeout); extern signed long schedule_timeout_killable(signed long timeout); extern signed long schedule_timeout_uninterruptible(signed long timeout); +asmlinkage void __schedule(void); asmlinkage void schedule(void); +extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner); +/* + * This one can be called with interrupts disabled, only + * to be used by lowlevel arch code! + */ +asmlinkage void __sched __schedule(void); struct nsproxy; struct user_namespace; @@ -479,7 +547,7 @@ struct task_cputime { struct thread_group_cputimer { struct task_cputime cputime; int running; - spinlock_t lock; + raw_spinlock_t lock; }; /* @@ -998,6 +1066,7 @@ struct sched_class { struct rq *busiest, struct sched_domain *sd, enum cpu_idle_type idle); void (*pre_schedule) (struct rq *this_rq, struct task_struct *task); + int (*needs_post_schedule) (struct rq *this_rq); void (*post_schedule) (struct rq *this_rq); void (*task_wake_up) (struct rq *this_rq, struct task_struct *task); @@ -1052,6 +1121,11 @@ struct sched_entity { u64 last_wakeup; u64 avg_overlap; + u64 nr_migrations; + + u64 start_runtime; + u64 avg_wakeup; + #ifdef CONFIG_SCHEDSTATS u64 wait_start; u64 wait_max; @@ -1067,7 +1141,6 @@ struct sched_entity { u64 exec_max; u64 slice_max; - u64 nr_migrations; u64 nr_migrations_cold; u64 nr_failed_migrations_affine; u64 nr_failed_migrations_running; @@ -1121,10 +1194,8 @@ struct task_struct { int lock_depth; /* BKL lock depth */ #ifdef CONFIG_SMP -#ifdef __ARCH_WANT_UNLOCKED_CTXSW int oncpu; #endif -#endif int prio, static_prio, normal_prio; unsigned int rt_priority; @@ -1164,6 +1235,7 @@ struct task_struct { #endif struct list_head tasks; + struct plist_node pushable_tasks; struct mm_struct *mm, *active_mm; @@ -1178,10 +1250,9 @@ struct task_struct { pid_t pid; pid_t tgid; -#ifdef CONFIG_CC_STACKPROTECTOR /* Canary value for the -fstack-protector gcc feature */ unsigned long stack_canary; -#endif + /* * pointers to (original) parent process, youngest child, younger sibling, * older sibling, respectively. (p->father can be replaced with @@ -1237,6 +1308,8 @@ struct task_struct { struct task_cputime cputime_expires; struct list_head cpu_timers[3]; + struct task_struct* posix_timer_list; + /* process credentials */ const struct cred *real_cred; /* objective and real subjective task * credentials (COW) */ @@ -1254,9 +1327,8 @@ struct task_struct { /* ipc stuff */ struct sysv_sem sysvsem; #endif -#ifdef CONFIG_DETECT_SOFTLOCKUP +#ifdef CONFIG_DETECT_HUNG_TASK /* hung task detection */ - unsigned long last_switch_timestamp; unsigned long last_switch_count; #endif /* CPU-specific state of this task */ @@ -1294,7 +1366,7 @@ struct task_struct { spinlock_t alloc_lock; /* Protection of the PI data structures: */ - spinlock_t pi_lock; + raw_spinlock_t pi_lock; #ifdef CONFIG_RT_MUTEXES /* PI waiters blocked on a rt_mutex held by this task */ @@ -1307,6 +1379,7 @@ struct task_struct { /* mutex deadlock detection */ struct mutex_waiter *blocked_on; #endif + int pagefault_disabled; #ifdef CONFIG_TRACE_IRQFLAGS unsigned int irq_events; int hardirqs_enabled; @@ -1328,6 +1401,27 @@ struct task_struct { int lockdep_depth; unsigned int lockdep_recursion; struct held_lock held_locks[MAX_LOCK_DEPTH]; + gfp_t lockdep_reclaim_gfp; +#endif + +/* realtime bits */ + +#define MAX_PREEMPT_TRACE 25 +#define MAX_LOCK_STACK MAX_PREEMPT_TRACE +#ifdef CONFIG_DEBUG_PREEMPT + atomic_t lock_count; +# ifdef CONFIG_PREEMPT_RT + struct rt_mutex *owned_lock[MAX_LOCK_STACK]; +# endif +#endif +#ifdef CONFIG_DETECT_SOFTLOCKUP + unsigned long softlockup_count; /* Count to keep track how long the + * thread is in the kernel without + * sleeping. + */ +#endif +#ifdef CONFIG_DEBUG_RT_MUTEXES + void *last_kernel_lock; #endif /* journalling filesystem info */ @@ -1370,6 +1464,7 @@ struct task_struct { struct list_head pi_state_list; struct futex_pi_state *pi_state_cache; #endif + struct perf_counter_context perf_counter_ctx; #ifdef CONFIG_NUMA struct mempolicy *mempolicy; short il_next; @@ -1417,8 +1512,21 @@ struct task_struct { /* state flags for use by tracers */ unsigned long trace; #endif +#ifdef CONFIG_PREEMPT_RT + /* + * Temporary hack, until we find a solution to + * handle printk in atomic operations. + */ + int in_printk; +#endif }; +#ifdef CONFIG_PREEMPT_RT +# define set_printk_might_sleep(x) do { current->in_printk = x; } while(0) +#else +# define set_printk_might_sleep(x) do { } while(0) +#endif + /* * Priority of a process goes from 0..MAX_PRIO-1, valid RT * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH @@ -1583,6 +1691,15 @@ extern struct pid *cad_pid; extern void free_task(struct task_struct *tsk); #define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0) +#ifdef CONFIG_PREEMPT_RT +extern void __put_task_struct_cb(struct rcu_head *rhp); + +static inline void put_task_struct(struct task_struct *t) +{ + if (atomic_dec_and_test(&t->usage)) + call_rcu(&t->rcu, __put_task_struct_cb); +} +#else extern void __put_task_struct(struct task_struct *t); static inline void put_task_struct(struct task_struct *t) @@ -1590,6 +1707,7 @@ static inline void put_task_struct(struc if (atomic_dec_and_test(&t->usage)) __put_task_struct(t); } +#endif extern cputime_t task_utime(struct task_struct *p); extern cputime_t task_stime(struct task_struct *p); @@ -1604,13 +1722,16 @@ extern cputime_t task_gtime(struct task_ #define PF_EXITING 0x00000004 /* getting shut down */ #define PF_EXITPIDONE 0x00000008 /* pi exit done on shut down */ #define PF_VCPU 0x00000010 /* I'm a virtual CPU */ +#define PF_NOSCHED 0x00000020 /* Userspace does not expect scheduling */ #define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */ +#define PF_HARDIRQ 0x00000080 /* hardirq context */ #define PF_SUPERPRIV 0x00000100 /* used super-user privileges */ #define PF_DUMPCORE 0x00000200 /* dumped core */ #define PF_SIGNALED 0x00000400 /* killed by a signal */ #define PF_MEMALLOC 0x00000800 /* Allocating memory */ #define PF_FLUSHER 0x00001000 /* responsible for disk writeback */ #define PF_USED_MATH 0x00002000 /* if unset the fpu must be initialized before use */ +#define PF_KMAP 0x00004000 /* this context has a kmap */ #define PF_NOFREEZE 0x00008000 /* this thread should not be frozen */ #define PF_FROZEN 0x00010000 /* frozen for system suspend */ #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ @@ -1623,6 +1744,7 @@ extern cputime_t task_gtime(struct task_ #define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */ #define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */ #define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */ +#define PF_SOFTIRQ 0x08000000 /* softirq context */ #define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */ #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */ #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezeable */ @@ -1751,9 +1873,14 @@ int sched_rt_handler(struct ctl_table *t extern unsigned int sysctl_sched_compat_yield; +extern void task_setprio(struct task_struct *p, int prio); + #ifdef CONFIG_RT_MUTEXES extern int rt_mutex_getprio(struct task_struct *p); -extern void rt_mutex_setprio(struct task_struct *p, int prio); +static inline void rt_mutex_setprio(struct task_struct *p, int prio) +{ + task_setprio(p, prio); +} extern void rt_mutex_adjust_pi(struct task_struct *p); #else static inline int rt_mutex_getprio(struct task_struct *p) @@ -1777,6 +1904,7 @@ extern struct task_struct *curr_task(int extern void set_curr_task(int cpu, struct task_struct *p); void yield(void); +void __yield(void); /* * The default (Linux) execution domain. @@ -1844,6 +1972,9 @@ extern void do_timer(unsigned long ticks extern int wake_up_state(struct task_struct *tsk, unsigned int state); extern int wake_up_process(struct task_struct *tsk); +extern int wake_up_process_mutex(struct task_struct * tsk); +extern int wake_up_process_sync(struct task_struct * tsk); +extern int wake_up_process_mutex_sync(struct task_struct * tsk); extern void wake_up_new_task(struct task_struct *tsk, unsigned long clone_flags); #ifdef CONFIG_SMP @@ -1931,12 +2062,20 @@ extern struct mm_struct * mm_alloc(void) /* mmdrop drops the mm and the page tables */ extern void __mmdrop(struct mm_struct *); +extern void __mmdrop_delayed(struct mm_struct *); + static inline void mmdrop(struct mm_struct * mm) { if (unlikely(atomic_dec_and_test(&mm->mm_count))) __mmdrop(mm); } +static inline void mmdrop_delayed(struct mm_struct * mm) +{ + if (atomic_dec_and_test(&mm->mm_count)) + __mmdrop_delayed(mm); +} + /* mmput gets rid of the mappings and all user-space */ extern void mmput(struct mm_struct *); /* Grab a reference to a task's mm, if it is not already going away */ @@ -2087,6 +2226,19 @@ static inline int object_is_on_stack(voi extern void thread_info_cache_init(void); +#ifdef CONFIG_DEBUG_STACK_USAGE +static inline unsigned long stack_not_used(struct task_struct *p) +{ + unsigned long *n = end_of_stack(p); + + do { /* Skip over canary */ + n++; + } while (!*n); + + return (unsigned long)n - (unsigned long)end_of_stack(p); +} +#endif + /* set thread flags in other task's structures * - see asm/thread_info.h for TIF_xxxx flags available */ @@ -2176,19 +2328,27 @@ static inline int cond_resched(void) return _cond_resched(); } #endif -extern int cond_resched_lock(spinlock_t * lock); +extern int __cond_resched_raw_spinlock(raw_spinlock_t *lock); +extern int __cond_resched_spinlock(spinlock_t *spinlock); + +#define cond_resched_lock(lock) \ + PICK_SPIN_OP_RET(__cond_resched_raw_spinlock, __cond_resched_spinlock,\ + lock) + extern int cond_resched_softirq(void); static inline int cond_resched_bkl(void) { return _cond_resched(); } +extern int cond_resched_softirq_context(void); +extern int cond_resched_hardirq_context(void); /* * Does a critical section need to be broken due to another * task waiting?: (technically does not depend on CONFIG_PREEMPT, * but a general need for low latency) */ -static inline int spin_needbreak(spinlock_t *lock) +static inline int __raw_spin_needbreak(raw_spinlock_t *lock) { #ifdef CONFIG_PREEMPT return spin_is_contended(lock); @@ -2214,6 +2374,37 @@ static inline void thread_group_cputime_ { } +#ifdef CONFIG_PREEMPT_RT +static inline int __spin_needbreak(spinlock_t *lock) +{ + return lock->break_lock; +} +#else +static inline int __spin_needbreak(spinlock_t *lock) +{ + /* should never be call outside of RT */ + BUG(); + return 0; +} +#endif + +#define spin_needbreak(lock) \ + PICK_SPIN_OP_RET(__raw_spin_needbreak, __spin_needbreak, lock) + +static inline int softirq_need_resched(void) +{ + if (softirq_preemption && (current->flags & PF_SOFTIRQ)) + return need_resched(); + return 0; +} + +static inline int hardirq_need_resched(void) +{ + if (hardirq_preemption && (current->flags & PF_HARDIRQ)) + return need_resched(); + return 0; +} + /* * Reevaluate whether the task has signals pending delivery. * Wake the task if so. @@ -2336,6 +2527,13 @@ static inline void inc_syscw(struct task #define TASK_SIZE_OF(tsk) TASK_SIZE #endif +/* + * Call the function if the target task is executing on a CPU right now: + */ +extern void task_oncpu_function_call(struct task_struct *p, + void (*func) (void *info), void *info); + + #ifdef CONFIG_MM_OWNER extern void mm_update_next_owner(struct mm_struct *mm); extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p); @@ -2351,6 +2549,13 @@ static inline void mm_init_owner(struct #define TASK_STATE_TO_CHAR_STR "RSDTtZX" +#ifdef CONFIG_SMP +static inline int task_is_current(struct task_struct *task) +{ + return task->oncpu; +} +#endif + #endif /* __KERNEL__ */ #endif Index: linux-2.6-tip/include/linux/skbuff.h =================================================================== --- linux-2.6-tip.orig/include/linux/skbuff.h +++ linux-2.6-tip/include/linux/skbuff.h @@ -15,6 +15,7 @@ #define _LINUX_SKBUFF_H #include +#include #include #include #include @@ -100,6 +101,9 @@ struct pipe_inode_info; #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE) struct nf_conntrack { atomic_t use; +#ifdef CONFIG_PREEMPT_RT + struct rcu_head rcu; +#endif }; #endif @@ -295,16 +299,18 @@ struct sk_buff { }; }; __u32 priority; - __u8 local_df:1, - cloned:1, - ip_summed:2, - nohdr:1, - nfctinfo:3; - __u8 pkt_type:3, - fclone:2, - ipvs_property:1, - peeked:1, - nf_trace:1; + kmemcheck_define_bitfield(flags1, { + __u8 local_df:1, + cloned:1, + ip_summed:2, + nohdr:1, + nfctinfo:3; + __u8 pkt_type:3, + fclone:2, + ipvs_property:1, + peeked:1, + nf_trace:1; + }); __be16 protocol; void (*destructor)(struct sk_buff *skb); @@ -324,13 +330,17 @@ struct sk_buff { __u16 tc_verd; /* traffic control verdict */ #endif #endif + + kmemcheck_define_bitfield(flags2, { #ifdef CONFIG_IPV6_NDISC_NODETYPE - __u8 ndisc_nodetype:2; + __u8 ndisc_nodetype:2; #endif #if defined(CONFIG_MAC80211) || defined(CONFIG_MAC80211_MODULE) - __u8 do_not_encrypt:1; - __u8 requeue:1; + __u8 do_not_encrypt:1; + __u8 requeue:1; #endif + }); + /* 0/13/14 bit hole */ #ifdef CONFIG_NET_DMA Index: linux-2.6-tip/include/linux/slab.h =================================================================== --- linux-2.6-tip.orig/include/linux/slab.h +++ linux-2.6-tip/include/linux/slab.h @@ -62,6 +62,13 @@ # define SLAB_DEBUG_OBJECTS 0x00000000UL #endif +/* Don't track use of uninitialized memory */ +#ifdef CONFIG_KMEMCHECK +# define SLAB_NOTRACK 0x00800000UL +#else +# define SLAB_NOTRACK 0x00000000UL +#endif + /* The following flags affect the page allocator grouping pages by mobility */ #define SLAB_RECLAIM_ACCOUNT 0x00020000UL /* Objects are reclaimable */ #define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */ Index: linux-2.6-tip/include/linux/slab_def.h =================================================================== --- linux-2.6-tip.orig/include/linux/slab_def.h +++ linux-2.6-tip/include/linux/slab_def.h @@ -14,6 +14,88 @@ #include /* kmalloc_sizes.h needs PAGE_SIZE */ #include /* kmalloc_sizes.h needs L1_CACHE_BYTES */ #include +#include + +/* + * struct kmem_cache + * + * manages a cache. + */ + +struct kmem_cache { +/* 1) per-cpu data, touched during every alloc/free */ + struct array_cache *array[NR_CPUS]; +/* 2) Cache tunables. Protected by cache_chain_mutex */ + unsigned int batchcount; + unsigned int limit; + unsigned int shared; + + unsigned int buffer_size; + u32 reciprocal_buffer_size; +/* 3) touched by every alloc & free from the backend */ + + unsigned int flags; /* constant flags */ + unsigned int num; /* # of objs per slab */ + +/* 4) cache_grow/shrink */ + /* order of pgs per slab (2^n) */ + unsigned int gfporder; + + /* force GFP flags, e.g. GFP_DMA */ + gfp_t gfpflags; + + size_t colour; /* cache colouring range */ + unsigned int colour_off; /* colour offset */ + struct kmem_cache *slabp_cache; + unsigned int slab_size; + unsigned int dflags; /* dynamic flags */ + + /* constructor func */ + void (*ctor)(void *obj); + +/* 5) cache creation/removal */ + const char *name; + struct list_head next; + +/* 6) statistics */ +#ifdef CONFIG_DEBUG_SLAB + unsigned long num_active; + unsigned long num_allocations; + unsigned long high_mark; + unsigned long grown; + unsigned long reaped; + unsigned long errors; + unsigned long max_freeable; + unsigned long node_allocs; + unsigned long node_frees; + unsigned long node_overflow; + atomic_t allochit; + atomic_t allocmiss; + atomic_t freehit; + atomic_t freemiss; + + /* + * If debugging is enabled, then the allocator can add additional + * fields and/or padding to every object. buffer_size contains the total + * object size including these internal fields, the following two + * variables contain the offset to the user object and its size. + */ + int obj_offset; + int obj_size; +#endif /* CONFIG_DEBUG_SLAB */ + + /* + * We put nodelists[] at the end of kmem_cache, because we want to size + * this array to nr_node_ids slots instead of MAX_NUMNODES + * (see kmem_cache_init()) + * We still use [MAX_NUMNODES] and not [1] or [0] because cache_cache + * is statically defined, so we reserve the max number of nodes. + */ + struct kmem_list3 *nodelists[MAX_NUMNODES]; + /* + * Do not add fields after nodelists[] + */ +}; /* Size description struct for general caches. */ struct cache_sizes { @@ -28,8 +110,26 @@ extern struct cache_sizes malloc_sizes[] void *kmem_cache_alloc(struct kmem_cache *, gfp_t); void *__kmalloc(size_t size, gfp_t flags); -static inline void *kmalloc(size_t size, gfp_t flags) +#ifdef CONFIG_KMEMTRACE +extern void *kmem_cache_alloc_notrace(struct kmem_cache *cachep, gfp_t flags); +extern size_t slab_buffer_size(struct kmem_cache *cachep); +#else +static __always_inline void * +kmem_cache_alloc_notrace(struct kmem_cache *cachep, gfp_t flags) +{ + return kmem_cache_alloc(cachep, flags); +} +static inline size_t slab_buffer_size(struct kmem_cache *cachep) +{ + return 0; +} +#endif + +static __always_inline void *kmalloc(size_t size, gfp_t flags) { + struct kmem_cache *cachep; + void *ret; + if (__builtin_constant_p(size)) { int i = 0; @@ -47,10 +147,17 @@ static inline void *kmalloc(size_t size, found: #ifdef CONFIG_ZONE_DMA if (flags & GFP_DMA) - return kmem_cache_alloc(malloc_sizes[i].cs_dmacachep, - flags); + cachep = malloc_sizes[i].cs_dmacachep; + else #endif - return kmem_cache_alloc(malloc_sizes[i].cs_cachep, flags); + cachep = malloc_sizes[i].cs_cachep; + + ret = kmem_cache_alloc_notrace(cachep, flags); + + kmemtrace_mark_alloc(KMEMTRACE_TYPE_KMALLOC, _THIS_IP_, ret, + size, slab_buffer_size(cachep), flags); + + return ret; } return __kmalloc(size, flags); } @@ -59,8 +166,25 @@ found: extern void *__kmalloc_node(size_t size, gfp_t flags, int node); extern void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node); -static inline void *kmalloc_node(size_t size, gfp_t flags, int node) +#ifdef CONFIG_KMEMTRACE +extern void *kmem_cache_alloc_node_notrace(struct kmem_cache *cachep, + gfp_t flags, + int nodeid); +#else +static __always_inline void * +kmem_cache_alloc_node_notrace(struct kmem_cache *cachep, + gfp_t flags, + int nodeid) +{ + return kmem_cache_alloc_node(cachep, flags, nodeid); +} +#endif + +static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node) { + struct kmem_cache *cachep; + void *ret; + if (__builtin_constant_p(size)) { int i = 0; @@ -78,11 +202,18 @@ static inline void *kmalloc_node(size_t found: #ifdef CONFIG_ZONE_DMA if (flags & GFP_DMA) - return kmem_cache_alloc_node(malloc_sizes[i].cs_dmacachep, - flags, node); + cachep = malloc_sizes[i].cs_dmacachep; + else #endif - return kmem_cache_alloc_node(malloc_sizes[i].cs_cachep, - flags, node); + cachep = malloc_sizes[i].cs_cachep; + + ret = kmem_cache_alloc_node_notrace(cachep, flags, node); + + kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_KMALLOC, _THIS_IP_, + ret, size, slab_buffer_size(cachep), + flags, node); + + return ret; } return __kmalloc_node(size, flags, node); } Index: linux-2.6-tip/include/linux/slob_def.h =================================================================== --- linux-2.6-tip.orig/include/linux/slob_def.h +++ linux-2.6-tip/include/linux/slob_def.h @@ -3,14 +3,15 @@ void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node); -static inline void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) +static __always_inline void *kmem_cache_alloc(struct kmem_cache *cachep, + gfp_t flags) { return kmem_cache_alloc_node(cachep, flags, -1); } void *__kmalloc_node(size_t size, gfp_t flags, int node); -static inline void *kmalloc_node(size_t size, gfp_t flags, int node) +static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node) { return __kmalloc_node(size, flags, node); } @@ -23,12 +24,12 @@ static inline void *kmalloc_node(size_t * kmalloc is the normal method of allocating memory * in the kernel. */ -static inline void *kmalloc(size_t size, gfp_t flags) +static __always_inline void *kmalloc(size_t size, gfp_t flags) { return __kmalloc_node(size, flags, -1); } -static inline void *__kmalloc(size_t size, gfp_t flags) +static __always_inline void *__kmalloc(size_t size, gfp_t flags) { return kmalloc(size, flags); } Index: linux-2.6-tip/include/linux/slub_def.h =================================================================== --- linux-2.6-tip.orig/include/linux/slub_def.h +++ linux-2.6-tip/include/linux/slub_def.h @@ -10,6 +10,7 @@ #include #include #include +#include enum stat_item { ALLOC_FASTPATH, /* Allocation from cpu slab */ @@ -121,10 +122,23 @@ struct kmem_cache { #define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE) /* + * Maximum kmalloc object size handled by SLUB. Larger object allocations + * are passed through to the page allocator. The page allocator "fastpath" + * is relatively slow so we need this value sufficiently high so that + * performance critical objects are allocated through the SLUB fastpath. + * + * This should be dropped to PAGE_SIZE / 2 once the page allocator + * "fastpath" becomes competitive with the slab allocator fastpaths. + */ +#define SLUB_MAX_SIZE (PAGE_SIZE) + +#define SLUB_PAGE_SHIFT (PAGE_SHIFT + 1) + +/* * We keep the general caches in an array of slab caches that are used for * 2^x bytes of allocations. */ -extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1]; +extern struct kmem_cache kmalloc_caches[SLUB_PAGE_SHIFT]; /* * Sorry that the following has to be that ugly but some versions of GCC @@ -204,15 +218,33 @@ static __always_inline struct kmem_cache void *kmem_cache_alloc(struct kmem_cache *, gfp_t); void *__kmalloc(size_t size, gfp_t flags); +#ifdef CONFIG_KMEMTRACE +extern void *kmem_cache_alloc_notrace(struct kmem_cache *s, gfp_t gfpflags); +#else +static __always_inline void * +kmem_cache_alloc_notrace(struct kmem_cache *s, gfp_t gfpflags) +{ + return kmem_cache_alloc(s, gfpflags); +} +#endif + static __always_inline void *kmalloc_large(size_t size, gfp_t flags) { - return (void *)__get_free_pages(flags | __GFP_COMP, get_order(size)); + unsigned int order = get_order(size); + void *ret = (void *) __get_free_pages(flags | __GFP_COMP, order); + + kmemtrace_mark_alloc(KMEMTRACE_TYPE_KMALLOC, _THIS_IP_, ret, + size, PAGE_SIZE << order, flags); + + return ret; } static __always_inline void *kmalloc(size_t size, gfp_t flags) { + void *ret; + if (__builtin_constant_p(size)) { - if (size > PAGE_SIZE) + if (size > SLUB_MAX_SIZE) return kmalloc_large(size, flags); if (!(flags & SLUB_DMA)) { @@ -221,7 +253,13 @@ static __always_inline void *kmalloc(siz if (!s) return ZERO_SIZE_PTR; - return kmem_cache_alloc(s, flags); + ret = kmem_cache_alloc_notrace(s, flags); + + kmemtrace_mark_alloc(KMEMTRACE_TYPE_KMALLOC, + _THIS_IP_, ret, + size, s->size, flags); + + return ret; } } return __kmalloc(size, flags); @@ -231,16 +269,38 @@ static __always_inline void *kmalloc(siz void *__kmalloc_node(size_t size, gfp_t flags, int node); void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node); +#ifdef CONFIG_KMEMTRACE +extern void *kmem_cache_alloc_node_notrace(struct kmem_cache *s, + gfp_t gfpflags, + int node); +#else +static __always_inline void * +kmem_cache_alloc_node_notrace(struct kmem_cache *s, + gfp_t gfpflags, + int node) +{ + return kmem_cache_alloc_node(s, gfpflags, node); +} +#endif + static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node) { + void *ret; + if (__builtin_constant_p(size) && - size <= PAGE_SIZE && !(flags & SLUB_DMA)) { + size <= SLUB_MAX_SIZE && !(flags & SLUB_DMA)) { struct kmem_cache *s = kmalloc_slab(size); if (!s) return ZERO_SIZE_PTR; - return kmem_cache_alloc_node(s, flags, node); + ret = kmem_cache_alloc_node_notrace(s, flags, node); + + kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_KMALLOC, + _THIS_IP_, ret, + size, s->size, flags, node); + + return ret; } return __kmalloc_node(size, flags, node); } Index: linux-2.6-tip/include/linux/smp.h =================================================================== --- linux-2.6-tip.orig/include/linux/smp.h +++ linux-2.6-tip/include/linux/smp.h @@ -50,6 +50,16 @@ extern void smp_send_stop(void); */ extern void smp_send_reschedule(int cpu); +/* + * trigger a reschedule on all other CPUs: + */ +extern void smp_send_reschedule_allbutself(void); + +/* + * trigger a reschedule on all other CPUs: + */ +extern void smp_send_reschedule_allbutself(void); + /* * Prepare machine for booting other CPUs. @@ -82,7 +92,8 @@ smp_call_function_mask(cpumask_t mask, v return 0; } -void __smp_call_function_single(int cpuid, struct call_single_data *data); +void __smp_call_function_single(int cpuid, struct call_single_data *data, + int wait); /* * Generic and arch helpers @@ -139,6 +150,7 @@ static inline int up_smp_call_function(v 0; \ }) static inline void smp_send_reschedule(int cpu) { } +static inline void smp_send_reschedule_allbutself(void) { } #define num_booting_cpus() 1 #define smp_prepare_boot_cpu() do {} while (0) #define smp_call_function_mask(mask, func, info, wait) \ @@ -174,7 +186,13 @@ static inline void init_call_single_data #define get_cpu() ({ preempt_disable(); smp_processor_id(); }) #define put_cpu() preempt_enable() -#define put_cpu_no_resched() preempt_enable_no_resched() +#define put_cpu_no_resched() __preempt_enable_no_resched() + +/* + * Callback to arch code if there's nosmp or maxcpus=0 on the + * boot command line: + */ +extern void arch_disable_smp_support(void); void smp_setup_processor_id(void); Index: linux-2.6-tip/include/linux/socket.h =================================================================== --- linux-2.6-tip.orig/include/linux/socket.h +++ linux-2.6-tip/include/linux/socket.h @@ -24,10 +24,12 @@ struct __kernel_sockaddr_storage { #include /* pid_t */ #include /* __user */ -#ifdef CONFIG_PROC_FS +#ifdef __KERNEL__ +# ifdef CONFIG_PROC_FS struct seq_file; extern void socket_seq_show(struct seq_file *seq); -#endif +# endif +#endif /* __KERNEL__ */ typedef unsigned short sa_family_t; Index: linux-2.6-tip/include/linux/stackprotector.h =================================================================== --- /dev/null +++ linux-2.6-tip/include/linux/stackprotector.h @@ -0,0 +1,16 @@ +#ifndef _LINUX_STACKPROTECTOR_H +#define _LINUX_STACKPROTECTOR_H 1 + +#include +#include +#include + +#ifdef CONFIG_CC_STACKPROTECTOR +# include +#else +static inline void boot_init_stack_canary(void) +{ +} +#endif + +#endif Index: linux-2.6-tip/include/linux/stacktrace.h =================================================================== --- linux-2.6-tip.orig/include/linux/stacktrace.h +++ linux-2.6-tip/include/linux/stacktrace.h @@ -4,6 +4,8 @@ struct task_struct; #ifdef CONFIG_STACKTRACE +struct task_struct; + struct stack_trace { unsigned int nr_entries, max_entries; unsigned long *entries; @@ -11,6 +13,7 @@ struct stack_trace { }; extern void save_stack_trace(struct stack_trace *trace); +extern void save_stack_trace_bp(struct stack_trace *trace, unsigned long bp); extern void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace); Index: linux-2.6-tip/include/linux/swiotlb.h =================================================================== --- linux-2.6-tip.orig/include/linux/swiotlb.h +++ linux-2.6-tip/include/linux/swiotlb.h @@ -31,7 +31,7 @@ extern dma_addr_t swiotlb_phys_to_bus(st phys_addr_t address); extern phys_addr_t swiotlb_bus_to_phys(dma_addr_t address); -extern int swiotlb_arch_range_needs_mapping(void *ptr, size_t size); +extern int swiotlb_arch_range_needs_mapping(phys_addr_t paddr, size_t size); extern void *swiotlb_alloc_coherent(struct device *hwdev, size_t size, @@ -41,20 +41,13 @@ extern void swiotlb_free_coherent(struct device *hwdev, size_t size, void *vaddr, dma_addr_t dma_handle); -extern dma_addr_t -swiotlb_map_single(struct device *hwdev, void *ptr, size_t size, int dir); - -extern void -swiotlb_unmap_single(struct device *hwdev, dma_addr_t dev_addr, - size_t size, int dir); - -extern dma_addr_t -swiotlb_map_single_attrs(struct device *hwdev, void *ptr, size_t size, - int dir, struct dma_attrs *attrs); - -extern void -swiotlb_unmap_single_attrs(struct device *hwdev, dma_addr_t dev_addr, - size_t size, int dir, struct dma_attrs *attrs); +extern dma_addr_t swiotlb_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction dir, + struct dma_attrs *attrs); +extern void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr, + size_t size, enum dma_data_direction dir, + struct dma_attrs *attrs); extern int swiotlb_map_sg(struct device *hwdev, struct scatterlist *sg, int nents, @@ -66,36 +59,38 @@ swiotlb_unmap_sg(struct device *hwdev, s extern int swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl, int nelems, - int dir, struct dma_attrs *attrs); + enum dma_data_direction dir, struct dma_attrs *attrs); extern void swiotlb_unmap_sg_attrs(struct device *hwdev, struct scatterlist *sgl, - int nelems, int dir, struct dma_attrs *attrs); + int nelems, enum dma_data_direction dir, + struct dma_attrs *attrs); extern void swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr, - size_t size, int dir); + size_t size, enum dma_data_direction dir); extern void swiotlb_sync_sg_for_cpu(struct device *hwdev, struct scatterlist *sg, - int nelems, int dir); + int nelems, enum dma_data_direction dir); extern void swiotlb_sync_single_for_device(struct device *hwdev, dma_addr_t dev_addr, - size_t size, int dir); + size_t size, enum dma_data_direction dir); extern void swiotlb_sync_sg_for_device(struct device *hwdev, struct scatterlist *sg, - int nelems, int dir); + int nelems, enum dma_data_direction dir); extern void swiotlb_sync_single_range_for_cpu(struct device *hwdev, dma_addr_t dev_addr, - unsigned long offset, size_t size, int dir); + unsigned long offset, size_t size, + enum dma_data_direction dir); extern void swiotlb_sync_single_range_for_device(struct device *hwdev, dma_addr_t dev_addr, unsigned long offset, size_t size, - int dir); + enum dma_data_direction dir); extern int swiotlb_dma_mapping_error(struct device *hwdev, dma_addr_t dma_addr); Index: linux-2.6-tip/include/linux/syscalls.h =================================================================== --- linux-2.6-tip.orig/include/linux/syscalls.h +++ linux-2.6-tip/include/linux/syscalls.h @@ -55,6 +55,7 @@ struct compat_timeval; struct robust_list_head; struct getcpu_cache; struct old_linux_dirent; +struct perf_counter_hw_event; #include #include @@ -694,4 +695,11 @@ asmlinkage long sys_pipe(int __user *); int kernel_execve(const char *filename, char *const argv[], char *const envp[]); + +asmlinkage int sys_perf_counter_open( + + struct perf_counter_hw_event *hw_event_uptr __user, + pid_t pid, + int cpu, + int group_fd); #endif Index: linux-2.6-tip/include/linux/timer.h =================================================================== --- linux-2.6-tip.orig/include/linux/timer.h +++ linux-2.6-tip/include/linux/timer.h @@ -5,6 +5,7 @@ #include #include #include +#include struct tvec_base; @@ -21,52 +22,126 @@ struct timer_list { char start_comm[16]; int start_pid; #endif +#ifdef CONFIG_LOCKDEP + struct lockdep_map lockdep_map; +#endif }; extern struct tvec_base boot_tvec_bases; +#ifdef CONFIG_LOCKDEP +/* + * NB: because we have to copy the lockdep_map, setting the lockdep_map key + * (second argument) here is required, otherwise it could be initialised to + * the copy of the lockdep_map later! We use the pointer to and the string + * ":" as the key resp. the name of the lockdep_map. + */ +#define __TIMER_LOCKDEP_MAP_INITIALIZER(_kn) \ + .lockdep_map = STATIC_LOCKDEP_MAP_INIT(_kn, &_kn), +#else +#define __TIMER_LOCKDEP_MAP_INITIALIZER(_kn) +#endif + #define TIMER_INITIALIZER(_function, _expires, _data) { \ .entry = { .prev = TIMER_ENTRY_STATIC }, \ .function = (_function), \ .expires = (_expires), \ .data = (_data), \ .base = &boot_tvec_bases, \ + __TIMER_LOCKDEP_MAP_INITIALIZER( \ + __FILE__ ":" __stringify(__LINE__)) \ } #define DEFINE_TIMER(_name, _function, _expires, _data) \ struct timer_list _name = \ TIMER_INITIALIZER(_function, _expires, _data) -void init_timer(struct timer_list *timer); -void init_timer_deferrable(struct timer_list *timer); +void init_timer_key(struct timer_list *timer, + const char *name, + struct lock_class_key *key); +void init_timer_deferrable_key(struct timer_list *timer, + const char *name, + struct lock_class_key *key); + +#ifdef CONFIG_LOCKDEP +#define init_timer(timer) \ + do { \ + static struct lock_class_key __key; \ + init_timer_key((timer), #timer, &__key); \ + } while (0) + +#define init_timer_deferrable(timer) \ + do { \ + static struct lock_class_key __key; \ + init_timer_deferrable_key((timer), #timer, &__key); \ + } while (0) + +#define init_timer_on_stack(timer) \ + do { \ + static struct lock_class_key __key; \ + init_timer_on_stack_key((timer), #timer, &__key); \ + } while (0) + +#define setup_timer(timer, fn, data) \ + do { \ + static struct lock_class_key __key; \ + setup_timer_key((timer), #timer, &__key, (fn), (data));\ + } while (0) + +#define setup_timer_on_stack(timer, fn, data) \ + do { \ + static struct lock_class_key __key; \ + setup_timer_on_stack_key((timer), #timer, &__key, \ + (fn), (data)); \ + } while (0) +#else +#define init_timer(timer)\ + init_timer_key((timer), NULL, NULL) +#define init_timer_deferrable(timer)\ + init_timer_deferrable_key((timer), NULL, NULL) +#define init_timer_on_stack(timer)\ + init_timer_on_stack_key((timer), NULL, NULL) +#define setup_timer(timer, fn, data)\ + setup_timer_key((timer), NULL, NULL, (fn), (data)) +#define setup_timer_on_stack(timer, fn, data)\ + setup_timer_on_stack_key((timer), NULL, NULL, (fn), (data)) +#endif #ifdef CONFIG_DEBUG_OBJECTS_TIMERS -extern void init_timer_on_stack(struct timer_list *timer); +extern void init_timer_on_stack_key(struct timer_list *timer, + const char *name, + struct lock_class_key *key); extern void destroy_timer_on_stack(struct timer_list *timer); #else static inline void destroy_timer_on_stack(struct timer_list *timer) { } -static inline void init_timer_on_stack(struct timer_list *timer) +static inline void init_timer_on_stack_key(struct timer_list *timer, + const char *name, + struct lock_class_key *key) { - init_timer(timer); + init_timer_key(timer, name, key); } #endif -static inline void setup_timer(struct timer_list * timer, +static inline void setup_timer_key(struct timer_list * timer, + const char *name, + struct lock_class_key *key, void (*function)(unsigned long), unsigned long data) { timer->function = function; timer->data = data; - init_timer(timer); + init_timer_key(timer, name, key); } -static inline void setup_timer_on_stack(struct timer_list *timer, +static inline void setup_timer_on_stack_key(struct timer_list *timer, + const char *name, + struct lock_class_key *key, void (*function)(unsigned long), unsigned long data) { timer->function = function; timer->data = data; - init_timer_on_stack(timer); + init_timer_on_stack_key(timer, name, key); } /** @@ -86,8 +161,8 @@ static inline int timer_pending(const st extern void add_timer_on(struct timer_list *timer, int cpu); extern int del_timer(struct timer_list * timer); -extern int __mod_timer(struct timer_list *timer, unsigned long expires); extern int mod_timer(struct timer_list *timer, unsigned long expires); +extern int mod_timer_pending(struct timer_list *timer, unsigned long expires); /* * The jiffies value which is added to now, when there is no timer @@ -146,30 +221,14 @@ static inline void timer_stats_timer_cle } #endif -/** - * add_timer - start a timer - * @timer: the timer to be added - * - * The kernel will do a ->function(->data) callback from the - * timer interrupt at the ->expires point in the future. The - * current time is 'jiffies'. - * - * The timer's ->expires, ->function (and if the handler uses it, ->data) - * fields must be set prior calling this function. - * - * Timers with an ->expires field in the past will be executed in the next - * timer tick. - */ -static inline void add_timer(struct timer_list *timer) -{ - BUG_ON(timer_pending(timer)); - __mod_timer(timer, timer->expires); -} +extern void add_timer(struct timer_list *timer); -#ifdef CONFIG_SMP +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_SOFTIRQS) + extern int timer_pending_sync(struct timer_list *timer); extern int try_to_del_timer_sync(struct timer_list *timer); extern int del_timer_sync(struct timer_list *timer); #else +# define timer_pending_sync(t) timer_pending(t) # define try_to_del_timer_sync(t) del_timer(t) # define del_timer_sync(t) del_timer(t) #endif Index: linux-2.6-tip/include/linux/topology.h =================================================================== --- linux-2.6-tip.orig/include/linux/topology.h +++ linux-2.6-tip/include/linux/topology.h @@ -193,5 +193,11 @@ int arch_update_cpu_topology(void); #ifndef topology_core_siblings #define topology_core_siblings(cpu) cpumask_of_cpu(cpu) #endif +#ifndef topology_thread_cpumask +#define topology_thread_cpumask(cpu) cpumask_of(cpu) +#endif +#ifndef topology_core_cpumask +#define topology_core_cpumask(cpu) cpumask_of(cpu) +#endif #endif /* _LINUX_TOPOLOGY_H */ Index: linux-2.6-tip/include/linux/types.h =================================================================== --- linux-2.6-tip.orig/include/linux/types.h +++ linux-2.6-tip/include/linux/types.h @@ -1,6 +1,9 @@ #ifndef _LINUX_TYPES_H #define _LINUX_TYPES_H +#include + +#ifndef __ASSEMBLY__ #ifdef __KERNEL__ #define DECLARE_BITMAP(name,bits) \ @@ -9,7 +12,6 @@ #endif #include -#include #ifndef __KERNEL_STRICT_NAMES @@ -212,5 +214,5 @@ struct ustat { }; #endif /* __KERNEL__ */ - +#endif /* __ASSEMBLY__ */ #endif /* _LINUX_TYPES_H */ Index: linux-2.6-tip/include/linux/ucb1400.h =================================================================== --- linux-2.6-tip.orig/include/linux/ucb1400.h +++ linux-2.6-tip/include/linux/ucb1400.h @@ -134,8 +134,8 @@ static inline void ucb1400_adc_enable(st ucb1400_reg_write(ac97, UCB_ADC_CR, UCB_ADC_ENA); } -static unsigned int ucb1400_adc_read(struct snd_ac97 *ac97, u16 adc_channel, - int adcsync) +static inline unsigned int +ucb1400_adc_read(struct snd_ac97 *ac97, u16 adc_channel, int adcsync) { unsigned int val; Index: linux-2.6-tip/include/linux/user_namespace.h =================================================================== --- linux-2.6-tip.orig/include/linux/user_namespace.h +++ linux-2.6-tip/include/linux/user_namespace.h @@ -13,6 +13,7 @@ struct user_namespace { struct kref kref; struct hlist_head uidhash_table[UIDHASH_SZ]; struct user_struct *creator; + struct work_struct destroyer; }; extern struct user_namespace init_user_ns; Index: linux-2.6-tip/include/net/inet_sock.h =================================================================== --- linux-2.6-tip.orig/include/net/inet_sock.h +++ linux-2.6-tip/include/net/inet_sock.h @@ -17,6 +17,7 @@ #define _INET_SOCK_H +#include #include #include #include @@ -66,14 +67,16 @@ struct inet_request_sock { __be32 loc_addr; __be32 rmt_addr; __be16 rmt_port; - u16 snd_wscale : 4, - rcv_wscale : 4, - tstamp_ok : 1, - sack_ok : 1, - wscale_ok : 1, - ecn_ok : 1, - acked : 1, - no_srccheck: 1; + kmemcheck_define_bitfield(flags, { + u16 snd_wscale : 4, + rcv_wscale : 4, + tstamp_ok : 1, + sack_ok : 1, + wscale_ok : 1, + ecn_ok : 1, + acked : 1, + no_srccheck: 1; + }); struct ip_options *opt; }; @@ -198,9 +201,12 @@ static inline int inet_sk_ehashfn(const static inline struct request_sock *inet_reqsk_alloc(struct request_sock_ops *ops) { struct request_sock *req = reqsk_alloc(ops); + struct inet_request_sock *ireq = inet_rsk(req); - if (req != NULL) - inet_rsk(req)->opt = NULL; + if (req != NULL) { + kmemcheck_annotate_bitfield(ireq->flags); + ireq->opt = NULL; + } return req; } Index: linux-2.6-tip/include/net/inet_timewait_sock.h =================================================================== --- linux-2.6-tip.orig/include/net/inet_timewait_sock.h +++ linux-2.6-tip/include/net/inet_timewait_sock.h @@ -16,6 +16,7 @@ #define _INET_TIMEWAIT_SOCK_ +#include #include #include #include @@ -127,10 +128,12 @@ struct inet_timewait_sock { __be32 tw_rcv_saddr; __be16 tw_dport; __u16 tw_num; - /* And these are ours. */ - __u8 tw_ipv6only:1, - tw_transparent:1; - /* 15 bits hole, try to pack */ + kmemcheck_define_bitfield(flags, { + /* And these are ours. */ + __u8 tw_ipv6only:1, + tw_transparent:1; + /* 14 bits hole, try to pack */ + }); __u16 tw_ipv6_offset; unsigned long tw_ttd; struct inet_bind_bucket *tw_tb; Index: linux-2.6-tip/include/trace/kmemtrace.h =================================================================== --- /dev/null +++ linux-2.6-tip/include/trace/kmemtrace.h @@ -0,0 +1,75 @@ +/* + * Copyright (C) 2008 Eduard - Gabriel Munteanu + * + * This file is released under GPL version 2. + */ + +#ifndef _LINUX_KMEMTRACE_H +#define _LINUX_KMEMTRACE_H + +#ifdef __KERNEL__ + +#include +#include + +enum kmemtrace_type_id { + KMEMTRACE_TYPE_KMALLOC = 0, /* kmalloc() or kfree(). */ + KMEMTRACE_TYPE_CACHE, /* kmem_cache_*(). */ + KMEMTRACE_TYPE_PAGES, /* __get_free_pages() and friends. */ +}; + +#ifdef CONFIG_KMEMTRACE + +extern void kmemtrace_init(void); + +extern void kmemtrace_mark_alloc_node(enum kmemtrace_type_id type_id, + unsigned long call_site, + const void *ptr, + size_t bytes_req, + size_t bytes_alloc, + gfp_t gfp_flags, + int node); + +extern void kmemtrace_mark_free(enum kmemtrace_type_id type_id, + unsigned long call_site, + const void *ptr); + +#else /* CONFIG_KMEMTRACE */ + +static inline void kmemtrace_init(void) +{ +} + +static inline void kmemtrace_mark_alloc_node(enum kmemtrace_type_id type_id, + unsigned long call_site, + const void *ptr, + size_t bytes_req, + size_t bytes_alloc, + gfp_t gfp_flags, + int node) +{ +} + +static inline void kmemtrace_mark_free(enum kmemtrace_type_id type_id, + unsigned long call_site, + const void *ptr) +{ +} + +#endif /* CONFIG_KMEMTRACE */ + +static inline void kmemtrace_mark_alloc(enum kmemtrace_type_id type_id, + unsigned long call_site, + const void *ptr, + size_t bytes_req, + size_t bytes_alloc, + gfp_t gfp_flags) +{ + kmemtrace_mark_alloc_node(type_id, call_site, ptr, + bytes_req, bytes_alloc, gfp_flags, -1); +} + +#endif /* __KERNEL__ */ + +#endif /* _LINUX_KMEMTRACE_H */ + Index: linux-2.6-tip/include/trace/power.h =================================================================== --- /dev/null +++ linux-2.6-tip/include/trace/power.h @@ -0,0 +1,34 @@ +#ifndef _TRACE_POWER_H +#define _TRACE_POWER_H + +#include +#include + +enum { + POWER_NONE = 0, + POWER_CSTATE = 1, + POWER_PSTATE = 2, +}; + +struct power_trace { +#ifdef CONFIG_POWER_TRACER + ktime_t stamp; + ktime_t end; + int type; + int state; +#endif +}; + +DECLARE_TRACE(power_start, + TPPROTO(struct power_trace *it, unsigned int type, unsigned int state), + TPARGS(it, type, state)); + +DECLARE_TRACE(power_mark, + TPPROTO(struct power_trace *it, unsigned int type, unsigned int state), + TPARGS(it, type, state)); + +DECLARE_TRACE(power_end, + TPPROTO(struct power_trace *it), + TPARGS(it)); + +#endif /* _TRACE_POWER_H */ Index: linux-2.6-tip/include/trace/workqueue.h =================================================================== --- /dev/null +++ linux-2.6-tip/include/trace/workqueue.h @@ -0,0 +1,25 @@ +#ifndef __TRACE_WORKQUEUE_H +#define __TRACE_WORKQUEUE_H + +#include +#include +#include + +DECLARE_TRACE(workqueue_insertion, + TPPROTO(struct task_struct *wq_thread, struct work_struct *work), + TPARGS(wq_thread, work)); + +DECLARE_TRACE(workqueue_execution, + TPPROTO(struct task_struct *wq_thread, struct work_struct *work), + TPARGS(wq_thread, work)); + +/* Trace the creation of one workqueue thread on a cpu */ +DECLARE_TRACE(workqueue_creation, + TPPROTO(struct task_struct *wq_thread, int cpu), + TPARGS(wq_thread, cpu)); + +DECLARE_TRACE(workqueue_destruction, + TPPROTO(struct task_struct *wq_thread), + TPARGS(wq_thread)); + +#endif /* __TRACE_WORKQUEUE_H */ Index: linux-2.6-tip/init/Kconfig =================================================================== --- linux-2.6-tip.orig/init/Kconfig +++ linux-2.6-tip/init/Kconfig @@ -246,6 +246,7 @@ choice config CLASSIC_RCU bool "Classic RCU" + depends on !PREEMPT_RT help This option selects the classic RCU implementation that is designed for best read-side performance on non-realtime @@ -255,6 +256,7 @@ config CLASSIC_RCU config TREE_RCU bool "Tree-based hierarchical RCU" + depends on !PREEMPT_RT help This option selects the RCU implementation that is designed for very large SMP system with hundreds or @@ -869,6 +871,36 @@ config AIO by some high performance threaded applications. Disabling this option saves about 7k. +config HAVE_PERF_COUNTERS + bool + +menu "Performance Counters" + +config PERF_COUNTERS + bool "Kernel Performance Counters" + depends on HAVE_PERF_COUNTERS + default y + select ANON_INODES + help + Enable kernel support for performance counter hardware. + + Performance counters are special hardware registers available + on most modern CPUs. These registers count the number of certain + types of hw events: such as instructions executed, cachemisses + suffered, or branches mis-predicted - without slowing down the + kernel or applications. These registers can also trigger interrupts + when a threshold number of events have passed - and can thus be + used to profile the code that runs on that CPU. + + The Linux Performance Counter subsystem provides an abstraction of + these hardware capabilities, available via a system call. It + provides per task and per CPU counters, and it provides event + capabilities on top of those. + + Say Y if unsure. + +endmenu + config VM_EVENT_COUNTERS default y bool "Enable VM event counters for /proc/vmstat" if EMBEDDED @@ -912,6 +944,7 @@ config SLAB config SLUB bool "SLUB (Unqueued Allocator)" + depends on !PREEMPT_RT help SLUB is a slab allocator that minimizes cache line usage instead of managing queues of cached objects (SLAB approach). @@ -945,7 +978,7 @@ config TRACEPOINTS config MARKERS bool "Activate markers" - depends on TRACEPOINTS + select TRACEPOINTS help Place an empty function call at each marker site. Can be dynamically changed for a probe function. @@ -966,7 +999,6 @@ config SLABINFO config RT_MUTEXES boolean - select PLIST config BASE_SMALL int Index: linux-2.6-tip/init/do_mounts.c =================================================================== --- linux-2.6-tip.orig/init/do_mounts.c +++ linux-2.6-tip/init/do_mounts.c @@ -228,9 +228,13 @@ static int __init do_mount_root(char *na return 0; } +#if PAGE_SIZE < PATH_MAX +# error increase the fs_names allocation size here +#endif + void __init mount_block_root(char *name, int flags) { - char *fs_names = __getname(); + char *fs_names = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 1); char *p; #ifdef CONFIG_BLOCK char b[BDEVNAME_SIZE]; @@ -282,7 +286,7 @@ retry: #endif panic("VFS: Unable to mount root fs on %s", b); out: - putname(fs_names); + free_pages((unsigned long)fs_names, 1); } #ifdef CONFIG_ROOT_NFS Index: linux-2.6-tip/init/main.c =================================================================== --- linux-2.6-tip.orig/init/main.c +++ linux-2.6-tip/init/main.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -35,6 +36,7 @@ #include #include #include +#include #include #include #include @@ -48,6 +50,7 @@ #include #include #include +#include #include #include #include @@ -61,6 +64,7 @@ #include #include #include +#include #include #include #include @@ -70,6 +74,7 @@ #include #include #include +#include #ifdef CONFIG_X86_LOCAL_APIC #include @@ -97,7 +102,7 @@ static inline void mark_rodata_ro(void) extern void tc_init(void); #endif -enum system_states system_state; +enum system_states system_state __read_mostly; EXPORT_SYMBOL(system_state); /* @@ -135,14 +140,14 @@ unsigned int __initdata setup_max_cpus = * greater than 0, limits the maximum number of CPUs activated in * SMP mode to . */ -#ifndef CONFIG_X86_IO_APIC -static inline void disable_ioapic_setup(void) {}; -#endif + +void __weak arch_disable_smp_support(void) { } static int __init nosmp(char *str) { setup_max_cpus = 0; - disable_ioapic_setup(); + arch_disable_smp_support(); + return 0; } @@ -152,14 +157,14 @@ static int __init maxcpus(char *str) { get_option(&str, &setup_max_cpus); if (setup_max_cpus == 0) - disable_ioapic_setup(); + arch_disable_smp_support(); return 0; } early_param("maxcpus", maxcpus); #else -#define setup_max_cpus NR_CPUS +const unsigned int setup_max_cpus = NR_CPUS; #endif /* @@ -452,6 +457,8 @@ static noinline void __init_refok rest_i { int pid; + system_state = SYSTEM_BOOTING_SCHEDULER_OK; + kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND); numa_default_policy(); pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES); @@ -463,7 +470,7 @@ static noinline void __init_refok rest_i * at least once to get things moving: */ init_idle_bootup_task(current); - preempt_enable_no_resched(); + __preempt_enable_no_resched(); schedule(); preempt_disable(); @@ -539,6 +546,12 @@ asmlinkage void __init start_kernel(void */ lockdep_init(); debug_objects_early_init(); + + /* + * Set up the the initial canary ASAP: + */ + boot_init_stack_canary(); + cgroup_init_early(); local_irq_disable(); @@ -573,8 +586,10 @@ asmlinkage void __init start_kernel(void * fragile until we cpu_idle() for the first time. */ preempt_disable(); + build_all_zonelists(); page_alloc_init(); + early_init_hardirqs(); printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line); parse_early_param(); parse_args("Booting kernel", static_command_line, __start___param, @@ -641,6 +656,7 @@ asmlinkage void __init start_kernel(void enable_debug_pagealloc(); cpu_hotplug_init(); kmem_cache_init(); + kmemtrace_init(); debug_objects_mem_init(); idr_init_cache(); setup_per_cpu_pageset(); @@ -682,6 +698,9 @@ asmlinkage void __init start_kernel(void ftrace_init(); +#ifdef CONFIG_PREEMPT_RT + WARN_ON(irqs_disabled()); +#endif /* Do the rest non-__init'ed, we're now alive */ rest_init(); } @@ -771,9 +790,14 @@ static void __init do_basic_setup(void) static void __init do_pre_smp_initcalls(void) { initcall_t *call; + extern int spawn_desched_task(void); + + /* kmemcheck must initialize before all early initcalls: */ + kmemcheck_init(); for (call = __initcall_start; call < __early_initcall_end; call++) do_one_initcall(*call); + spawn_desched_task(); } static void run_init_process(char *init_filename) @@ -808,6 +832,9 @@ static noinline int init_post(void) printk(KERN_WARNING "Failed to execute %s\n", ramdisk_execute_command); } +#ifdef CONFIG_PREEMPT_RT + WARN_ON(irqs_disabled()); +#endif /* * We try each of these until one succeeds. @@ -849,6 +876,8 @@ static int __init kernel_init(void * unu smp_prepare_cpus(setup_max_cpus); + init_hardirqs(); + do_pre_smp_initcalls(); start_boot_trace(); @@ -871,7 +900,54 @@ static int __init kernel_init(void * unu ramdisk_execute_command = NULL; prepare_namespace(); } +#ifdef CONFIG_PREEMPT_RT + WARN_ON(irqs_disabled()); +#endif + +#define DEBUG_COUNT (defined(CONFIG_DEBUG_RT_MUTEXES) + defined(CONFIG_IRQSOFF_TRACER) + defined(CONFIG_PREEMPT_TRACER) + defined(CONFIG_STACK_TRACER) + defined(CONFIG_WAKEUP_LATENCY_HIST) + defined(CONFIG_DEBUG_SLAB) + defined(CONFIG_DEBUG_PAGEALLOC) + defined(CONFIG_LOCKDEP) + (defined(CONFIG_FTRACE) - defined(CONFIG_FTRACE_MCOUNT_RECORD))) +#if DEBUG_COUNT > 0 + printk(KERN_ERR "*****************************************************************************\n"); + printk(KERN_ERR "* *\n"); +#if DEBUG_COUNT == 1 + printk(KERN_ERR "* REMINDER, the following debugging option is turned on in your .config: *\n"); +#else + printk(KERN_ERR "* REMINDER, the following debugging options are turned on in your .config: *\n"); +#endif + printk(KERN_ERR "* *\n"); +#ifdef CONFIG_DEBUG_RT_MUTEXES + printk(KERN_ERR "* CONFIG_DEBUG_RT_MUTEXES *\n"); +#endif +#ifdef CONFIG_IRQSOFF_TRACER + printk(KERN_ERR "* CONFIG_IRQSOFF_TRACER *\n"); +#endif +#ifdef CONFIG_PREEMPT_TRACER + printk(KERN_ERR "* CONFIG_PREEMPT_TRACER *\n"); +#endif +#ifdef CONFIG_FTRACE + printk(KERN_ERR "* CONFIG_FTRACE *\n"); +#endif +#ifdef CONFIG_WAKEUP_LATENCY_HIST + printk(KERN_ERR "* CONFIG_WAKEUP_LATENCY_HIST *\n"); +#endif +#ifdef CONFIG_DEBUG_SLAB + printk(KERN_ERR "* CONFIG_DEBUG_SLAB *\n"); +#endif +#ifdef CONFIG_DEBUG_PAGEALLOC + printk(KERN_ERR "* CONFIG_DEBUG_PAGEALLOC *\n"); +#endif +#ifdef CONFIG_LOCKDEP + printk(KERN_ERR "* CONFIG_LOCKDEP *\n"); +#endif + printk(KERN_ERR "* *\n"); +#if DEBUG_COUNT == 1 + printk(KERN_ERR "* it may increase runtime overhead and latencies. *\n"); +#else + printk(KERN_ERR "* they may increase runtime overhead and latencies. *\n"); +#endif + printk(KERN_ERR "* *\n"); + printk(KERN_ERR "*****************************************************************************\n"); +#endif /* * Ok, we have completed the initial bootup, and * we're essentially up and running. Get rid of the @@ -879,5 +955,7 @@ static int __init kernel_init(void * unu */ init_post(); + WARN_ON(debug_direct_keyboard); + return 0; } Index: linux-2.6-tip/kernel/Makefile =================================================================== --- linux-2.6-tip.orig/kernel/Makefile +++ linux-2.6-tip/kernel/Makefile @@ -7,7 +7,7 @@ obj-y = sched.o fork.o exec_domain.o sysctl.o capability.o ptrace.o timer.o user.o \ signal.o sys.o kmod.o workqueue.o pid.o \ rcupdate.o extable.o params.o posix-timers.o \ - kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \ + kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o \ hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \ notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \ async.o @@ -27,7 +27,10 @@ obj-$(CONFIG_PROFILING) += profile.o obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sysctl_check.o obj-$(CONFIG_STACKTRACE) += stacktrace.o obj-y += time/ +ifneq ($(CONFIG_PREEMPT_RT),y) +obj-y += mutex.o obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o +endif obj-$(CONFIG_LOCKDEP) += lockdep.o ifeq ($(CONFIG_PROC_FS),y) obj-$(CONFIG_LOCKDEP) += lockdep_proc.o @@ -39,6 +42,7 @@ endif obj-$(CONFIG_RT_MUTEXES) += rtmutex.o obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o +obj-$(CONFIG_PREEMPT_RT) += rt.o obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o obj-$(CONFIG_USE_GENERIC_SMP_HELPERS) += smp.o ifneq ($(CONFIG_SMP),y) @@ -74,6 +78,7 @@ obj-$(CONFIG_AUDIT_TREE) += audit_tree.o obj-$(CONFIG_KPROBES) += kprobes.o obj-$(CONFIG_KGDB) += kgdb.o obj-$(CONFIG_DETECT_SOFTLOCKUP) += softlockup.o +obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o @@ -93,6 +98,7 @@ obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) obj-$(CONFIG_FUNCTION_TRACER) += trace/ obj-$(CONFIG_TRACING) += trace/ obj-$(CONFIG_SMP) += sched_cpupri.o +obj-$(CONFIG_PERF_COUNTERS) += perf_counter.o ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y) # According to Alan Modra , the -fno-omit-frame-pointer is Index: linux-2.6-tip/kernel/exit.c =================================================================== --- linux-2.6-tip.orig/kernel/exit.c +++ linux-2.6-tip/kernel/exit.c @@ -75,7 +75,9 @@ static void __unhash_process(struct task detach_pid(p, PIDTYPE_SID); list_del_rcu(&p->tasks); + preempt_disable(); __get_cpu_var(process_counts)--; + preempt_enable(); } list_del_rcu(&p->thread_group); list_del_init(&p->sibling); @@ -162,6 +164,9 @@ static void delayed_put_task_struct(stru { struct task_struct *tsk = container_of(rhp, struct task_struct, rcu); +#ifdef CONFIG_PERF_COUNTERS + WARN_ON_ONCE(!list_empty(&tsk->perf_counter_ctx.counter_list)); +#endif trace_sched_process_free(tsk); put_task_struct(tsk); } @@ -723,9 +728,11 @@ static void exit_mm(struct task_struct * task_lock(tsk); tsk->mm = NULL; up_read(&mm->mmap_sem); + preempt_disable(); // FIXME enter_lazy_tlb(mm, current); /* We don't want this task to be frozen prematurely */ clear_freeze_flag(tsk); + preempt_enable(); task_unlock(tsk); mm_update_next_owner(mm); mmput(mm); @@ -980,12 +987,9 @@ static void check_stack_usage(void) { static DEFINE_SPINLOCK(low_water_lock); static int lowest_to_date = THREAD_SIZE; - unsigned long *n = end_of_stack(current); unsigned long free; - while (*n == 0) - n++; - free = (unsigned long)n - (unsigned long)end_of_stack(current); + free = stack_not_used(current); if (free >= lowest_to_date) return; @@ -1096,10 +1100,6 @@ NORET_TYPE void do_exit(long code) tsk->mempolicy = NULL; #endif #ifdef CONFIG_FUTEX - /* - * This must happen late, after the PID is not - * hashed anymore: - */ if (unlikely(!list_empty(&tsk->pi_state_list))) exit_pi_state_list(tsk); if (unlikely(current->pi_state_cache)) @@ -1122,14 +1122,17 @@ NORET_TYPE void do_exit(long code) if (tsk->splice_pipe) __free_pipe_info(tsk->splice_pipe); - preempt_disable(); +again: + local_irq_disable(); /* causes final put_task_struct in finish_task_switch(). */ tsk->state = TASK_DEAD; - schedule(); - BUG(); - /* Avoid "noreturn function does return". */ - for (;;) - cpu_relax(); /* For when BUG is null */ + __schedule(); + printk(KERN_ERR "BUG: dead task %s:%d back from the grave!\n", + current->comm, current->pid); + printk(KERN_ERR ".... flags: %08x, count: %d, state: %08lx\n", + current->flags, atomic_read(¤t->usage), current->state); + printk(KERN_ERR ".... trying again ...\n"); + goto again; } EXPORT_SYMBOL_GPL(do_exit); @@ -1366,6 +1369,12 @@ static int wait_task_zombie(struct task_ */ read_unlock(&tasklist_lock); + /* + * Flush inherited counters to the parent - before the parent + * gets woken up by child-exit notifications. + */ + perf_counter_exit_task(p); + retval = ru ? getrusage(p, RUSAGE_BOTH, ru) : 0; status = (p->signal->flags & SIGNAL_GROUP_EXIT) ? p->signal->group_exit_code : p->exit_code; @@ -1572,6 +1581,7 @@ static int wait_consider_task(struct tas int __user *stat_addr, struct rusage __user *ru) { int ret = eligible_child(type, pid, options, p); + BUG_ON(!atomic_read(&p->usage)); if (!ret) return ret; Index: linux-2.6-tip/kernel/extable.c =================================================================== --- linux-2.6-tip.orig/kernel/extable.c +++ linux-2.6-tip/kernel/extable.c @@ -41,7 +41,7 @@ const struct exception_table_entry *sear return e; } -__notrace_funcgraph int core_kernel_text(unsigned long addr) +int core_kernel_text(unsigned long addr) { if (addr >= (unsigned long)_stext && addr <= (unsigned long)_etext) @@ -54,7 +54,7 @@ __notrace_funcgraph int core_kernel_text return 0; } -__notrace_funcgraph int __kernel_text_address(unsigned long addr) +int __kernel_text_address(unsigned long addr) { if (core_kernel_text(addr)) return 1; Index: linux-2.6-tip/kernel/fork.c =================================================================== --- linux-2.6-tip.orig/kernel/fork.c +++ linux-2.6-tip/kernel/fork.c @@ -39,6 +39,7 @@ #include #include #include +#include #include #include #include @@ -49,6 +50,8 @@ #include #include #include +#include +#include #include #include #include @@ -61,6 +64,7 @@ #include #include #include +#include #include #include @@ -79,10 +83,23 @@ int max_threads; /* tunable limit on nr DEFINE_PER_CPU(unsigned long, process_counts) = 0; +#ifdef CONFIG_PREEMPT_RT +DEFINE_RWLOCK(tasklist_lock); /* outer */ +#else __cacheline_aligned DEFINE_RWLOCK(tasklist_lock); /* outer */ +#endif DEFINE_TRACE(sched_process_fork); +/* + * Delayed mmdrop. In the PREEMPT_RT case we + * dont want to do this from the scheduling + * context. + */ +static DEFINE_PER_CPU(struct task_struct *, desched_task); + +static DEFINE_PER_CPU(struct list_head, delayed_drop_list); + int nr_processes(void) { int cpu; @@ -159,6 +176,16 @@ void __put_task_struct(struct task_struc free_task(tsk); } +#ifdef CONFIG_PREEMPT_RT +void __put_task_struct_cb(struct rcu_head *rhp) +{ + struct task_struct *tsk = container_of(rhp, struct task_struct, rcu); + + __put_task_struct(tsk); + +} +#endif + /* * macro override instead of weak attribute alias, to workaround * gcc 4.1.0 and 4.1.1 bugs with weak attribute and empty functions. @@ -169,6 +196,8 @@ void __put_task_struct(struct task_struc void __init fork_init(unsigned long mempages) { + int i; + #ifndef __HAVE_ARCH_TASK_STRUCT_ALLOCATOR #ifndef ARCH_MIN_TASKALIGN #define ARCH_MIN_TASKALIGN L1_CACHE_BYTES @@ -176,7 +205,7 @@ void __init fork_init(unsigned long memp /* create a slab on which task_structs can be allocated */ task_struct_cachep = kmem_cache_create("task_struct", sizeof(struct task_struct), - ARCH_MIN_TASKALIGN, SLAB_PANIC, NULL); + ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL); #endif /* do the arch specific task caches init */ @@ -199,6 +228,9 @@ void __init fork_init(unsigned long memp init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2; init_task.signal->rlim[RLIMIT_SIGPENDING] = init_task.signal->rlim[RLIMIT_NPROC]; + + for (i = 0; i < NR_CPUS; i++) + INIT_LIST_HEAD(&per_cpu(delayed_drop_list, i)); } int __attribute__((weak)) arch_dup_task_struct(struct task_struct *dst, @@ -212,6 +244,8 @@ static struct task_struct *dup_task_stru { struct task_struct *tsk; struct thread_info *ti; + unsigned long *stackend; + int err; prepare_to_copy(orig); @@ -237,6 +271,8 @@ static struct task_struct *dup_task_stru goto out; setup_thread_stack(tsk, orig); + stackend = end_of_stack(tsk); + *stackend = STACK_END_MAGIC; /* for overflow detection */ #ifdef CONFIG_CC_STACKPROTECTOR tsk->stack_canary = get_random_int(); @@ -276,6 +312,7 @@ static int dup_mmap(struct mm_struct *mm mm->locked_vm = 0; mm->mmap = NULL; mm->mmap_cache = NULL; + INIT_LIST_HEAD(&mm->delayed_drop); mm->free_area_cache = oldmm->mmap_base; mm->cached_hole_size = ~0UL; mm->map_count = 0; @@ -639,6 +676,9 @@ static int copy_mm(unsigned long clone_f tsk->min_flt = tsk->maj_flt = 0; tsk->nvcsw = tsk->nivcsw = 0; +#ifdef CONFIG_DETECT_HUNG_TASK + tsk->last_switch_count = tsk->nvcsw + tsk->nivcsw; +#endif tsk->mm = NULL; tsk->active_mm = NULL; @@ -913,6 +953,9 @@ static void rt_mutex_init_task(struct ta #ifdef CONFIG_RT_MUTEXES plist_head_init(&p->pi_waiters, &p->pi_lock); p->pi_blocked_on = NULL; +# ifdef CONFIG_DEBUG_RT_MUTEXES + p->last_kernel_lock = NULL; +# endif #endif } @@ -984,6 +1027,7 @@ static struct task_struct *copy_process( goto fork_out; rt_mutex_init_task(p); + perf_counter_init_task(p); #ifdef CONFIG_PROVE_LOCKING DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled); @@ -1041,16 +1085,11 @@ static struct task_struct *copy_process( p->default_timer_slack_ns = current->timer_slack_ns; -#ifdef CONFIG_DETECT_SOFTLOCKUP - p->last_switch_count = 0; - p->last_switch_timestamp = 0; -#endif - task_io_accounting_init(&p->ioac); acct_clear_integrals(p); posix_cpu_timers_init(p); - + p->posix_timer_list = NULL; p->lock_depth = -1; /* -1 = no lock */ do_posix_clock_monotonic_gettime(&p->start_time); p->real_start_time = p->start_time; @@ -1086,6 +1125,7 @@ static struct task_struct *copy_process( p->hardirq_context = 0; p->softirq_context = 0; #endif + p->pagefault_disabled = 0; #ifdef CONFIG_LOCKDEP p->lockdep_depth = 0; /* no locks held yet */ p->curr_chain_key = 0; @@ -1123,6 +1163,9 @@ static struct task_struct *copy_process( retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs); if (retval) goto bad_fork_cleanup_io; +#ifdef CONFIG_DEBUG_PREEMPT + atomic_set(&p->lock_count, 0); +#endif if (pid != &init_struct_pid) { retval = -ENOMEM; @@ -1213,11 +1256,13 @@ static struct task_struct *copy_process( * to ensure it is on a valid CPU (and if not, just force it back to * parent's CPU). This avoids alot of nasty races. */ + preempt_disable(); p->cpus_allowed = current->cpus_allowed; p->rt.nr_cpus_allowed = current->rt.nr_cpus_allowed; if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) || !cpu_online(task_cpu(p)))) set_task_cpu(p, smp_processor_id()); + preempt_enable(); /* CLONE_PARENT re-uses the old parent */ if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) @@ -1264,7 +1309,9 @@ static struct task_struct *copy_process( attach_pid(p, PIDTYPE_PGID, task_pgrp(current)); attach_pid(p, PIDTYPE_SID, task_session(current)); list_add_tail_rcu(&p->tasks, &init_task.tasks); + preempt_disable(); __get_cpu_var(process_counts)++; + preempt_enable(); } attach_pid(p, PIDTYPE_PID, pid); nr_threads++; @@ -1470,20 +1517,20 @@ void __init proc_caches_init(void) { sighand_cachep = kmem_cache_create("sighand_cache", sizeof(struct sighand_struct), 0, - SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_DESTROY_BY_RCU, - sighand_ctor); + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_DESTROY_BY_RCU| + SLAB_NOTRACK, sighand_ctor); signal_cachep = kmem_cache_create("signal_cache", sizeof(struct signal_struct), 0, - SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL); + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL); files_cachep = kmem_cache_create("files_cache", sizeof(struct files_struct), 0, - SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL); + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL); fs_cachep = kmem_cache_create("fs_cache", sizeof(struct fs_struct), 0, - SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL); + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL); mm_cachep = kmem_cache_create("mm_struct", sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN, - SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL); + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL); mmap_init(); } @@ -1730,3 +1777,124 @@ int unshare_files(struct files_struct ** task_unlock(task); return 0; } + +static int mmdrop_complete(void) +{ + struct list_head *head; + int ret = 0; + + head = &get_cpu_var(delayed_drop_list); + while (!list_empty(head)) { + struct mm_struct *mm = list_entry(head->next, + struct mm_struct, delayed_drop); + list_del(&mm->delayed_drop); + put_cpu_var(delayed_drop_list); + + __mmdrop(mm); + ret = 1; + + head = &get_cpu_var(delayed_drop_list); + } + put_cpu_var(delayed_drop_list); + + return ret; +} + +/* + * We dont want to do complex work from the scheduler, thus + * we delay the work to a per-CPU worker thread: + */ +void __mmdrop_delayed(struct mm_struct *mm) +{ + struct task_struct *desched_task; + struct list_head *head; + + head = &get_cpu_var(delayed_drop_list); + list_add_tail(&mm->delayed_drop, head); + desched_task = __get_cpu_var(desched_task); + if (desched_task) + wake_up_process(desched_task); + put_cpu_var(delayed_drop_list); +} + +static int desched_thread(void * __bind_cpu) +{ + set_user_nice(current, -10); + current->flags |= PF_NOFREEZE | PF_SOFTIRQ; + + set_current_state(TASK_INTERRUPTIBLE); + + while (!kthread_should_stop()) { + + if (mmdrop_complete()) + continue; + schedule(); + + /* + * This must be called from time to time on ia64, and is a + * no-op on other archs. Used to be in cpu_idle(), but with + * the new -rt semantics it can't stay there. + */ + check_pgt_cache(); + + set_current_state(TASK_INTERRUPTIBLE); + } + __set_current_state(TASK_RUNNING); + return 0; +} + +static int __devinit cpu_callback(struct notifier_block *nfb, + unsigned long action, + void *hcpu) +{ + int hotcpu = (unsigned long)hcpu; + struct task_struct *p; + + switch (action) { + case CPU_UP_PREPARE: + + BUG_ON(per_cpu(desched_task, hotcpu)); + INIT_LIST_HEAD(&per_cpu(delayed_drop_list, hotcpu)); + p = kthread_create(desched_thread, hcpu, "desched/%d", hotcpu); + if (IS_ERR(p)) { + printk("desched_thread for %i failed\n", hotcpu); + return NOTIFY_BAD; + } + per_cpu(desched_task, hotcpu) = p; + kthread_bind(p, hotcpu); + break; + case CPU_ONLINE: + + wake_up_process(per_cpu(desched_task, hotcpu)); + break; +#ifdef CONFIG_HOTPLUG_CPU + case CPU_UP_CANCELED: + + /* Unbind so it can run. Fall thru. */ + kthread_bind(per_cpu(desched_task, hotcpu), smp_processor_id()); + case CPU_DEAD: + + p = per_cpu(desched_task, hotcpu); + per_cpu(desched_task, hotcpu) = NULL; + kthread_stop(p); + takeover_tasklets(hotcpu); + break; +#endif /* CONFIG_HOTPLUG_CPU */ + } + return NOTIFY_OK; +} + +static struct notifier_block __devinitdata cpu_nfb = { + .notifier_call = cpu_callback +}; + +__init int spawn_desched_task(void) +{ + void *cpu = (void *)(long)smp_processor_id(); + + cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu); + cpu_callback(&cpu_nfb, CPU_ONLINE, cpu); + register_cpu_notifier(&cpu_nfb); + return 0; +} + Index: linux-2.6-tip/kernel/hung_task.c =================================================================== --- /dev/null +++ linux-2.6-tip/kernel/hung_task.c @@ -0,0 +1,217 @@ +/* + * Detect Hung Task + * + * kernel/hung_task.c - kernel thread for detecting tasks stuck in D state + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* + * The number of tasks checked: + */ +unsigned long __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT; + +/* + * Limit number of tasks checked in a batch. + * + * This value controls the preemptibility of khungtaskd since preemption + * is disabled during the critical section. It also controls the size of + * the RCU grace period. So it needs to be upper-bound. + */ +#define HUNG_TASK_BATCHING 1024 + +/* + * Zero means infinite timeout - no checking done: + */ +unsigned long __read_mostly sysctl_hung_task_timeout_secs = 120; + +unsigned long __read_mostly sysctl_hung_task_warnings = 10; + +static int __read_mostly did_panic; + +static struct task_struct *watchdog_task; + +/* + * Should we panic (and reboot, if panic_timeout= is set) when a + * hung task is detected: + */ +unsigned int __read_mostly sysctl_hung_task_panic = + CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE; + +static int __init hung_task_panic_setup(char *str) +{ + sysctl_hung_task_panic = simple_strtoul(str, NULL, 0); + + return 1; +} +__setup("hung_task_panic=", hung_task_panic_setup); + +static int +hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr) +{ + did_panic = 1; + + return NOTIFY_DONE; +} + +static struct notifier_block panic_block = { + .notifier_call = hung_task_panic, +}; + +static void check_hung_task(struct task_struct *t, unsigned long timeout) +{ + unsigned long switch_count = t->nvcsw + t->nivcsw; + + /* + * Ensure the task is not frozen. + * Also, when a freshly created task is scheduled once, changes + * its state to TASK_UNINTERRUPTIBLE without having ever been + * switched out once, it musn't be checked. + */ + if (unlikely(t->flags & PF_FROZEN || !switch_count)) + return; + + if (switch_count != t->last_switch_count) { + t->last_switch_count = switch_count; + return; + } + if (!sysctl_hung_task_warnings) + return; + sysctl_hung_task_warnings--; + + /* + * Ok, the task did not get scheduled for more than 2 minutes, + * complain: + */ + printk(KERN_ERR "INFO: task %s:%d blocked for more than " + "%ld seconds.\n", t->comm, t->pid, timeout); + printk(KERN_ERR "\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\"" + " disables this message.\n"); + sched_show_task(t); + __debug_show_held_locks(t); + + touch_nmi_watchdog(); + + if (sysctl_hung_task_panic) + panic("hung_task: blocked tasks"); +} + +/* + * To avoid extending the RCU grace period for an unbounded amount of time, + * periodically exit the critical section and enter a new one. + * + * For preemptible RCU it is sufficient to call rcu_read_unlock in order + * exit the grace period. For classic RCU, a reschedule is required. + */ +static void rcu_lock_break(struct task_struct *g, struct task_struct *t) +{ + get_task_struct(g); + get_task_struct(t); + rcu_read_unlock(); + cond_resched(); + rcu_read_lock(); + put_task_struct(t); + put_task_struct(g); +} + +/* + * Check whether a TASK_UNINTERRUPTIBLE does not get woken up for + * a really long time (120 seconds). If that happens, print out + * a warning. + */ +static void check_hung_uninterruptible_tasks(unsigned long timeout) +{ + int max_count = sysctl_hung_task_check_count; + int batch_count = HUNG_TASK_BATCHING; + struct task_struct *g, *t; + + /* + * If the system crashed already then all bets are off, + * do not report extra hung tasks: + */ + if (test_taint(TAINT_DIE) || did_panic) + return; + + rcu_read_lock(); + do_each_thread(g, t) { + if (!--max_count) + goto unlock; + if (!--batch_count) { + batch_count = HUNG_TASK_BATCHING; + rcu_lock_break(g, t); + /* Exit if t or g was unhashed during refresh. */ + if (t->state == TASK_DEAD || g->state == TASK_DEAD) + goto unlock; + } + /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */ + if (t->state == TASK_UNINTERRUPTIBLE) + check_hung_task(t, timeout); + } while_each_thread(g, t); + unlock: + rcu_read_unlock(); +} + +static unsigned long timeout_jiffies(unsigned long timeout) +{ + /* timeout of 0 will disable the watchdog */ + return timeout ? timeout * HZ : MAX_SCHEDULE_TIMEOUT; +} + +/* + * Process updating of timeout sysctl + */ +int proc_dohung_task_timeout_secs(struct ctl_table *table, int write, + struct file *filp, void __user *buffer, + size_t *lenp, loff_t *ppos) +{ + int ret; + + ret = proc_doulongvec_minmax(table, write, filp, buffer, lenp, ppos); + + if (ret || !write) + goto out; + + wake_up_process(watchdog_task); + + out: + return ret; +} + +/* + * kthread which checks for tasks stuck in D state + */ +static int watchdog(void *dummy) +{ + set_user_nice(current, 0); + + for ( ; ; ) { + unsigned long timeout = sysctl_hung_task_timeout_secs; + + while (schedule_timeout_interruptible(timeout_jiffies(timeout))) + timeout = sysctl_hung_task_timeout_secs; + + check_hung_uninterruptible_tasks(timeout); + } + + return 0; +} + +static int __init hung_task_init(void) +{ + atomic_notifier_chain_register(&panic_notifier_list, &panic_block); + watchdog_task = kthread_run(watchdog, NULL, "khungtaskd"); + + return 0; +} + +module_init(hung_task_init); Index: linux-2.6-tip/kernel/irq/chip.c =================================================================== --- linux-2.6-tip.orig/kernel/irq/chip.c +++ linux-2.6-tip/kernel/irq/chip.c @@ -46,7 +46,10 @@ void dynamic_irq_init(unsigned int irq) desc->irq_count = 0; desc->irqs_unhandled = 0; #ifdef CONFIG_SMP - cpumask_setall(&desc->affinity); + cpumask_setall(desc->affinity); +#ifdef CONFIG_GENERIC_PENDING_IRQ + cpumask_clear(desc->pending_mask); +#endif #endif spin_unlock_irqrestore(&desc->lock, flags); } @@ -78,6 +81,7 @@ void dynamic_irq_cleanup(unsigned int ir desc->handle_irq = handle_bad_irq; desc->chip = &no_irq_chip; desc->name = NULL; + clear_kstat_irqs(desc); spin_unlock_irqrestore(&desc->lock, flags); } @@ -289,8 +293,11 @@ static inline void mask_ack_irq(struct i if (desc->chip->mask_ack) desc->chip->mask_ack(irq); else { - desc->chip->mask(irq); - desc->chip->ack(irq); + if (desc->chip->ack) + if (desc->chip->mask) + desc->chip->mask(irq); + if (desc->chip->ack) + desc->chip->ack(irq); } } @@ -314,8 +321,10 @@ handle_simple_irq(unsigned int irq, stru spin_lock(&desc->lock); - if (unlikely(desc->status & IRQ_INPROGRESS)) + if (unlikely(desc->status & IRQ_INPROGRESS)) { + desc->status |= IRQ_PENDING; goto out_unlock; + } desc->status &= ~(IRQ_REPLAY | IRQ_WAITING); kstat_incr_irqs_this_cpu(irq, desc); @@ -324,6 +333,11 @@ handle_simple_irq(unsigned int irq, stru goto out_unlock; desc->status |= IRQ_INPROGRESS; + /* + * hardirq redirection to the irqd process context: + */ + if (redirect_hardirq(desc)) + goto out_unlock; spin_unlock(&desc->lock); action_ret = handle_IRQ_event(irq, action); @@ -332,6 +346,8 @@ handle_simple_irq(unsigned int irq, stru spin_lock(&desc->lock); desc->status &= ~IRQ_INPROGRESS; + if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask) + desc->chip->unmask(irq); out_unlock: spin_unlock(&desc->lock); } @@ -370,6 +386,13 @@ handle_level_irq(unsigned int irq, struc goto out_unlock; desc->status |= IRQ_INPROGRESS; + + /* + * hardirq redirection to the irqd process context: + */ + if (redirect_hardirq(desc)) + goto out_unlock; + spin_unlock(&desc->lock); action_ret = handle_IRQ_event(irq, action); @@ -403,18 +426,16 @@ handle_fasteoi_irq(unsigned int irq, str spin_lock(&desc->lock); - if (unlikely(desc->status & IRQ_INPROGRESS)) - goto out; - desc->status &= ~(IRQ_REPLAY | IRQ_WAITING); kstat_incr_irqs_this_cpu(irq, desc); /* - * If its disabled or no action available + * If it's running, disabled or no action available * then mask it and get out of here: */ action = desc->action; - if (unlikely(!action || (desc->status & IRQ_DISABLED))) { + if (unlikely(!action || (desc->status & (IRQ_INPROGRESS | + IRQ_DISABLED)))) { desc->status |= IRQ_PENDING; if (desc->chip->mask) desc->chip->mask(irq); @@ -422,6 +443,15 @@ handle_fasteoi_irq(unsigned int irq, str } desc->status |= IRQ_INPROGRESS; + /* + * In the threaded case we fall back to a mask+eoi sequence: + */ + if (redirect_hardirq(desc)) { + if (desc->chip->mask) + desc->chip->mask(irq); + goto out; + } + desc->status &= ~IRQ_PENDING; spin_unlock(&desc->lock); @@ -431,10 +461,11 @@ handle_fasteoi_irq(unsigned int irq, str spin_lock(&desc->lock); desc->status &= ~IRQ_INPROGRESS; + if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask) + desc->chip->unmask(irq); out: desc->chip->eoi(irq); desc = irq_remap_to_desc(irq, desc); - spin_unlock(&desc->lock); } @@ -476,12 +507,19 @@ handle_edge_irq(unsigned int irq, struct kstat_incr_irqs_this_cpu(irq, desc); /* Start handling the irq */ - desc->chip->ack(irq); + if (desc->chip->ack) + desc->chip->ack(irq); desc = irq_remap_to_desc(irq, desc); /* Mark the IRQ currently in progress.*/ desc->status |= IRQ_INPROGRESS; + /* + * hardirq redirection to the irqd process context: + */ + if (redirect_hardirq(desc)) + goto out_unlock; + do { struct irqaction *action = desc->action; irqreturn_t action_ret; Index: linux-2.6-tip/kernel/irq/handle.c =================================================================== --- linux-2.6-tip.orig/kernel/irq/handle.c +++ linux-2.6-tip/kernel/irq/handle.c @@ -13,10 +13,12 @@ #include #include #include +#include #include #include #include #include +#include #include "internals.h" @@ -69,33 +71,33 @@ int nr_irqs = NR_IRQS; EXPORT_SYMBOL_GPL(nr_irqs); #ifdef CONFIG_SPARSE_IRQ + static struct irq_desc irq_desc_init = { .irq = -1, .status = IRQ_DISABLED, .chip = &no_irq_chip, .handle_irq = handle_bad_irq, .depth = 1, - .lock = __SPIN_LOCK_UNLOCKED(irq_desc_init.lock), -#ifdef CONFIG_SMP - .affinity = CPU_MASK_ALL -#endif + .lock = RAW_SPIN_LOCK_UNLOCKED(irq_desc_init.lock), }; void init_kstat_irqs(struct irq_desc *desc, int cpu, int nr) { - unsigned long bytes; - char *ptr; int node; - - /* Compute how many bytes we need per irq and allocate them */ - bytes = nr * sizeof(unsigned int); + void *ptr; node = cpu_to_node(cpu); - ptr = kzalloc_node(bytes, GFP_ATOMIC, node); - printk(KERN_DEBUG " alloc kstat_irqs on cpu %d node %d\n", cpu, node); + ptr = kzalloc_node(nr * sizeof(*desc->kstat_irqs), GFP_ATOMIC, node); - if (ptr) - desc->kstat_irqs = (unsigned int *)ptr; + /* + * don't overwite if can not get new one + * init_copy_kstat_irqs() could still use old one + */ + if (ptr) { + printk(KERN_DEBUG " alloc kstat_irqs on cpu %d node %d\n", + cpu, node); + desc->kstat_irqs = ptr; + } } static void init_one_irq_desc(int irq, struct irq_desc *desc, int cpu) @@ -103,6 +105,7 @@ static void init_one_irq_desc(int irq, s memcpy(desc, &irq_desc_init, sizeof(struct irq_desc)); spin_lock_init(&desc->lock); + init_waitqueue_head(&desc->wait_for_handler); desc->irq = irq; #ifdef CONFIG_SMP desc->cpu = cpu; @@ -113,6 +116,10 @@ static void init_one_irq_desc(int irq, s printk(KERN_ERR "can not alloc kstat_irqs\n"); BUG_ON(1); } + if (!init_alloc_desc_masks(desc, cpu, false)) { + printk(KERN_ERR "can not alloc irq_desc cpumasks\n"); + BUG_ON(1); + } arch_init_chip_data(desc, cpu); } @@ -121,7 +128,7 @@ static void init_one_irq_desc(int irq, s */ DEFINE_SPINLOCK(sparse_irq_lock); -struct irq_desc *irq_desc_ptrs[NR_IRQS] __read_mostly; +struct irq_desc **irq_desc_ptrs __read_mostly; static struct irq_desc irq_desc_legacy[NR_IRQS_LEGACY] __cacheline_aligned_in_smp = { [0 ... NR_IRQS_LEGACY-1] = { @@ -130,15 +137,11 @@ static struct irq_desc irq_desc_legacy[N .chip = &no_irq_chip, .handle_irq = handle_bad_irq, .depth = 1, - .lock = __SPIN_LOCK_UNLOCKED(irq_desc_init.lock), -#ifdef CONFIG_SMP - .affinity = CPU_MASK_ALL -#endif + .lock = RAW_SPIN_LOCK_UNLOCKED(irq_desc_init.lock), } }; -/* FIXME: use bootmem alloc ...*/ -static unsigned int kstat_irqs_legacy[NR_IRQS_LEGACY][NR_CPUS]; +static unsigned int *kstat_irqs_legacy; int __init early_irq_init(void) { @@ -148,18 +151,30 @@ int __init early_irq_init(void) init_irq_default_affinity(); + /* initialize nr_irqs based on nr_cpu_ids */ + arch_probe_nr_irqs(); + printk(KERN_INFO "NR_IRQS:%d nr_irqs:%d\n", NR_IRQS, nr_irqs); + desc = irq_desc_legacy; legacy_count = ARRAY_SIZE(irq_desc_legacy); + /* allocate irq_desc_ptrs array based on nr_irqs */ + irq_desc_ptrs = alloc_bootmem(nr_irqs * sizeof(void *)); + + /* allocate based on nr_cpu_ids */ + /* FIXME: invert kstat_irgs, and it'd be a per_cpu_alloc'd thing */ + kstat_irqs_legacy = alloc_bootmem(NR_IRQS_LEGACY * nr_cpu_ids * + sizeof(int)); + for (i = 0; i < legacy_count; i++) { desc[i].irq = i; - desc[i].kstat_irqs = kstat_irqs_legacy[i]; + desc[i].kstat_irqs = kstat_irqs_legacy + i * nr_cpu_ids; lockdep_set_class(&desc[i].lock, &irq_desc_lock_class); - + init_alloc_desc_masks(&desc[i], 0, true); irq_desc_ptrs[i] = desc + i; } - for (i = legacy_count; i < NR_IRQS; i++) + for (i = legacy_count; i < nr_irqs; i++) irq_desc_ptrs[i] = NULL; return arch_early_irq_init(); @@ -167,7 +182,10 @@ int __init early_irq_init(void) struct irq_desc *irq_to_desc(unsigned int irq) { - return (irq < NR_IRQS) ? irq_desc_ptrs[irq] : NULL; + if (irq_desc_ptrs && irq < nr_irqs) + return irq_desc_ptrs[irq]; + + return NULL; } struct irq_desc *irq_to_desc_alloc_cpu(unsigned int irq, int cpu) @@ -176,10 +194,9 @@ struct irq_desc *irq_to_desc_alloc_cpu(u unsigned long flags; int node; - if (irq >= NR_IRQS) { - printk(KERN_WARNING "irq >= NR_IRQS in irq_to_desc_alloc: %d %d\n", - irq, NR_IRQS); - WARN_ON(1); + if (irq >= nr_irqs) { + WARN(1, "irq (%d) >= nr_irqs (%d) in irq_to_desc_alloc\n", + irq, nr_irqs); return NULL; } @@ -220,13 +237,11 @@ struct irq_desc irq_desc[NR_IRQS] __cach .chip = &no_irq_chip, .handle_irq = handle_bad_irq, .depth = 1, - .lock = __SPIN_LOCK_UNLOCKED(irq_desc->lock), -#ifdef CONFIG_SMP - .affinity = CPU_MASK_ALL -#endif + .lock = RAW_SPIN_LOCK_UNLOCKED(irq_desc->lock), } }; +static unsigned int kstat_irqs_all[NR_IRQS][NR_CPUS]; int __init early_irq_init(void) { struct irq_desc *desc; @@ -235,12 +250,16 @@ int __init early_irq_init(void) init_irq_default_affinity(); + printk(KERN_INFO "NR_IRQS:%d\n", NR_IRQS); + desc = irq_desc; count = ARRAY_SIZE(irq_desc); - for (i = 0; i < count; i++) + for (i = 0; i < count; i++) { desc[i].irq = i; - + init_alloc_desc_masks(&desc[i], 0, true); + desc[i].kstat_irqs = kstat_irqs_all[i]; + } return arch_early_irq_init(); } @@ -255,6 +274,11 @@ struct irq_desc *irq_to_desc_alloc_cpu(u } #endif /* !CONFIG_SPARSE_IRQ */ +void clear_kstat_irqs(struct irq_desc *desc) +{ + memset(desc->kstat_irqs, 0, nr_cpu_ids * sizeof(*(desc->kstat_irqs))); +} + /* * What should we do if we get a hw irq event on an illegal vector? * Each architecture has to answer this themself. @@ -328,24 +352,85 @@ irqreturn_t handle_IRQ_event(unsigned in irqreturn_t ret, retval = IRQ_NONE; unsigned int status = 0; - if (!(action->flags & IRQF_DISABLED)) - local_irq_enable_in_hardirq(); +#ifdef __i386__ + if (debug_direct_keyboard && irq == 1) + lockdep_off(); +#endif + + /* + * Unconditionally enable interrupts for threaded + * IRQ handlers: + */ + if (!hardirq_count() || !(action->flags & IRQF_DISABLED)) + local_irq_enable(); do { + unsigned int preempt_count = preempt_count(); + ret = action->handler(irq, action->dev_id); + if (preempt_count() != preempt_count) { + print_symbol("BUG: unbalanced irq-handler preempt count in %s!\n", (unsigned long) action->handler); + printk("entered with %08x, exited with %08x.\n", preempt_count, preempt_count()); + dump_stack(); + preempt_count() = preempt_count; + } if (ret == IRQ_HANDLED) status |= action->flags; retval |= ret; action = action->next; } while (action); - if (status & IRQF_SAMPLE_RANDOM) + if (status & IRQF_SAMPLE_RANDOM) { + local_irq_enable(); add_interrupt_randomness(irq); + } local_irq_disable(); +#ifdef __i386__ + if (debug_direct_keyboard && irq == 1) + lockdep_on(); +#endif return retval; } +/* + * Hack - used for development only. + */ +int __read_mostly debug_direct_keyboard = 0; + +int __init debug_direct_keyboard_setup(char *str) +{ + debug_direct_keyboard = 1; + printk(KERN_INFO "Switching IRQ 1 (keyboard) to to direct!\n"); +#ifdef CONFIG_PREEMPT_RT + printk(KERN_INFO "WARNING: kernel may easily crash this way!\n"); +#endif + return 1; +} + +__setup("debug_direct_keyboard", debug_direct_keyboard_setup); + +int redirect_hardirq(struct irq_desc *desc) +{ + /* + * Direct execution: + */ + if (!hardirq_preemption || (desc->status & IRQ_NODELAY) || + !desc->thread) + return 0; + +#ifdef __i386__ + if (debug_direct_keyboard && desc->irq == 1) + return 0; +#endif + + BUG_ON(!irqs_disabled()); + if (desc->thread && desc->thread->state != TASK_RUNNING) + wake_up_process(desc->thread); + + return 1; +} + #ifndef CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ /** * __do_IRQ - original all in one highlevel IRQ handler @@ -385,6 +470,13 @@ unsigned int __do_IRQ(unsigned int irq) desc->chip->end(irq); return 1; } + /* + * If the task is currently running in user mode, don't + * detect soft lockups. If CONFIG_DETECT_SOFTLOCKUP is not + * configured, this should be optimized out. + */ + if (user_mode(get_irq_regs())) + touch_softlockup_watchdog(); spin_lock(&desc->lock); if (desc->chip->ack) { @@ -467,12 +559,10 @@ void early_init_irq_lock_class(void) } } -#ifdef CONFIG_SPARSE_IRQ unsigned int kstat_irqs_cpu(unsigned int irq, int cpu) { struct irq_desc *desc = irq_to_desc(irq); return desc ? desc->kstat_irqs[cpu] : 0; } -#endif EXPORT_SYMBOL(kstat_irqs_cpu); Index: linux-2.6-tip/kernel/irq/internals.h =================================================================== --- linux-2.6-tip.orig/kernel/irq/internals.h +++ linux-2.6-tip/kernel/irq/internals.h @@ -15,8 +15,20 @@ extern int __irq_set_trigger(struct irq_ extern struct lock_class_key irq_desc_lock_class; extern void init_kstat_irqs(struct irq_desc *desc, int cpu, int nr); +extern void clear_kstat_irqs(struct irq_desc *desc); extern spinlock_t sparse_irq_lock; + +#ifdef CONFIG_SPARSE_IRQ +/* irq_desc_ptrs allocated at boot time */ +extern struct irq_desc **irq_desc_ptrs; +#else +/* irq_desc_ptrs is a fixed size array */ extern struct irq_desc *irq_desc_ptrs[NR_IRQS]; +#endif + +extern int redirect_hardirq(struct irq_desc *desc); + +void recalculate_desc_flags(struct irq_desc *desc); #ifdef CONFIG_PROC_FS extern void register_irq_proc(unsigned int irq, struct irq_desc *desc); Index: linux-2.6-tip/kernel/irq/manage.c =================================================================== --- linux-2.6-tip.orig/kernel/irq/manage.c +++ linux-2.6-tip/kernel/irq/manage.c @@ -8,8 +8,10 @@ */ #include -#include #include +#include +#include +#include #include #include @@ -43,8 +45,12 @@ void synchronize_irq(unsigned int irq) * Wait until we're out of the critical section. This might * give the wrong answer due to the lack of memory barriers. */ - while (desc->status & IRQ_INPROGRESS) - cpu_relax(); + if (hardirq_preemption && !(desc->status & IRQ_NODELAY)) + wait_event(desc->wait_for_handler, + !(desc->status & IRQ_INPROGRESS)); + else + while (desc->status & IRQ_INPROGRESS) + cpu_relax(); /* Ok, that indicated we're done: double-check carefully. */ spin_lock_irqsave(&desc->lock, flags); @@ -90,14 +96,14 @@ int irq_set_affinity(unsigned int irq, c #ifdef CONFIG_GENERIC_PENDING_IRQ if (desc->status & IRQ_MOVE_PCNTXT || desc->status & IRQ_DISABLED) { - cpumask_copy(&desc->affinity, cpumask); + cpumask_copy(desc->affinity, cpumask); desc->chip->set_affinity(irq, cpumask); } else { desc->status |= IRQ_MOVE_PENDING; - cpumask_copy(&desc->pending_mask, cpumask); + cpumask_copy(desc->pending_mask, cpumask); } #else - cpumask_copy(&desc->affinity, cpumask); + cpumask_copy(desc->affinity, cpumask); desc->chip->set_affinity(irq, cpumask); #endif desc->status |= IRQ_AFFINITY_SET; @@ -109,7 +115,7 @@ int irq_set_affinity(unsigned int irq, c /* * Generic version of the affinity autoselector. */ -int do_irq_select_affinity(unsigned int irq, struct irq_desc *desc) +static int setup_affinity(unsigned int irq, struct irq_desc *desc) { if (!irq_can_set_affinity(irq)) return 0; @@ -119,21 +125,21 @@ int do_irq_select_affinity(unsigned int * one of the targets is online. */ if (desc->status & (IRQ_AFFINITY_SET | IRQ_NO_BALANCING)) { - if (cpumask_any_and(&desc->affinity, cpu_online_mask) + if (cpumask_any_and(desc->affinity, cpu_online_mask) < nr_cpu_ids) goto set_affinity; else desc->status &= ~IRQ_AFFINITY_SET; } - cpumask_and(&desc->affinity, cpu_online_mask, irq_default_affinity); + cpumask_and(desc->affinity, cpu_online_mask, irq_default_affinity); set_affinity: - desc->chip->set_affinity(irq, &desc->affinity); + desc->chip->set_affinity(irq, desc->affinity); return 0; } #else -static inline int do_irq_select_affinity(unsigned int irq, struct irq_desc *d) +static inline int setup_affinity(unsigned int irq, struct irq_desc *d) { return irq_select_affinity(irq); } @@ -149,14 +155,14 @@ int irq_select_affinity_usr(unsigned int int ret; spin_lock_irqsave(&desc->lock, flags); - ret = do_irq_select_affinity(irq, desc); + ret = setup_affinity(irq, desc); spin_unlock_irqrestore(&desc->lock, flags); return ret; } #else -static inline int do_irq_select_affinity(int irq, struct irq_desc *desc) +static inline int setup_affinity(unsigned int irq, struct irq_desc *desc) { return 0; } @@ -255,6 +261,14 @@ void enable_irq(unsigned int irq) spin_lock_irqsave(&desc->lock, flags); __enable_irq(desc, irq); spin_unlock_irqrestore(&desc->lock, flags); +#ifdef CONFIG_HARDIRQS_SW_RESEND + /* + * Do a bh disable/enable pair to trigger any pending + * irq resend logic: + */ + local_bh_disable(); + local_bh_enable(); +#endif } EXPORT_SYMBOL(enable_irq); @@ -317,6 +331,21 @@ int set_irq_wake(unsigned int irq, unsig EXPORT_SYMBOL(set_irq_wake); /* + * If any action has IRQF_NODELAY then turn IRQ_NODELAY on: + */ +void recalculate_desc_flags(struct irq_desc *desc) +{ + struct irqaction *action; + + desc->status &= ~IRQ_NODELAY; + for (action = desc->action ; action; action = action->next) + if (action->flags & IRQF_NODELAY) + desc->status |= IRQ_NODELAY; +} + +static int start_irq_thread(int irq, struct irq_desc *desc); + +/* * Internal function that tells the architecture code whether a * particular irq has been exclusively allocated or is available * for driver use. @@ -389,9 +418,9 @@ int __irq_set_trigger(struct irq_desc *d * allocate special interrupts that are part of the architecture. */ static int -__setup_irq(unsigned int irq, struct irq_desc * desc, struct irqaction *new) +__setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new) { - struct irqaction *old, **p; + struct irqaction *old, **old_ptr; const char *old_name = NULL; unsigned long flags; int shared = 0; @@ -419,12 +448,15 @@ __setup_irq(unsigned int irq, struct irq rand_initialize_irq(irq); } + if (!(new->flags & IRQF_NODELAY)) + if (start_irq_thread(irq, desc)) + return -ENOMEM; /* * The following block of code has to be executed atomically */ spin_lock_irqsave(&desc->lock, flags); - p = &desc->action; - old = *p; + old_ptr = &desc->action; + old = *old_ptr; if (old) { /* * Can't share interrupts unless both agree to and are @@ -447,8 +479,8 @@ __setup_irq(unsigned int irq, struct irq /* add new interrupt at end of irq queue */ do { - p = &old->next; - old = *p; + old_ptr = &old->next; + old = *old_ptr; } while (old); shared = 1; } @@ -488,7 +520,7 @@ __setup_irq(unsigned int irq, struct irq desc->status |= IRQ_NO_BALANCING; /* Set default affinity mask once everything is setup */ - do_irq_select_affinity(irq, desc); + setup_affinity(irq, desc); } else if ((new->flags & IRQF_TRIGGER_MASK) && (new->flags & IRQF_TRIGGER_MASK) @@ -499,11 +531,17 @@ __setup_irq(unsigned int irq, struct irq (int)(new->flags & IRQF_TRIGGER_MASK)); } - *p = new; + *old_ptr = new; + + /* + * Propagate any possible IRQF_NODELAY flag into IRQ_NODELAY: + */ + recalculate_desc_flags(desc); /* Reset broken irq detection when installing new handler */ desc->irq_count = 0; desc->irqs_unhandled = 0; + init_waitqueue_head(&desc->wait_for_handler); /* * Check whether we disabled the irq via the spurious handler @@ -518,7 +556,7 @@ __setup_irq(unsigned int irq, struct irq new->irq = irq; register_irq_proc(irq, desc); - new->dir = NULL; + new->dir = new->threaded = NULL; register_handler_proc(irq, new); return 0; @@ -567,72 +605,77 @@ int setup_irq(unsigned int irq, struct i void free_irq(unsigned int irq, void *dev_id) { struct irq_desc *desc = irq_to_desc(irq); - struct irqaction **p; + struct irqaction *action, **action_ptr; unsigned long flags; - WARN_ON(in_interrupt()); + WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq); if (!desc) return; spin_lock_irqsave(&desc->lock, flags); - p = &desc->action; + + /* + * There can be multiple actions per IRQ descriptor, find the right + * one based on the dev_id: + */ + action_ptr = &desc->action; for (;;) { - struct irqaction *action = *p; + action = *action_ptr; - if (action) { - struct irqaction **pp = p; + if (!action) { + WARN(1, "Trying to free already-free IRQ %d\n", irq); + spin_unlock_irqrestore(&desc->lock, flags); - p = &action->next; - if (action->dev_id != dev_id) - continue; + return; + } + + if (action->dev_id == dev_id) + break; + action_ptr = &action->next; + } - /* Found it - now remove it from the list of entries */ - *pp = action->next; + /* Found it - now remove it from the list of entries: */ + *action_ptr = action->next; - /* Currently used only by UML, might disappear one day.*/ + /* Currently used only by UML, might disappear one day: */ #ifdef CONFIG_IRQ_RELEASE_METHOD - if (desc->chip->release) - desc->chip->release(irq, dev_id); + if (desc->chip->release) + desc->chip->release(irq, dev_id); #endif - if (!desc->action) { - desc->status |= IRQ_DISABLED; - if (desc->chip->shutdown) - desc->chip->shutdown(irq); - else - desc->chip->disable(irq); - } - spin_unlock_irqrestore(&desc->lock, flags); - unregister_handler_proc(irq, action); + /* If this was the last handler, shut down the IRQ line: */ + if (!desc->action) { + desc->status |= IRQ_DISABLED; + if (desc->chip->shutdown) + desc->chip->shutdown(irq); + else + desc->chip->disable(irq); + } + recalculate_desc_flags(desc); + spin_unlock_irqrestore(&desc->lock, flags); + + unregister_handler_proc(irq, action); + + /* Make sure it's not being used on another CPU: */ + synchronize_irq(irq); - /* Make sure it's not being used on another CPU */ - synchronize_irq(irq); -#ifdef CONFIG_DEBUG_SHIRQ - /* - * It's a shared IRQ -- the driver ought to be - * prepared for it to happen even now it's - * being freed, so let's make sure.... We do - * this after actually deregistering it, to - * make sure that a 'real' IRQ doesn't run in - * parallel with our fake - */ - if (action->flags & IRQF_SHARED) { - local_irq_save(flags); - action->handler(irq, dev_id); - local_irq_restore(flags); - } -#endif - kfree(action); - return; - } - printk(KERN_ERR "Trying to free already-free IRQ %d\n", irq); #ifdef CONFIG_DEBUG_SHIRQ - dump_stack(); -#endif - spin_unlock_irqrestore(&desc->lock, flags); - return; + /* + * It's a shared IRQ -- the driver ought to be prepared for an IRQ + * event to happen even now it's being freed, so let's make sure that + * is so by doing an extra call to the handler .... + * + * ( We do this after actually deregistering it, to make sure that a + * 'real' IRQ doesn't run in * parallel with our fake. ) + */ + if (action->flags & IRQF_SHARED) { + local_irq_save_nort(flags); + action->handler(irq, dev_id); + local_irq_restore_nort(flags); } +#endif + kfree(action); } EXPORT_SYMBOL(free_irq); @@ -679,11 +722,12 @@ int request_irq(unsigned int irq, irq_ha * the behavior is classified as "will not fix" so we need to * start nudging drivers away from using that idiom. */ - if ((irqflags & (IRQF_SHARED|IRQF_DISABLED)) - == (IRQF_SHARED|IRQF_DISABLED)) - pr_warning("IRQ %d/%s: IRQF_DISABLED is not " - "guaranteed on shared IRQs\n", - irq, devname); + if ((irqflags & (IRQF_SHARED|IRQF_DISABLED)) == + (IRQF_SHARED|IRQF_DISABLED)) { + pr_warning( + "IRQ %d/%s: IRQF_DISABLED is not guaranteed on shared IRQs\n", + irq, devname); + } #ifdef CONFIG_LOCKDEP /* @@ -709,7 +753,7 @@ int request_irq(unsigned int irq, irq_ha if (!handler) return -EINVAL; - action = kmalloc(sizeof(struct irqaction), GFP_ATOMIC); + action = kmalloc(sizeof(struct irqaction), GFP_KERNEL); if (!action) return -ENOMEM; @@ -735,14 +779,289 @@ int request_irq(unsigned int irq, irq_ha unsigned long flags; disable_irq(irq); - local_irq_save(flags); + local_irq_save_nort(flags); handler(irq, dev_id); - local_irq_restore(flags); + local_irq_restore_nort(flags); enable_irq(irq); } #endif return retval; } EXPORT_SYMBOL(request_irq); + +#ifdef CONFIG_PREEMPT_HARDIRQS + +int hardirq_preemption = 1; + +EXPORT_SYMBOL(hardirq_preemption); + +/* + * Real-Time Preemption depends on hardirq threading: + */ +#ifndef CONFIG_PREEMPT_RT + +static int __init hardirq_preempt_setup (char *str) +{ + if (!strncmp(str, "off", 3)) + hardirq_preemption = 0; + else + get_option(&str, &hardirq_preemption); + if (!hardirq_preemption) + printk("turning off hardirq preemption!\n"); + + return 1; +} + +__setup("hardirq-preempt=", hardirq_preempt_setup); + +#endif + +/* + * threaded simple handler + */ +static void thread_simple_irq(irq_desc_t *desc) +{ + struct irqaction *action = desc->action; + unsigned int irq = desc->irq; + irqreturn_t action_ret; + + do { + if (!action || desc->depth) + break; + desc->status &= ~IRQ_PENDING; + spin_unlock(&desc->lock); + action_ret = handle_IRQ_event(irq, action); + cond_resched_hardirq_context(); + spin_lock_irq(&desc->lock); + if (!noirqdebug) + note_interrupt(irq, desc, action_ret); + } while (desc->status & IRQ_PENDING); + desc->status &= ~IRQ_INPROGRESS; +} + +/* + * threaded level type irq handler + */ +static void thread_level_irq(irq_desc_t *desc) +{ + unsigned int irq = desc->irq; + + thread_simple_irq(desc); + if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask) + desc->chip->unmask(irq); +} + +/* + * threaded fasteoi type irq handler + */ +static void thread_fasteoi_irq(irq_desc_t *desc) +{ + unsigned int irq = desc->irq; + + thread_simple_irq(desc); + if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask) + desc->chip->unmask(irq); +} + +/* + * threaded edge type IRQ handler + */ +static void thread_edge_irq(irq_desc_t *desc) +{ + unsigned int irq = desc->irq; + + do { + struct irqaction *action = desc->action; + irqreturn_t action_ret; + + if (unlikely(!action)) { + desc->status &= ~IRQ_INPROGRESS; + desc->chip->mask(irq); + return; + } + + /* + * When another irq arrived while we were handling + * one, we could have masked the irq. + * Renable it, if it was not disabled in meantime. + */ + if (unlikely(((desc->status & (IRQ_PENDING | IRQ_MASKED)) == + (IRQ_PENDING | IRQ_MASKED)) && !desc->depth)) + desc->chip->unmask(irq); + + desc->status &= ~IRQ_PENDING; + spin_unlock(&desc->lock); + action_ret = handle_IRQ_event(irq, action); + cond_resched_hardirq_context(); + spin_lock_irq(&desc->lock); + if (!noirqdebug) + note_interrupt(irq, desc, action_ret); + } while ((desc->status & IRQ_PENDING) && !desc->depth); + + desc->status &= ~IRQ_INPROGRESS; +} + +/* + * threaded edge type IRQ handler + */ +static void thread_do_irq(irq_desc_t *desc) +{ + unsigned int irq = desc->irq; + + do { + struct irqaction *action = desc->action; + irqreturn_t action_ret; + + if (unlikely(!action)) { + desc->status &= ~IRQ_INPROGRESS; + desc->chip->disable(irq); + return; + } + + desc->status &= ~IRQ_PENDING; + spin_unlock(&desc->lock); + action_ret = handle_IRQ_event(irq, action); + cond_resched_hardirq_context(); + spin_lock_irq(&desc->lock); + if (!noirqdebug) + note_interrupt(irq, desc, action_ret); + } while ((desc->status & IRQ_PENDING) && !desc->depth); + + desc->status &= ~IRQ_INPROGRESS; + desc->chip->end(irq); +} + +static void do_hardirq(struct irq_desc *desc) +{ + unsigned long flags; + + spin_lock_irqsave(&desc->lock, flags); + + if (!(desc->status & IRQ_INPROGRESS)) + goto out; + + if (desc->handle_irq == handle_simple_irq) + thread_simple_irq(desc); + else if (desc->handle_irq == handle_level_irq) + thread_level_irq(desc); + else if (desc->handle_irq == handle_fasteoi_irq) + thread_fasteoi_irq(desc); + else if (desc->handle_irq == handle_edge_irq) + thread_edge_irq(desc); + else + thread_do_irq(desc); + out: + spin_unlock_irqrestore(&desc->lock, flags); + + if (waitqueue_active(&desc->wait_for_handler)) + wake_up(&desc->wait_for_handler); +} + +extern asmlinkage void __do_softirq(void); + +static int do_irqd(void * __desc) +{ + struct sched_param param = { 0, }; + struct irq_desc *desc = __desc; + +#ifdef CONFIG_SMP + set_cpus_allowed_ptr(current, desc->affinity); +#endif + current->flags |= PF_NOFREEZE | PF_HARDIRQ; + + /* + * Set irq thread priority to SCHED_FIFO/50: + */ + param.sched_priority = MAX_USER_RT_PRIO/2; + + sys_sched_setscheduler(current->pid, SCHED_FIFO, ¶m); + + while (!kthread_should_stop()) { + local_irq_disable_nort(); + set_current_state(TASK_INTERRUPTIBLE); +#ifndef CONFIG_PREEMPT_RT + irq_enter(); +#endif + do_hardirq(desc); +#ifndef CONFIG_PREEMPT_RT + irq_exit(); +#endif + local_irq_enable_nort(); + cond_resched(); +#ifdef CONFIG_SMP + /* + * Did IRQ affinities change? + */ + if (!cpumask_equal(¤t->cpus_allowed, desc->affinity)) + set_cpus_allowed_ptr(current, desc->affinity); +#endif + schedule(); + } + __set_current_state(TASK_RUNNING); + + return 0; +} + +static int ok_to_create_irq_threads; + +static int start_irq_thread(int irq, struct irq_desc *desc) +{ + if (desc->thread || !ok_to_create_irq_threads) + return 0; + + init_waitqueue_head(&desc->wait_for_handler); + + desc->thread = kthread_create(do_irqd, desc, "IRQ-%d", irq); + if (!desc->thread) { + printk(KERN_ERR "irqd: could not create IRQ thread %d!\n", irq); + return -ENOMEM; + } + + /* + * An interrupt may have come in before the thread pointer was + * stored in desc->thread; make sure the thread gets woken up in + * such a case: + */ + smp_mb(); + wake_up_process(desc->thread); + + return 0; +} + +/* + * Start hardirq threads for all IRQs that are registered already. + * + * New ones will be started at the time of IRQ setup from now on. + */ +void __init init_hardirqs(void) +{ + struct irq_desc *desc; + int irq; + + ok_to_create_irq_threads = 1; + + for_each_irq_desc(irq, desc) { + if (desc->action && !(desc->status & IRQ_NODELAY)) + start_irq_thread(irq, desc); + } +} + +#else + +static int start_irq_thread(int irq, struct irq_desc *desc) +{ + return 0; +} + +#endif + +void __init early_init_hardirqs(void) +{ + struct irq_desc *desc; + int i; + + for_each_irq_desc(i, desc) + init_waitqueue_head(&desc->wait_for_handler); +} Index: linux-2.6-tip/kernel/irq/migration.c =================================================================== --- linux-2.6-tip.orig/kernel/irq/migration.c +++ linux-2.6-tip/kernel/irq/migration.c @@ -18,7 +18,7 @@ void move_masked_irq(int irq) desc->status &= ~IRQ_MOVE_PENDING; - if (unlikely(cpumask_empty(&desc->pending_mask))) + if (unlikely(cpumask_empty(desc->pending_mask))) return; if (!desc->chip->set_affinity) @@ -38,18 +38,19 @@ void move_masked_irq(int irq) * For correct operation this depends on the caller * masking the irqs. */ - if (likely(cpumask_any_and(&desc->pending_mask, cpu_online_mask) + if (likely(cpumask_any_and(desc->pending_mask, cpu_online_mask) < nr_cpu_ids)) { - cpumask_and(&desc->affinity, - &desc->pending_mask, cpu_online_mask); - desc->chip->set_affinity(irq, &desc->affinity); + cpumask_and(desc->affinity, + desc->pending_mask, cpu_online_mask); + desc->chip->set_affinity(irq, desc->affinity); } - cpumask_clear(&desc->pending_mask); + cpumask_clear(desc->pending_mask); } void move_native_irq(int irq) { struct irq_desc *desc = irq_to_desc(irq); + int mask = 1; if (likely(!(desc->status & IRQ_MOVE_PENDING))) return; @@ -57,8 +58,17 @@ void move_native_irq(int irq) if (unlikely(desc->status & IRQ_DISABLED)) return; - desc->chip->mask(irq); + /* + * If the irq is already in progress, it should be masked. + * If we unmask it, we might cause an interrupt storm on RT. + */ + if (unlikely(desc->status & IRQ_INPROGRESS)) + mask = 0; + + if (mask) + desc->chip->mask(irq); move_masked_irq(irq); - desc->chip->unmask(irq); + if (mask) + desc->chip->unmask(irq); } Index: linux-2.6-tip/kernel/irq/numa_migrate.c =================================================================== --- linux-2.6-tip.orig/kernel/irq/numa_migrate.c +++ linux-2.6-tip/kernel/irq/numa_migrate.c @@ -17,16 +17,11 @@ static void init_copy_kstat_irqs(struct struct irq_desc *desc, int cpu, int nr) { - unsigned long bytes; - init_kstat_irqs(desc, cpu, nr); - if (desc->kstat_irqs != old_desc->kstat_irqs) { - /* Compute how many bytes we need per irq and allocate them */ - bytes = nr * sizeof(unsigned int); - - memcpy(desc->kstat_irqs, old_desc->kstat_irqs, bytes); - } + if (desc->kstat_irqs != old_desc->kstat_irqs) + memcpy(desc->kstat_irqs, old_desc->kstat_irqs, + nr * sizeof(*desc->kstat_irqs)); } static void free_kstat_irqs(struct irq_desc *old_desc, struct irq_desc *desc) @@ -38,15 +33,23 @@ static void free_kstat_irqs(struct irq_d old_desc->kstat_irqs = NULL; } -static void init_copy_one_irq_desc(int irq, struct irq_desc *old_desc, +static bool init_copy_one_irq_desc(int irq, struct irq_desc *old_desc, struct irq_desc *desc, int cpu) { memcpy(desc, old_desc, sizeof(struct irq_desc)); + if (!init_alloc_desc_masks(desc, cpu, false)) { + printk(KERN_ERR "irq %d: can not get new irq_desc cpumask " + "for migration.\n", irq); + return false; + } spin_lock_init(&desc->lock); + init_waitqueue_head(&desc->wait_for_handler); desc->cpu = cpu; lockdep_set_class(&desc->lock, &irq_desc_lock_class); init_copy_kstat_irqs(old_desc, desc, cpu, nr_cpu_ids); + init_copy_desc_masks(old_desc, desc); arch_init_copy_chip_data(old_desc, desc, cpu); + return true; } static void free_one_irq_desc(struct irq_desc *old_desc, struct irq_desc *desc) @@ -76,12 +79,18 @@ static struct irq_desc *__real_move_irq_ node = cpu_to_node(cpu); desc = kzalloc_node(sizeof(*desc), GFP_ATOMIC, node); if (!desc) { - printk(KERN_ERR "irq %d: can not get new irq_desc for migration.\n", irq); + printk(KERN_ERR "irq %d: can not get new irq_desc " + "for migration.\n", irq); + /* still use old one */ + desc = old_desc; + goto out_unlock; + } + if (!init_copy_one_irq_desc(irq, old_desc, desc, cpu)) { /* still use old one */ + kfree(desc); desc = old_desc; goto out_unlock; } - init_copy_one_irq_desc(irq, old_desc, desc, cpu); irq_desc_ptrs[irq] = desc; spin_unlock_irqrestore(&sparse_irq_lock, flags); Index: linux-2.6-tip/kernel/irq/proc.c =================================================================== --- linux-2.6-tip.orig/kernel/irq/proc.c +++ linux-2.6-tip/kernel/irq/proc.c @@ -7,6 +7,8 @@ */ #include +#include +#include #include #include #include @@ -20,11 +22,11 @@ static struct proc_dir_entry *root_irq_d static int irq_affinity_proc_show(struct seq_file *m, void *v) { struct irq_desc *desc = irq_to_desc((long)m->private); - const struct cpumask *mask = &desc->affinity; + const struct cpumask *mask = desc->affinity; #ifdef CONFIG_GENERIC_PENDING_IRQ if (desc->status & IRQ_MOVE_PENDING) - mask = &desc->pending_mask; + mask = desc->pending_mask; #endif seq_cpumask(m, mask); seq_putc(m, '\n'); @@ -116,6 +118,9 @@ static ssize_t default_affinity_write(st goto out; } + /* create /proc/irq/prof_cpu_mask */ + create_prof_cpu_mask(root_irq_dir); + /* * Do not allow disabling IRQs completely - it's a too easy * way to make the system unusable accidentally :-) At least @@ -160,45 +165,6 @@ static int irq_spurious_read(char *page, jiffies_to_msecs(desc->last_unhandled)); } -#define MAX_NAMELEN 128 - -static int name_unique(unsigned int irq, struct irqaction *new_action) -{ - struct irq_desc *desc = irq_to_desc(irq); - struct irqaction *action; - unsigned long flags; - int ret = 1; - - spin_lock_irqsave(&desc->lock, flags); - for (action = desc->action ; action; action = action->next) { - if ((action != new_action) && action->name && - !strcmp(new_action->name, action->name)) { - ret = 0; - break; - } - } - spin_unlock_irqrestore(&desc->lock, flags); - return ret; -} - -void register_handler_proc(unsigned int irq, struct irqaction *action) -{ - char name [MAX_NAMELEN]; - struct irq_desc *desc = irq_to_desc(irq); - - if (!desc->dir || action->dir || !action->name || - !name_unique(irq, action)) - return; - - memset(name, 0, MAX_NAMELEN); - snprintf(name, MAX_NAMELEN, "%s", action->name); - - /* create /proc/irq/1234/handler/ */ - action->dir = proc_mkdir(name, desc->dir); -} - -#undef MAX_NAMELEN - #define MAX_NAMELEN 10 void register_irq_proc(unsigned int irq, struct irq_desc *desc) @@ -232,6 +198,8 @@ void register_irq_proc(unsigned int irq, void unregister_handler_proc(unsigned int irq, struct irqaction *action) { + if (action->threaded) + remove_proc_entry(action->threaded->name, action->dir); if (action->dir) { struct irq_desc *desc = irq_to_desc(irq); @@ -247,6 +215,91 @@ static void register_default_affinity_pr #endif } +#ifndef CONFIG_PREEMPT_RT + +static int threaded_read_proc(char *page, char **start, off_t off, + int count, int *eof, void *data) +{ + return sprintf(page, "%c\n", + ((struct irqaction *)data)->flags & IRQF_NODELAY ? '0' : '1'); +} + +static int threaded_write_proc(struct file *file, const char __user *buffer, + unsigned long count, void *data) +{ + int c; + struct irqaction *action = data; + irq_desc_t *desc = irq_to_desc(action->irq); + + if (get_user(c, buffer)) + return -EFAULT; + if (c != '0' && c != '1') + return -EINVAL; + + spin_lock_irq(&desc->lock); + + if (c == '0') + action->flags |= IRQF_NODELAY; + if (c == '1') + action->flags &= ~IRQF_NODELAY; + recalculate_desc_flags(desc); + + spin_unlock_irq(&desc->lock); + + return 1; +} + +#endif + +#define MAX_NAMELEN 128 + +static int name_unique(unsigned int irq, struct irqaction *new_action) +{ + struct irq_desc *desc = irq_to_desc(irq); + struct irqaction *action; + + for (action = desc->action ; action; action = action->next) + if ((action != new_action) && action->name && + !strcmp(new_action->name, action->name)) + return 0; + return 1; +} + +void register_handler_proc(unsigned int irq, struct irqaction *action) +{ + char name [MAX_NAMELEN]; + struct irq_desc *desc = irq_to_desc(irq); + + if (!desc->dir || action->dir || !action->name || + !name_unique(irq, action)) + return; + + memset(name, 0, MAX_NAMELEN); + snprintf(name, MAX_NAMELEN, "%s", action->name); + + /* create /proc/irq/1234/handler/ */ + action->dir = proc_mkdir(name, desc->dir); + + if (!action->dir) + return; +#ifndef CONFIG_PREEMPT_RT + { + struct proc_dir_entry *entry; + /* create /proc/irq/1234/handler/threaded */ + entry = create_proc_entry("threaded", 0600, action->dir); + if (!entry) + return; + entry->nlink = 1; + entry->data = (void *)action; + entry->read_proc = threaded_read_proc; + entry->write_proc = threaded_write_proc; + action->threaded = entry; + } +#endif +} + +#undef MAX_NAMELEN + void init_irq_proc(void) { unsigned int irq; Index: linux-2.6-tip/kernel/irq/spurious.c =================================================================== --- linux-2.6-tip.orig/kernel/irq/spurious.c +++ linux-2.6-tip/kernel/irq/spurious.c @@ -14,6 +14,11 @@ #include #include +#ifdef CONFIG_X86_IO_APIC +# include +# include +#endif + static int irqfixup __read_mostly; #define POLL_SPURIOUS_IRQ_INTERVAL (HZ/10) @@ -54,9 +59,8 @@ static int try_one_irq(int irq, struct i } action = action->next; } - local_irq_disable(); /* Now clean up the flags */ - spin_lock(&desc->lock); + spin_lock_irq(&desc->lock); action = desc->action; /* @@ -104,7 +108,7 @@ static int misrouted_irq(int irq) return ok; } -static void poll_spurious_irqs(unsigned long dummy) +static void poll_all_shared_irqs(void) { struct irq_desc *desc; int i; @@ -123,11 +127,23 @@ static void poll_spurious_irqs(unsigned try_one_irq(i, desc); } +} + +static void poll_spurious_irqs(unsigned long dummy) +{ + poll_all_shared_irqs(); mod_timer(&poll_spurious_irq_timer, jiffies + POLL_SPURIOUS_IRQ_INTERVAL); } +#ifdef CONFIG_DEBUG_SHIRQ +void debug_poll_all_shared_irqs(void) +{ + poll_all_shared_irqs(); +} +#endif + /* * If 99,900 of the previous 100,000 interrupts have not been handled * then assume that the IRQ is stuck in some manner. Drop a diagnostic @@ -246,6 +262,12 @@ void note_interrupt(unsigned int irq, st * The interrupt is stuck */ __report_bad_irq(irq, desc, action_ret); +#ifdef CONFIG_X86_IO_APIC + if (!sis_apic_bug) { + sis_apic_bug = 1; + printk(KERN_ERR "turning off IO-APIC fast mode.\n"); + } +#else /* * Now kill the IRQ */ @@ -256,6 +278,7 @@ void note_interrupt(unsigned int irq, st mod_timer(&poll_spurious_irq_timer, jiffies + POLL_SPURIOUS_IRQ_INTERVAL); +#endif } desc->irqs_unhandled = 0; } @@ -276,6 +299,11 @@ MODULE_PARM_DESC(noirqdebug, "Disable ir static int __init irqfixup_setup(char *str) { +#ifdef CONFIG_PREEMPT_RT + printk(KERN_WARNING "irqfixup boot option not supported " + "w/ CONFIG_PREEMPT_RT\n"); + return 1; +#endif irqfixup = 1; printk(KERN_WARNING "Misrouted IRQ fixup support enabled.\n"); printk(KERN_WARNING "This may impact system performance.\n"); @@ -289,6 +317,11 @@ MODULE_PARM_DESC("irqfixup", "0: No fixu static int __init irqpoll_setup(char *str) { +#ifdef CONFIG_PREEMPT_RT + printk(KERN_WARNING "irqpoll boot option not supported " + "w/ CONFIG_PREEMPT_RT\n"); + return 1; +#endif irqfixup = 2; printk(KERN_WARNING "Misrouted IRQ fixup and polling support " "enabled\n"); Index: linux-2.6-tip/kernel/kexec.c =================================================================== --- linux-2.6-tip.orig/kernel/kexec.c +++ linux-2.6-tip/kernel/kexec.c @@ -1130,7 +1130,7 @@ void crash_save_cpu(struct pt_regs *regs return; memset(&prstatus, 0, sizeof(prstatus)); prstatus.pr_pid = current->pid; - elf_core_copy_regs(&prstatus.pr_reg, regs); + elf_core_copy_kernel_regs(&prstatus.pr_reg, regs); buf = append_elf_note(buf, KEXEC_CORE_NOTE_NAME, NT_PRSTATUS, &prstatus, sizeof(prstatus)); final_note(buf); Index: linux-2.6-tip/kernel/kprobes.c =================================================================== --- linux-2.6-tip.orig/kernel/kprobes.c +++ linux-2.6-tip/kernel/kprobes.c @@ -72,10 +72,10 @@ static bool kprobe_enabled; static DEFINE_MUTEX(kprobe_mutex); /* Protects kprobe_table */ static DEFINE_PER_CPU(struct kprobe *, kprobe_instance) = NULL; static struct { - spinlock_t lock ____cacheline_aligned_in_smp; + raw_spinlock_t lock ____cacheline_aligned_in_smp; } kretprobe_table_locks[KPROBE_TABLE_SIZE]; -static spinlock_t *kretprobe_table_lock_ptr(unsigned long hash) +static raw_spinlock_t *kretprobe_table_lock_ptr(unsigned long hash) { return &(kretprobe_table_locks[hash].lock); } @@ -123,7 +123,7 @@ static int collect_garbage_slots(void); static int __kprobes check_safety(void) { int ret = 0; -#if defined(CONFIG_PREEMPT) && defined(CONFIG_FREEZER) +#if defined(CONFIG_PREEMPT) && defined(CONFIG_PM) ret = freeze_processes(); if (ret == 0) { struct task_struct *p, *q; @@ -414,7 +414,7 @@ void __kprobes kretprobe_hash_lock(struc struct hlist_head **head, unsigned long *flags) { unsigned long hash = hash_ptr(tsk, KPROBE_HASH_BITS); - spinlock_t *hlist_lock; + raw_spinlock_t *hlist_lock; *head = &kretprobe_inst_table[hash]; hlist_lock = kretprobe_table_lock_ptr(hash); @@ -424,7 +424,7 @@ void __kprobes kretprobe_hash_lock(struc static void __kprobes kretprobe_table_lock(unsigned long hash, unsigned long *flags) { - spinlock_t *hlist_lock = kretprobe_table_lock_ptr(hash); + raw_spinlock_t *hlist_lock = kretprobe_table_lock_ptr(hash); spin_lock_irqsave(hlist_lock, *flags); } @@ -432,7 +432,7 @@ void __kprobes kretprobe_hash_unlock(str unsigned long *flags) { unsigned long hash = hash_ptr(tsk, KPROBE_HASH_BITS); - spinlock_t *hlist_lock; + raw_spinlock_t *hlist_lock; hlist_lock = kretprobe_table_lock_ptr(hash); spin_unlock_irqrestore(hlist_lock, *flags); @@ -440,7 +440,7 @@ void __kprobes kretprobe_hash_unlock(str void __kprobes kretprobe_table_unlock(unsigned long hash, unsigned long *flags) { - spinlock_t *hlist_lock = kretprobe_table_lock_ptr(hash); + raw_spinlock_t *hlist_lock = kretprobe_table_lock_ptr(hash); spin_unlock_irqrestore(hlist_lock, *flags); } Index: linux-2.6-tip/kernel/latencytop.c =================================================================== --- linux-2.6-tip.orig/kernel/latencytop.c +++ linux-2.6-tip/kernel/latencytop.c @@ -9,6 +9,44 @@ * as published by the Free Software Foundation; version 2 * of the License. */ + +/* + * CONFIG_LATENCYTOP enables a kernel latency tracking infrastructure that is + * used by the "latencytop" userspace tool. The latency that is tracked is not + * the 'traditional' interrupt latency (which is primarily caused by something + * else consuming CPU), but instead, it is the latency an application encounters + * because the kernel sleeps on its behalf for various reasons. + * + * This code tracks 2 levels of statistics: + * 1) System level latency + * 2) Per process latency + * + * The latency is stored in fixed sized data structures in an accumulated form; + * if the "same" latency cause is hit twice, this will be tracked as one entry + * in the data structure. Both the count, total accumulated latency and maximum + * latency are tracked in this data structure. When the fixed size structure is + * full, no new causes are tracked until the buffer is flushed by writing to + * the /proc file; the userspace tool does this on a regular basis. + * + * A latency cause is identified by a stringified backtrace at the point that + * the scheduler gets invoked. The userland tool will use this string to + * identify the cause of the latency in human readable form. + * + * The information is exported via /proc/latency_stats and /proc//latency. + * These files look like this: + * + * Latency Top version : v0.1 + * 70 59433 4897 i915_irq_wait drm_ioctl vfs_ioctl do_vfs_ioctl sys_ioctl + * | | | | + * | | | +----> the stringified backtrace + * | | +---------> The maximum latency for this entry in microseconds + * | +--------------> The accumulated latency for this entry (microseconds) + * +-------------------> The number of times this entry is hit + * + * (note: the average latency is the accumulated latency divided by the number + * of times) + */ + #include #include #include @@ -72,7 +110,7 @@ account_global_scheduler_latency(struct firstnonnull = i; continue; } - for (q = 0 ; q < LT_BACKTRACEDEPTH ; q++) { + for (q = 0; q < LT_BACKTRACEDEPTH; q++) { unsigned long record = lat->backtrace[q]; if (latency_record[i].backtrace[q] != record) { @@ -101,31 +139,52 @@ account_global_scheduler_latency(struct memcpy(&latency_record[i], lat, sizeof(struct latency_record)); } -static inline void store_stacktrace(struct task_struct *tsk, struct latency_record *lat) +/* + * Iterator to store a backtrace into a latency record entry + */ +static inline void store_stacktrace(struct task_struct *tsk, + struct latency_record *lat) { struct stack_trace trace; memset(&trace, 0, sizeof(trace)); trace.max_entries = LT_BACKTRACEDEPTH; trace.entries = &lat->backtrace[0]; - trace.skip = 0; save_stack_trace_tsk(tsk, &trace); } +/** + * __account_scheduler_latency - record an occured latency + * @tsk - the task struct of the task hitting the latency + * @usecs - the duration of the latency in microseconds + * @inter - 1 if the sleep was interruptible, 0 if uninterruptible + * + * This function is the main entry point for recording latency entries + * as called by the scheduler. + * + * This function has a few special cases to deal with normal 'non-latency' + * sleeps: specifically, interruptible sleep longer than 5 msec is skipped + * since this usually is caused by waiting for events via select() and co. + * + * Negative latencies (caused by time going backwards) are also explicitly + * skipped. + */ void __sched -account_scheduler_latency(struct task_struct *tsk, int usecs, int inter) +__account_scheduler_latency(struct task_struct *tsk, int usecs, int inter) { unsigned long flags; int i, q; struct latency_record lat; - if (!latencytop_enabled) - return; - /* Long interruptible waits are generally user requested... */ if (inter && usecs > 5000) return; + /* Negative sleeps are time going backwards */ + /* Zero-time sleeps are non-interesting */ + if (usecs <= 0) + return; + memset(&lat, 0, sizeof(lat)); lat.count = 1; lat.time = usecs; @@ -143,12 +202,12 @@ account_scheduler_latency(struct task_st if (tsk->latency_record_count >= LT_SAVECOUNT) goto out_unlock; - for (i = 0; i < LT_SAVECOUNT ; i++) { + for (i = 0; i < LT_SAVECOUNT; i++) { struct latency_record *mylat; int same = 1; mylat = &tsk->latency_record[i]; - for (q = 0 ; q < LT_BACKTRACEDEPTH ; q++) { + for (q = 0; q < LT_BACKTRACEDEPTH; q++) { unsigned long record = lat.backtrace[q]; if (mylat->backtrace[q] != record) { @@ -186,7 +245,7 @@ static int lstats_show(struct seq_file * for (i = 0; i < MAXLR; i++) { if (latency_record[i].backtrace[0]) { int q; - seq_printf(m, "%i %li %li ", + seq_printf(m, "%i %lu %lu ", latency_record[i].count, latency_record[i].time, latency_record[i].max); @@ -223,7 +282,7 @@ static int lstats_open(struct inode *ino return single_open(filp, lstats_show, NULL); } -static struct file_operations lstats_fops = { +static const struct file_operations lstats_fops = { .open = lstats_open, .read = seq_read, .write = lstats_write, @@ -236,4 +295,4 @@ static int __init init_lstats_procfs(voi proc_create("latency_stats", 0644, NULL, &lstats_fops); return 0; } -__initcall(init_lstats_procfs); +device_initcall(init_lstats_procfs); Index: linux-2.6-tip/kernel/lockdep.c =================================================================== --- linux-2.6-tip.orig/kernel/lockdep.c +++ linux-2.6-tip/kernel/lockdep.c @@ -41,6 +41,7 @@ #include #include #include +#include #include @@ -68,7 +69,7 @@ module_param(lock_stat, int, 0644); * to use a raw spinlock - we really dont want the spinlock * code to recurse back into the lockdep code... */ -static raw_spinlock_t lockdep_lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; +static __raw_spinlock_t lockdep_lock = (__raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; static int graph_lock(void) { @@ -310,12 +311,14 @@ EXPORT_SYMBOL(lockdep_on); #if VERBOSE # define HARDIRQ_VERBOSE 1 # define SOFTIRQ_VERBOSE 1 +# define RECLAIM_VERBOSE 1 #else # define HARDIRQ_VERBOSE 0 # define SOFTIRQ_VERBOSE 0 +# define RECLAIM_VERBOSE 0 #endif -#if VERBOSE || HARDIRQ_VERBOSE || SOFTIRQ_VERBOSE +#if VERBOSE || HARDIRQ_VERBOSE || SOFTIRQ_VERBOSE || RECLAIM_VERBOSE /* * Quick filtering for interesting events: */ @@ -443,17 +446,18 @@ atomic_t nr_find_usage_backwards_recursi * Locking printouts: */ +#define __USAGE(__STATE) \ + [LOCK_USED_IN_##__STATE] = "IN-"__stringify(__STATE)"-W", \ + [LOCK_ENABLED_##__STATE] = __stringify(__STATE)"-ON-W", \ + [LOCK_USED_IN_##__STATE##_READ] = "IN-"__stringify(__STATE)"-R",\ + [LOCK_ENABLED_##__STATE##_READ] = __stringify(__STATE)"-ON-R", + static const char *usage_str[] = { - [LOCK_USED] = "initial-use ", - [LOCK_USED_IN_HARDIRQ] = "in-hardirq-W", - [LOCK_USED_IN_SOFTIRQ] = "in-softirq-W", - [LOCK_ENABLED_SOFTIRQS] = "softirq-on-W", - [LOCK_ENABLED_HARDIRQS] = "hardirq-on-W", - [LOCK_USED_IN_HARDIRQ_READ] = "in-hardirq-R", - [LOCK_USED_IN_SOFTIRQ_READ] = "in-softirq-R", - [LOCK_ENABLED_SOFTIRQS_READ] = "softirq-on-R", - [LOCK_ENABLED_HARDIRQS_READ] = "hardirq-on-R", +#define LOCKDEP_STATE(__STATE) __USAGE(__STATE) +#include "lockdep_states.h" +#undef LOCKDEP_STATE + [LOCK_USED] = "INITIAL USE", }; const char * __get_key_name(struct lockdep_subclass_key *key, char *str) @@ -461,46 +465,45 @@ const char * __get_key_name(struct lockd return kallsyms_lookup((unsigned long)key, NULL, NULL, NULL, str); } -void -get_usage_chars(struct lock_class *class, char *c1, char *c2, char *c3, char *c4) +static inline unsigned long lock_flag(enum lock_usage_bit bit) { - *c1 = '.', *c2 = '.', *c3 = '.', *c4 = '.'; - - if (class->usage_mask & LOCKF_USED_IN_HARDIRQ) - *c1 = '+'; - else - if (class->usage_mask & LOCKF_ENABLED_HARDIRQS) - *c1 = '-'; + return 1UL << bit; +} - if (class->usage_mask & LOCKF_USED_IN_SOFTIRQ) - *c2 = '+'; - else - if (class->usage_mask & LOCKF_ENABLED_SOFTIRQS) - *c2 = '-'; +static char get_usage_char(struct lock_class *class, enum lock_usage_bit bit) +{ + char c = '.'; - if (class->usage_mask & LOCKF_ENABLED_HARDIRQS_READ) - *c3 = '-'; - if (class->usage_mask & LOCKF_USED_IN_HARDIRQ_READ) { - *c3 = '+'; - if (class->usage_mask & LOCKF_ENABLED_HARDIRQS_READ) - *c3 = '?'; - } - - if (class->usage_mask & LOCKF_ENABLED_SOFTIRQS_READ) - *c4 = '-'; - if (class->usage_mask & LOCKF_USED_IN_SOFTIRQ_READ) { - *c4 = '+'; - if (class->usage_mask & LOCKF_ENABLED_SOFTIRQS_READ) - *c4 = '?'; + if (class->usage_mask & lock_flag(bit + 2)) + c = '+'; + if (class->usage_mask & lock_flag(bit)) { + c = '-'; + if (class->usage_mask & lock_flag(bit + 2)) + c = '?'; } + + return c; +} + +void get_usage_chars(struct lock_class *class, char usage[LOCK_USAGE_CHARS]) +{ + int i = 0; + +#define LOCKDEP_STATE(__STATE) \ + usage[i++] = get_usage_char(class, LOCK_USED_IN_##__STATE); \ + usage[i++] = get_usage_char(class, LOCK_USED_IN_##__STATE##_READ); +#include "lockdep_states.h" +#undef LOCKDEP_STATE + + usage[i] = '\0'; } static void print_lock_name(struct lock_class *class) { - char str[KSYM_NAME_LEN], c1, c2, c3, c4; + char str[KSYM_NAME_LEN], usage[LOCK_USAGE_CHARS]; const char *name; - get_usage_chars(class, &c1, &c2, &c3, &c4); + get_usage_chars(class, usage); name = class->name; if (!name) { @@ -513,7 +516,7 @@ static void print_lock_name(struct lock_ if (class->subclass) printk("/%d", class->subclass); } - printk("){%c%c%c%c}", c1, c2, c3, c4); + printk("){%s}", usage); } static void print_lockdep_cache(struct lockdep_map *lock) @@ -846,6 +849,21 @@ out_unlock_set: return class; } +#if defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_TRACE_IRQFLAGS) + +#define RECURSION_LIMIT 40 + +static int noinline print_infinite_recursion_bug(void) +{ + if (!debug_locks_off_graph_unlock()) + return 0; + + WARN_ON(1); + + return 0; +} +#endif /* CONFIG_PROVE_LOCKING || CONFIG_TRACE_IRQFLAGS */ + #ifdef CONFIG_PROVE_LOCKING /* * Allocate a lockdep entry. (assumes the graph_lock held, returns @@ -976,18 +994,6 @@ static noinline int print_circular_bug_t return 0; } -#define RECURSION_LIMIT 40 - -static int noinline print_infinite_recursion_bug(void) -{ - if (!debug_locks_off_graph_unlock()) - return 0; - - WARN_ON(1); - - return 0; -} - unsigned long __lockdep_count_forward_deps(struct lock_class *class, unsigned int depth) { @@ -1180,6 +1186,7 @@ find_usage_backwards(struct lock_class * return 1; } +#ifdef CONFIG_PROVE_LOCKING static int print_bad_irq_dependency(struct task_struct *curr, struct held_lock *prev, @@ -1240,6 +1247,7 @@ print_bad_irq_dependency(struct task_str return 0; } +#endif /* CONFIG_PROVE_LOCKING */ static int check_usage(struct task_struct *curr, struct held_lock *prev, @@ -1263,9 +1271,49 @@ check_usage(struct task_struct *curr, st bit_backwards, bit_forwards, irqclass); } -static int -check_prev_add_irq(struct task_struct *curr, struct held_lock *prev, - struct held_lock *next) +static const char *state_names[] = { +#define LOCKDEP_STATE(__STATE) \ + __stringify(__STATE), +#include "lockdep_states.h" +#undef LOCKDEP_STATE +}; + +static const char *state_rnames[] = { +#define LOCKDEP_STATE(__STATE) \ + __stringify(__STATE)"-READ", +#include "lockdep_states.h" +#undef LOCKDEP_STATE +}; + +static inline const char *state_name(enum lock_usage_bit bit) +{ + return (bit & 1) ? state_rnames[bit >> 2] : state_names[bit >> 2]; +} + +static int exclusive_bit(int new_bit) +{ + /* + * USED_IN + * USED_IN_READ + * ENABLED + * ENABLED_READ + * + * bit 0 - write/read + * bit 1 - used_in/enabled + * bit 2+ state + */ + + int state = new_bit & ~3; + int dir = new_bit & 2; + + /* + * keep state, bit flip the direction and strip read. + */ + return state | (dir ^ 2); +} + +static int check_irq_usage(struct task_struct *curr, struct held_lock *prev, + struct held_lock *next, enum lock_usage_bit bit) { /* * Prove that the new dependency does not connect a hardirq-safe @@ -1273,38 +1321,34 @@ check_prev_add_irq(struct task_struct *c * the backwards-subgraph starting at , and the * forwards-subgraph starting at : */ - if (!check_usage(curr, prev, next, LOCK_USED_IN_HARDIRQ, - LOCK_ENABLED_HARDIRQS, "hard")) + if (!check_usage(curr, prev, next, bit, + exclusive_bit(bit), state_name(bit))) return 0; + bit++; /* _READ */ + /* * Prove that the new dependency does not connect a hardirq-safe-read * lock with a hardirq-unsafe lock - to achieve this we search * the backwards-subgraph starting at , and the * forwards-subgraph starting at : */ - if (!check_usage(curr, prev, next, LOCK_USED_IN_HARDIRQ_READ, - LOCK_ENABLED_HARDIRQS, "hard-read")) + if (!check_usage(curr, prev, next, bit, + exclusive_bit(bit), state_name(bit))) return 0; - /* - * Prove that the new dependency does not connect a softirq-safe - * lock with a softirq-unsafe lock - to achieve this we search - * the backwards-subgraph starting at , and the - * forwards-subgraph starting at : - */ - if (!check_usage(curr, prev, next, LOCK_USED_IN_SOFTIRQ, - LOCK_ENABLED_SOFTIRQS, "soft")) - return 0; - /* - * Prove that the new dependency does not connect a softirq-safe-read - * lock with a softirq-unsafe lock - to achieve this we search - * the backwards-subgraph starting at , and the - * forwards-subgraph starting at : - */ - if (!check_usage(curr, prev, next, LOCK_USED_IN_SOFTIRQ_READ, - LOCK_ENABLED_SOFTIRQS, "soft")) + return 1; +} + +static int +check_prev_add_irq(struct task_struct *curr, struct held_lock *prev, + struct held_lock *next) +{ +#define LOCKDEP_STATE(__STATE) \ + if (!check_irq_usage(curr, prev, next, LOCK_USED_IN_##__STATE)) \ return 0; +#include "lockdep_states.h" +#undef LOCKDEP_STATE return 1; } @@ -1933,7 +1977,7 @@ void print_irqtrace_events(struct task_s print_ip_sym(curr->softirq_disable_ip); } -static int hardirq_verbose(struct lock_class *class) +static int HARDIRQ_verbose(struct lock_class *class) { #if HARDIRQ_VERBOSE return class_filter(class); @@ -1941,7 +1985,7 @@ static int hardirq_verbose(struct lock_c return 0; } -static int softirq_verbose(struct lock_class *class) +static int SOFTIRQ_verbose(struct lock_class *class) { #if SOFTIRQ_VERBOSE return class_filter(class); @@ -1949,185 +1993,94 @@ static int softirq_verbose(struct lock_c return 0; } +static int RECLAIM_FS_verbose(struct lock_class *class) +{ +#if RECLAIM_VERBOSE + return class_filter(class); +#endif + return 0; +} + #define STRICT_READ_CHECKS 1 -static int mark_lock_irq(struct task_struct *curr, struct held_lock *this, - enum lock_usage_bit new_bit) +static int (*state_verbose_f[])(struct lock_class *class) = { +#define LOCKDEP_STATE(__STATE) \ + __STATE##_verbose, +#include "lockdep_states.h" +#undef LOCKDEP_STATE +}; + +static inline int state_verbose(enum lock_usage_bit bit, + struct lock_class *class) { - int ret = 1; + return state_verbose_f[bit >> 2](class); +} - switch(new_bit) { - case LOCK_USED_IN_HARDIRQ: - if (!valid_state(curr, this, new_bit, LOCK_ENABLED_HARDIRQS)) - return 0; - if (!valid_state(curr, this, new_bit, - LOCK_ENABLED_HARDIRQS_READ)) - return 0; - /* - * just marked it hardirq-safe, check that this lock - * took no hardirq-unsafe lock in the past: - */ - if (!check_usage_forwards(curr, this, - LOCK_ENABLED_HARDIRQS, "hard")) - return 0; -#if STRICT_READ_CHECKS - /* - * just marked it hardirq-safe, check that this lock - * took no hardirq-unsafe-read lock in the past: - */ - if (!check_usage_forwards(curr, this, - LOCK_ENABLED_HARDIRQS_READ, "hard-read")) - return 0; -#endif - if (hardirq_verbose(hlock_class(this))) - ret = 2; - break; - case LOCK_USED_IN_SOFTIRQ: - if (!valid_state(curr, this, new_bit, LOCK_ENABLED_SOFTIRQS)) - return 0; - if (!valid_state(curr, this, new_bit, - LOCK_ENABLED_SOFTIRQS_READ)) - return 0; - /* - * just marked it softirq-safe, check that this lock - * took no softirq-unsafe lock in the past: - */ - if (!check_usage_forwards(curr, this, - LOCK_ENABLED_SOFTIRQS, "soft")) - return 0; -#if STRICT_READ_CHECKS - /* - * just marked it softirq-safe, check that this lock - * took no softirq-unsafe-read lock in the past: - */ - if (!check_usage_forwards(curr, this, - LOCK_ENABLED_SOFTIRQS_READ, "soft-read")) - return 0; -#endif - if (softirq_verbose(hlock_class(this))) - ret = 2; - break; - case LOCK_USED_IN_HARDIRQ_READ: - if (!valid_state(curr, this, new_bit, LOCK_ENABLED_HARDIRQS)) - return 0; - /* - * just marked it hardirq-read-safe, check that this lock - * took no hardirq-unsafe lock in the past: - */ - if (!check_usage_forwards(curr, this, - LOCK_ENABLED_HARDIRQS, "hard")) - return 0; - if (hardirq_verbose(hlock_class(this))) - ret = 2; - break; - case LOCK_USED_IN_SOFTIRQ_READ: - if (!valid_state(curr, this, new_bit, LOCK_ENABLED_SOFTIRQS)) - return 0; - /* - * just marked it softirq-read-safe, check that this lock - * took no softirq-unsafe lock in the past: - */ - if (!check_usage_forwards(curr, this, - LOCK_ENABLED_SOFTIRQS, "soft")) - return 0; - if (softirq_verbose(hlock_class(this))) - ret = 2; - break; - case LOCK_ENABLED_HARDIRQS: - if (!valid_state(curr, this, new_bit, LOCK_USED_IN_HARDIRQ)) - return 0; - if (!valid_state(curr, this, new_bit, - LOCK_USED_IN_HARDIRQ_READ)) - return 0; - /* - * just marked it hardirq-unsafe, check that no hardirq-safe - * lock in the system ever took it in the past: - */ - if (!check_usage_backwards(curr, this, - LOCK_USED_IN_HARDIRQ, "hard")) - return 0; -#if STRICT_READ_CHECKS - /* - * just marked it hardirq-unsafe, check that no - * hardirq-safe-read lock in the system ever took - * it in the past: - */ - if (!check_usage_backwards(curr, this, - LOCK_USED_IN_HARDIRQ_READ, "hard-read")) - return 0; -#endif - if (hardirq_verbose(hlock_class(this))) - ret = 2; - break; - case LOCK_ENABLED_SOFTIRQS: - if (!valid_state(curr, this, new_bit, LOCK_USED_IN_SOFTIRQ)) - return 0; - if (!valid_state(curr, this, new_bit, - LOCK_USED_IN_SOFTIRQ_READ)) - return 0; - /* - * just marked it softirq-unsafe, check that no softirq-safe - * lock in the system ever took it in the past: - */ - if (!check_usage_backwards(curr, this, - LOCK_USED_IN_SOFTIRQ, "soft")) - return 0; -#if STRICT_READ_CHECKS - /* - * just marked it softirq-unsafe, check that no - * softirq-safe-read lock in the system ever took - * it in the past: - */ - if (!check_usage_backwards(curr, this, - LOCK_USED_IN_SOFTIRQ_READ, "soft-read")) - return 0; -#endif - if (softirq_verbose(hlock_class(this))) - ret = 2; - break; - case LOCK_ENABLED_HARDIRQS_READ: - if (!valid_state(curr, this, new_bit, LOCK_USED_IN_HARDIRQ)) - return 0; -#if STRICT_READ_CHECKS - /* - * just marked it hardirq-read-unsafe, check that no - * hardirq-safe lock in the system ever took it in the past: - */ - if (!check_usage_backwards(curr, this, - LOCK_USED_IN_HARDIRQ, "hard")) - return 0; -#endif - if (hardirq_verbose(hlock_class(this))) - ret = 2; - break; - case LOCK_ENABLED_SOFTIRQS_READ: - if (!valid_state(curr, this, new_bit, LOCK_USED_IN_SOFTIRQ)) +typedef int (*check_usage_f)(struct task_struct *, struct held_lock *, + enum lock_usage_bit bit, const char *name); + +static int +mark_lock_irq(struct task_struct *curr, struct held_lock *this, int new_bit) +{ + int excl_bit = exclusive_bit(new_bit); + int read = new_bit & 1; + int dir = new_bit & 2; + + /* + * mark USED_IN has to look forwards -- to ensure no dependency + * has ENABLED state, which would allow recursion deadlocks. + * + * mark ENABLED has to look backwards -- to ensure no dependee + * has USED_IN state, which, again, would allow recursion deadlocks. + */ + check_usage_f usage = dir ? + check_usage_backwards : check_usage_forwards; + + /* + * Validate that this particular lock does not have conflicting + * usage states. + */ + if (!valid_state(curr, this, new_bit, excl_bit)) + return 0; + + /* + * Validate that the lock dependencies don't have conflicting usage + * states. + */ + if ((!read || !dir || STRICT_READ_CHECKS) && + !usage(curr, this, excl_bit, state_name(new_bit))) + return 0; + + /* + * Check for read in write conflicts + */ + if (!read) { + if (!valid_state(curr, this, new_bit, excl_bit + 1)) return 0; -#if STRICT_READ_CHECKS - /* - * just marked it softirq-read-unsafe, check that no - * softirq-safe lock in the system ever took it in the past: - */ - if (!check_usage_backwards(curr, this, - LOCK_USED_IN_SOFTIRQ, "soft")) + + if (STRICT_READ_CHECKS && + !usage(curr, this, excl_bit + 1, + state_name(new_bit + 1))) return 0; -#endif - if (softirq_verbose(hlock_class(this))) - ret = 2; - break; - default: - WARN_ON(1); - break; } - return ret; + if (state_verbose(new_bit, hlock_class(this))) + return 2; + + return 1; } +enum mark_type { +#define LOCKDEP_STATE(__STATE) __STATE, +#include "lockdep_states.h" +#undef LOCKDEP_STATE +}; + /* * Mark all held locks with a usage bit: */ static int -mark_held_locks(struct task_struct *curr, int hardirq) +mark_held_locks(struct task_struct *curr, enum mark_type mark) { enum lock_usage_bit usage_bit; struct held_lock *hlock; @@ -2136,17 +2089,12 @@ mark_held_locks(struct task_struct *curr for (i = 0; i < curr->lockdep_depth; i++) { hlock = curr->held_locks + i; - if (hardirq) { - if (hlock->read) - usage_bit = LOCK_ENABLED_HARDIRQS_READ; - else - usage_bit = LOCK_ENABLED_HARDIRQS; - } else { - if (hlock->read) - usage_bit = LOCK_ENABLED_SOFTIRQS_READ; - else - usage_bit = LOCK_ENABLED_SOFTIRQS; - } + usage_bit = 2 + (mark << 2); /* ENABLED */ + if (hlock->read) + usage_bit += 1; /* READ */ + + BUG_ON(usage_bit >= LOCK_USAGE_STATES); + if (!mark_lock(curr, hlock, usage_bit)) return 0; } @@ -2200,7 +2148,7 @@ void trace_hardirqs_on_caller(unsigned l * We are going to turn hardirqs on, so set the * usage bit for all held locks: */ - if (!mark_held_locks(curr, 1)) + if (!mark_held_locks(curr, HARDIRQ)) return; /* * If we have softirqs enabled, then set the usage @@ -2208,7 +2156,7 @@ void trace_hardirqs_on_caller(unsigned l * this bit from being set before) */ if (curr->softirqs_enabled) - if (!mark_held_locks(curr, 0)) + if (!mark_held_locks(curr, SOFTIRQ)) return; curr->hardirq_enable_ip = ip; @@ -2288,7 +2236,7 @@ void trace_softirqs_on(unsigned long ip) * enabled too: */ if (curr->hardirqs_enabled) - mark_held_locks(curr, 0); + mark_held_locks(curr, SOFTIRQ); } /* @@ -2317,6 +2265,31 @@ void trace_softirqs_off(unsigned long ip debug_atomic_inc(&redundant_softirqs_off); } +void lockdep_trace_alloc(gfp_t gfp_mask) +{ + struct task_struct *curr = current; + + if (unlikely(!debug_locks)) + return; + + /* no reclaim without waiting on it */ + if (!(gfp_mask & __GFP_WAIT)) + return; + + /* this guy won't enter reclaim */ + if ((curr->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) + return; + + /* We're only interested __GFP_FS allocations for now */ + if (!(gfp_mask & __GFP_FS)) + return; + + if (DEBUG_LOCKS_WARN_ON(irqs_disabled())) + return; + + mark_held_locks(curr, RECLAIM_FS); +} + static int mark_irqflags(struct task_struct *curr, struct held_lock *hlock) { /* @@ -2345,19 +2318,35 @@ static int mark_irqflags(struct task_str if (!hlock->hardirqs_off) { if (hlock->read) { if (!mark_lock(curr, hlock, - LOCK_ENABLED_HARDIRQS_READ)) + LOCK_ENABLED_HARDIRQ_READ)) return 0; if (curr->softirqs_enabled) if (!mark_lock(curr, hlock, - LOCK_ENABLED_SOFTIRQS_READ)) + LOCK_ENABLED_SOFTIRQ_READ)) return 0; } else { if (!mark_lock(curr, hlock, - LOCK_ENABLED_HARDIRQS)) + LOCK_ENABLED_HARDIRQ)) return 0; if (curr->softirqs_enabled) if (!mark_lock(curr, hlock, - LOCK_ENABLED_SOFTIRQS)) + LOCK_ENABLED_SOFTIRQ)) + return 0; + } + } + + /* + * We reuse the irq context infrastructure more broadly as a general + * context checking code. This tests GFP_FS recursion (a lock taken + * during reclaim for a GFP_FS allocation is held over a GFP_FS + * allocation). + */ + if (!hlock->trylock && (curr->lockdep_reclaim_gfp & __GFP_FS)) { + if (hlock->read) { + if (!mark_lock(curr, hlock, LOCK_USED_IN_RECLAIM_FS_READ)) + return 0; + } else { + if (!mark_lock(curr, hlock, LOCK_USED_IN_RECLAIM_FS)) return 0; } } @@ -2412,6 +2401,10 @@ static inline int separate_irq_context(s return 0; } +void lockdep_trace_alloc(gfp_t gfp_mask) +{ +} + #endif /* @@ -2445,14 +2438,13 @@ static int mark_lock(struct task_struct return 0; switch (new_bit) { - case LOCK_USED_IN_HARDIRQ: - case LOCK_USED_IN_SOFTIRQ: - case LOCK_USED_IN_HARDIRQ_READ: - case LOCK_USED_IN_SOFTIRQ_READ: - case LOCK_ENABLED_HARDIRQS: - case LOCK_ENABLED_SOFTIRQS: - case LOCK_ENABLED_HARDIRQS_READ: - case LOCK_ENABLED_SOFTIRQS_READ: +#define LOCKDEP_STATE(__STATE) \ + case LOCK_USED_IN_##__STATE: \ + case LOCK_USED_IN_##__STATE##_READ: \ + case LOCK_ENABLED_##__STATE: \ + case LOCK_ENABLED_##__STATE##_READ: +#include "lockdep_states.h" +#undef LOCKDEP_STATE ret = mark_lock_irq(curr, this, new_bit); if (!ret) return 0; @@ -2966,6 +2958,16 @@ void lock_release(struct lockdep_map *lo } EXPORT_SYMBOL_GPL(lock_release); +void lockdep_set_current_reclaim_state(gfp_t gfp_mask) +{ + current->lockdep_reclaim_gfp = gfp_mask; +} + +void lockdep_clear_current_reclaim_state(void) +{ + current->lockdep_reclaim_gfp = 0; +} + #ifdef CONFIG_LOCK_STAT static int print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock, Index: linux-2.6-tip/kernel/lockdep_internals.h =================================================================== --- linux-2.6-tip.orig/kernel/lockdep_internals.h +++ linux-2.6-tip/kernel/lockdep_internals.h @@ -7,6 +7,45 @@ */ /* + * Lock-class usage-state bits: + */ +enum lock_usage_bit { +#define LOCKDEP_STATE(__STATE) \ + LOCK_USED_IN_##__STATE, \ + LOCK_USED_IN_##__STATE##_READ, \ + LOCK_ENABLED_##__STATE, \ + LOCK_ENABLED_##__STATE##_READ, +#include "lockdep_states.h" +#undef LOCKDEP_STATE + LOCK_USED, + LOCK_USAGE_STATES +}; + +/* + * Usage-state bitmasks: + */ +#define __LOCKF(__STATE) LOCKF_##__STATE = (1 << LOCK_##__STATE), + +enum { +#define LOCKDEP_STATE(__STATE) \ + __LOCKF(USED_IN_##__STATE) \ + __LOCKF(USED_IN_##__STATE##_READ) \ + __LOCKF(ENABLED_##__STATE) \ + __LOCKF(ENABLED_##__STATE##_READ) +#include "lockdep_states.h" +#undef LOCKDEP_STATE + __LOCKF(USED) +}; + +#define LOCKF_ENABLED_IRQ (LOCKF_ENABLED_HARDIRQ | LOCKF_ENABLED_SOFTIRQ) +#define LOCKF_USED_IN_IRQ (LOCKF_USED_IN_HARDIRQ | LOCKF_USED_IN_SOFTIRQ) + +#define LOCKF_ENABLED_IRQ_READ \ + (LOCKF_ENABLED_HARDIRQ_READ | LOCKF_ENABLED_SOFTIRQ_READ) +#define LOCKF_USED_IN_IRQ_READ \ + (LOCKF_USED_IN_HARDIRQ_READ | LOCKF_USED_IN_SOFTIRQ_READ) + +/* * MAX_LOCKDEP_ENTRIES is the maximum number of lock dependencies * we track. * @@ -31,8 +70,10 @@ extern struct list_head all_lock_classes; extern struct lock_chain lock_chains[]; -extern void -get_usage_chars(struct lock_class *class, char *c1, char *c2, char *c3, char *c4); +#define LOCK_USAGE_CHARS (1+LOCK_USAGE_STATES/2) + +extern void get_usage_chars(struct lock_class *class, + char usage[LOCK_USAGE_CHARS]); extern const char * __get_key_name(struct lockdep_subclass_key *key, char *str); Index: linux-2.6-tip/kernel/lockdep_proc.c =================================================================== --- linux-2.6-tip.orig/kernel/lockdep_proc.c +++ linux-2.6-tip/kernel/lockdep_proc.c @@ -84,7 +84,7 @@ static int l_show(struct seq_file *m, vo { struct lock_class *class = v; struct lock_list *entry; - char c1, c2, c3, c4; + char usage[LOCK_USAGE_CHARS]; if (v == SEQ_START_TOKEN) { seq_printf(m, "all lock classes:\n"); @@ -100,8 +100,8 @@ static int l_show(struct seq_file *m, vo seq_printf(m, " BD:%5ld", lockdep_count_backward_deps(class)); #endif - get_usage_chars(class, &c1, &c2, &c3, &c4); - seq_printf(m, " %c%c%c%c", c1, c2, c3, c4); + get_usage_chars(class, usage); + seq_printf(m, " %s", usage); seq_printf(m, ": "); print_name(m, class); @@ -300,27 +300,27 @@ static int lockdep_stats_show(struct seq nr_uncategorized++; if (class->usage_mask & LOCKF_USED_IN_IRQ) nr_irq_safe++; - if (class->usage_mask & LOCKF_ENABLED_IRQS) + if (class->usage_mask & LOCKF_ENABLED_IRQ) nr_irq_unsafe++; if (class->usage_mask & LOCKF_USED_IN_SOFTIRQ) nr_softirq_safe++; - if (class->usage_mask & LOCKF_ENABLED_SOFTIRQS) + if (class->usage_mask & LOCKF_ENABLED_SOFTIRQ) nr_softirq_unsafe++; if (class->usage_mask & LOCKF_USED_IN_HARDIRQ) nr_hardirq_safe++; - if (class->usage_mask & LOCKF_ENABLED_HARDIRQS) + if (class->usage_mask & LOCKF_ENABLED_HARDIRQ) nr_hardirq_unsafe++; if (class->usage_mask & LOCKF_USED_IN_IRQ_READ) nr_irq_read_safe++; - if (class->usage_mask & LOCKF_ENABLED_IRQS_READ) + if (class->usage_mask & LOCKF_ENABLED_IRQ_READ) nr_irq_read_unsafe++; if (class->usage_mask & LOCKF_USED_IN_SOFTIRQ_READ) nr_softirq_read_safe++; - if (class->usage_mask & LOCKF_ENABLED_SOFTIRQS_READ) + if (class->usage_mask & LOCKF_ENABLED_SOFTIRQ_READ) nr_softirq_read_unsafe++; if (class->usage_mask & LOCKF_USED_IN_HARDIRQ_READ) nr_hardirq_read_safe++; - if (class->usage_mask & LOCKF_ENABLED_HARDIRQS_READ) + if (class->usage_mask & LOCKF_ENABLED_HARDIRQ_READ) nr_hardirq_read_unsafe++; #ifdef CONFIG_PROVE_LOCKING @@ -601,6 +601,10 @@ static void seq_stats(struct seq_file *m static void seq_header(struct seq_file *m) { seq_printf(m, "lock_stat version 0.3\n"); + + if (unlikely(!debug_locks)) + seq_printf(m, "*WARNING* lock debugging disabled!! - possibly due to a lockdep warning\n"); + seq_line(m, '-', 0, 40 + 1 + 10 * (14 + 1)); seq_printf(m, "%40s %14s %14s %14s %14s %14s %14s %14s %14s " "%14s %14s\n", Index: linux-2.6-tip/kernel/lockdep_states.h =================================================================== --- /dev/null +++ linux-2.6-tip/kernel/lockdep_states.h @@ -0,0 +1,9 @@ +/* + * Lockdep states, + * + * please update XXX_LOCK_USAGE_STATES in include/linux/lockdep.h whenever + * you add one, or come up with a nice dynamic solution. + */ +LOCKDEP_STATE(HARDIRQ) +LOCKDEP_STATE(SOFTIRQ) +LOCKDEP_STATE(RECLAIM_FS) Index: linux-2.6-tip/kernel/marker.c =================================================================== --- linux-2.6-tip.orig/kernel/marker.c +++ linux-2.6-tip/kernel/marker.c @@ -432,7 +432,7 @@ static int remove_marker(const char *nam { struct hlist_head *head; struct hlist_node *node; - struct marker_entry *e; + struct marker_entry *uninitialized_var(e); int found = 0; size_t len = strlen(name) + 1; u32 hash = jhash(name, len-1, 0); Index: linux-2.6-tip/kernel/module.c =================================================================== --- linux-2.6-tip.orig/kernel/module.c +++ linux-2.6-tip/kernel/module.c @@ -2735,7 +2735,7 @@ int is_module_address(unsigned long addr /* Is this a valid kernel address? */ -__notrace_funcgraph struct module *__module_text_address(unsigned long addr) +struct module *__module_text_address(unsigned long addr) { struct module *mod; Index: linux-2.6-tip/kernel/mutex-debug.c =================================================================== --- linux-2.6-tip.orig/kernel/mutex-debug.c +++ linux-2.6-tip/kernel/mutex-debug.c @@ -26,11 +26,6 @@ /* * Must be called with lock->wait_lock held. */ -void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner) -{ - lock->owner = new_owner; -} - void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter) { memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter)); @@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex /* Mark the current thread as blocked on the lock: */ ti->task->blocked_on = waiter; - waiter->lock = lock; } void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter, @@ -82,7 +76,7 @@ void debug_mutex_unlock(struct mutex *lo DEBUG_LOCKS_WARN_ON(lock->magic != lock); DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info()); DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next); - DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info()); + mutex_clear_owner(lock); } void debug_mutex_init(struct mutex *lock, const char *name, @@ -95,7 +89,6 @@ void debug_mutex_init(struct mutex *lock debug_check_no_locks_freed((void *)lock, sizeof(*lock)); lockdep_init_map(&lock->dep_map, name, key, 0); #endif - lock->owner = NULL; lock->magic = lock; } Index: linux-2.6-tip/kernel/mutex-debug.h =================================================================== --- linux-2.6-tip.orig/kernel/mutex-debug.h +++ linux-2.6-tip/kernel/mutex-debug.h @@ -13,14 +13,6 @@ /* * This must be called with lock->wait_lock held. */ -extern void -debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner); - -static inline void debug_mutex_clear_owner(struct mutex *lock) -{ - lock->owner = NULL; -} - extern void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter); extern void debug_mutex_wake_waiter(struct mutex *lock, @@ -35,6 +27,16 @@ extern void debug_mutex_unlock(struct mu extern void debug_mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key); +static inline void mutex_set_owner(struct mutex *lock) +{ + lock->owner = current_thread_info(); +} + +static inline void mutex_clear_owner(struct mutex *lock) +{ + lock->owner = NULL; +} + #define spin_lock_mutex(lock, flags) \ do { \ struct mutex *l = container_of(lock, struct mutex, wait_lock); \ Index: linux-2.6-tip/kernel/mutex.c =================================================================== --- linux-2.6-tip.orig/kernel/mutex.c +++ linux-2.6-tip/kernel/mutex.c @@ -10,6 +10,11 @@ * Many thanks to Arjan van de Ven, Thomas Gleixner, Steven Rostedt and * David Howells for suggestions and improvements. * + * - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline + * from the -rt tree, where it was originally implemented for rtmutexes + * by Steven Rostedt, based on work by Gregory Haskins, Peter Morreale + * and Sven Dietrich. + * * Also see Documentation/mutex-design.txt. */ #include @@ -46,6 +51,7 @@ __mutex_init(struct mutex *lock, const c atomic_set(&lock->count, 1); spin_lock_init(&lock->wait_lock); INIT_LIST_HEAD(&lock->wait_list); + mutex_clear_owner(lock); debug_mutex_init(lock, name, key); } @@ -91,6 +97,7 @@ void inline __sched mutex_lock(struct mu * 'unlocked' into 'locked' state. */ __mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath); + mutex_set_owner(lock); } EXPORT_SYMBOL(mutex_lock); @@ -115,6 +122,14 @@ void __sched mutex_unlock(struct mutex * * The unlocking fastpath is the 0->1 transition from 'locked' * into 'unlocked' state: */ +#ifndef CONFIG_DEBUG_MUTEXES + /* + * When debugging is enabled we must not clear the owner before time, + * the slow path will always be taken, and that clears the owner field + * after verifying that it was indeed current. + */ + mutex_clear_owner(lock); +#endif __mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath); } @@ -129,21 +144,75 @@ __mutex_lock_common(struct mutex *lock, { struct task_struct *task = current; struct mutex_waiter waiter; - unsigned int old_val; unsigned long flags; + preempt_disable(); + mutex_acquire(&lock->dep_map, subclass, 0, ip); +#if defined(CONFIG_SMP) && !defined(CONFIG_DEBUG_MUTEXES) + /* + * Optimistic spinning. + * + * We try to spin for acquisition when we find that there are no + * pending waiters and the lock owner is currently running on a + * (different) CPU. + * + * The rationale is that if the lock owner is running, it is likely to + * release the lock soon. + * + * Since this needs the lock owner, and this mutex implementation + * doesn't track the owner atomically in the lock field, we need to + * track it non-atomically. + * + * We can't do this for DEBUG_MUTEXES because that relies on wait_lock + * to serialize everything. + */ + + for (;;) { + struct thread_info *owner; + + /* + * If there's an owner, wait for it to either + * release the lock or go to sleep. + */ + owner = ACCESS_ONCE(lock->owner); + if (owner && !mutex_spin_on_owner(lock, owner)) + break; + + if (atomic_cmpxchg(&lock->count, 1, 0) == 1) { + lock_acquired(&lock->dep_map, ip); + mutex_set_owner(lock); + preempt_enable(); + return 0; + } + + /* + * When there's no owner, we might have preempted between the + * owner acquiring the lock and setting the owner field. If + * we're an RT task that will live-lock because we won't let + * the owner complete. + */ + if (!owner && (need_resched() || rt_task(task))) + break; + + /* + * The cpu_relax() call is a compiler barrier which forces + * everything in this loop to be re-loaded. We don't need + * memory barriers as we'll eventually observe the right + * values at the cost of a few extra spins. + */ + cpu_relax(); + } +#endif spin_lock_mutex(&lock->wait_lock, flags); debug_mutex_lock_common(lock, &waiter); - mutex_acquire(&lock->dep_map, subclass, 0, ip); debug_mutex_add_waiter(lock, &waiter, task_thread_info(task)); /* add waiting tasks to the end of the waitqueue (FIFO): */ list_add_tail(&waiter.list, &lock->wait_list); waiter.task = task; - old_val = atomic_xchg(&lock->count, -1); - if (old_val == 1) + if (atomic_xchg(&lock->count, -1) == 1) goto done; lock_contended(&lock->dep_map, ip); @@ -158,8 +227,7 @@ __mutex_lock_common(struct mutex *lock, * that when we release the lock, we properly wake up the * other waiters: */ - old_val = atomic_xchg(&lock->count, -1); - if (old_val == 1) + if (atomic_xchg(&lock->count, -1) == 1) break; /* @@ -173,21 +241,28 @@ __mutex_lock_common(struct mutex *lock, spin_unlock_mutex(&lock->wait_lock, flags); debug_mutex_free_waiter(&waiter); + preempt_enable(); return -EINTR; } __set_task_state(task, state); /* didnt get the lock, go to sleep: */ spin_unlock_mutex(&lock->wait_lock, flags); - schedule(); + + local_irq_disable(); + __preempt_enable_no_resched(); + __schedule(); + preempt_disable(); + local_irq_enable(); + spin_lock_mutex(&lock->wait_lock, flags); } done: lock_acquired(&lock->dep_map, ip); /* got the lock - rejoice! */ - mutex_remove_waiter(lock, &waiter, task_thread_info(task)); - debug_mutex_set_owner(lock, task_thread_info(task)); + mutex_remove_waiter(lock, &waiter, current_thread_info()); + mutex_set_owner(lock); /* set it to 0 if there are no waiters left: */ if (likely(list_empty(&lock->wait_list))) @@ -196,6 +271,7 @@ done: spin_unlock_mutex(&lock->wait_lock, flags); debug_mutex_free_waiter(&waiter); + preempt_enable(); return 0; } @@ -222,7 +298,8 @@ int __sched mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass) { might_sleep(); - return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, subclass, _RET_IP_); + return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, + subclass, _RET_IP_); } EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested); @@ -260,8 +337,6 @@ __mutex_unlock_common_slowpath(atomic_t wake_up_process(waiter->task); } - debug_mutex_clear_owner(lock); - spin_unlock_mutex(&lock->wait_lock, flags); } @@ -298,18 +373,30 @@ __mutex_lock_interruptible_slowpath(atom */ int __sched mutex_lock_interruptible(struct mutex *lock) { + int ret; + might_sleep(); - return __mutex_fastpath_lock_retval + ret = __mutex_fastpath_lock_retval (&lock->count, __mutex_lock_interruptible_slowpath); + if (!ret) + mutex_set_owner(lock); + + return ret; } EXPORT_SYMBOL(mutex_lock_interruptible); int __sched mutex_lock_killable(struct mutex *lock) { + int ret; + might_sleep(); - return __mutex_fastpath_lock_retval + ret = __mutex_fastpath_lock_retval (&lock->count, __mutex_lock_killable_slowpath); + if (!ret) + mutex_set_owner(lock); + + return ret; } EXPORT_SYMBOL(mutex_lock_killable); @@ -352,9 +439,10 @@ static inline int __mutex_trylock_slowpa prev = atomic_xchg(&lock->count, -1); if (likely(prev == 1)) { - debug_mutex_set_owner(lock, current_thread_info()); + mutex_set_owner(lock); mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_); } + /* Set it back to 0 if there are no waiters: */ if (likely(list_empty(&lock->wait_list))) atomic_set(&lock->count, 0); @@ -380,8 +468,13 @@ static inline int __mutex_trylock_slowpa */ int __sched mutex_trylock(struct mutex *lock) { - return __mutex_fastpath_trylock(&lock->count, - __mutex_trylock_slowpath); + int ret; + + ret = __mutex_fastpath_trylock(&lock->count, __mutex_trylock_slowpath); + if (ret) + mutex_set_owner(lock); + + return ret; } EXPORT_SYMBOL(mutex_trylock); Index: linux-2.6-tip/kernel/mutex.h =================================================================== --- linux-2.6-tip.orig/kernel/mutex.h +++ linux-2.6-tip/kernel/mutex.h @@ -16,8 +16,26 @@ #define mutex_remove_waiter(lock, waiter, ti) \ __list_del((waiter)->list.prev, (waiter)->list.next) -#define debug_mutex_set_owner(lock, new_owner) do { } while (0) -#define debug_mutex_clear_owner(lock) do { } while (0) +#ifdef CONFIG_SMP +static inline void mutex_set_owner(struct mutex *lock) +{ + lock->owner = current_thread_info(); +} + +static inline void mutex_clear_owner(struct mutex *lock) +{ + lock->owner = NULL; +} +#else +static inline void mutex_set_owner(struct mutex *lock) +{ +} + +static inline void mutex_clear_owner(struct mutex *lock) +{ +} +#endif + #define debug_mutex_wake_waiter(lock, waiter) do { } while (0) #define debug_mutex_free_waiter(waiter) do { } while (0) #define debug_mutex_add_waiter(lock, waiter, ti) do { } while (0) Index: linux-2.6-tip/kernel/panic.c =================================================================== --- linux-2.6-tip.orig/kernel/panic.c +++ linux-2.6-tip/kernel/panic.c @@ -74,6 +74,9 @@ NORET_TYPE void panic(const char * fmt, vsnprintf(buf, sizeof(buf), fmt, args); va_end(args); printk(KERN_EMERG "Kernel panic - not syncing: %s\n",buf); +#ifdef CONFIG_DEBUG_BUGVERBOSE + dump_stack(); +#endif bust_spinlocks(0); /* @@ -355,15 +358,18 @@ EXPORT_SYMBOL(warn_slowpath); #endif #ifdef CONFIG_CC_STACKPROTECTOR + /* * Called when gcc's -fstack-protector feature is used, and * gcc detects corruption of the on-stack canary value */ void __stack_chk_fail(void) { - panic("stack-protector: Kernel stack is corrupted"); + panic("stack-protector: Kernel stack is corrupted in: %p\n", + __builtin_return_address(0)); } EXPORT_SYMBOL(__stack_chk_fail); + #endif core_param(panic, panic_timeout, int, 0644); Index: linux-2.6-tip/kernel/perf_counter.c =================================================================== --- /dev/null +++ linux-2.6-tip/kernel/perf_counter.c @@ -0,0 +1,2208 @@ +/* + * Performance counter core code + * + * Copyright(C) 2008 Thomas Gleixner + * Copyright(C) 2008 Red Hat, Inc., Ingo Molnar + * + * For licencing details see kernel-base/COPYING + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* + * Each CPU has a list of per CPU counters: + */ +DEFINE_PER_CPU(struct perf_cpu_context, perf_cpu_context); + +int perf_max_counters __read_mostly = 1; +static int perf_reserved_percpu __read_mostly; +static int perf_overcommit __read_mostly = 1; + +/* + * Mutex for (sysadmin-configurable) counter reservations: + */ +static DEFINE_MUTEX(perf_resource_mutex); + +/* + * Architecture provided APIs - weak aliases: + */ +extern __weak const struct hw_perf_counter_ops * +hw_perf_counter_init(struct perf_counter *counter) +{ + return NULL; +} + +u64 __weak hw_perf_save_disable(void) { return 0; } +void __weak hw_perf_restore(u64 ctrl) { barrier(); } +void __weak hw_perf_counter_setup(int cpu) { barrier(); } +int __weak hw_perf_group_sched_in(struct perf_counter *group_leader, + struct perf_cpu_context *cpuctx, + struct perf_counter_context *ctx, int cpu) +{ + return 0; +} + +void __weak perf_counter_print_debug(void) { } + +static void +list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx) +{ + struct perf_counter *group_leader = counter->group_leader; + + /* + * Depending on whether it is a standalone or sibling counter, + * add it straight to the context's counter list, or to the group + * leader's sibling list: + */ + if (counter->group_leader == counter) + list_add_tail(&counter->list_entry, &ctx->counter_list); + else + list_add_tail(&counter->list_entry, &group_leader->sibling_list); +} + +static void +list_del_counter(struct perf_counter *counter, struct perf_counter_context *ctx) +{ + struct perf_counter *sibling, *tmp; + + list_del_init(&counter->list_entry); + + /* + * If this was a group counter with sibling counters then + * upgrade the siblings to singleton counters by adding them + * to the context list directly: + */ + list_for_each_entry_safe(sibling, tmp, + &counter->sibling_list, list_entry) { + + list_del_init(&sibling->list_entry); + list_add_tail(&sibling->list_entry, &ctx->counter_list); + sibling->group_leader = sibling; + } +} + +static void +counter_sched_out(struct perf_counter *counter, + struct perf_cpu_context *cpuctx, + struct perf_counter_context *ctx) +{ + if (counter->state != PERF_COUNTER_STATE_ACTIVE) + return; + + counter->state = PERF_COUNTER_STATE_INACTIVE; + counter->hw_ops->disable(counter); + counter->oncpu = -1; + + if (!is_software_counter(counter)) + cpuctx->active_oncpu--; + ctx->nr_active--; + if (counter->hw_event.exclusive || !cpuctx->active_oncpu) + cpuctx->exclusive = 0; +} + +static void +group_sched_out(struct perf_counter *group_counter, + struct perf_cpu_context *cpuctx, + struct perf_counter_context *ctx) +{ + struct perf_counter *counter; + + if (group_counter->state != PERF_COUNTER_STATE_ACTIVE) + return; + + counter_sched_out(group_counter, cpuctx, ctx); + + /* + * Schedule out siblings (if any): + */ + list_for_each_entry(counter, &group_counter->sibling_list, list_entry) + counter_sched_out(counter, cpuctx, ctx); + + if (group_counter->hw_event.exclusive) + cpuctx->exclusive = 0; +} + +/* + * Cross CPU call to remove a performance counter + * + * We disable the counter on the hardware level first. After that we + * remove it from the context list. + */ +static void __perf_counter_remove_from_context(void *info) +{ + struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context); + struct perf_counter *counter = info; + struct perf_counter_context *ctx = counter->ctx; + unsigned long flags; + u64 perf_flags; + + /* + * If this is a task context, we need to check whether it is + * the current task context of this cpu. If not it has been + * scheduled out before the smp call arrived. + */ + if (ctx->task && cpuctx->task_ctx != ctx) + return; + + curr_rq_lock_irq_save(&flags); + spin_lock(&ctx->lock); + + counter_sched_out(counter, cpuctx, ctx); + + counter->task = NULL; + ctx->nr_counters--; + + /* + * Protect the list operation against NMI by disabling the + * counters on a global level. NOP for non NMI based counters. + */ + perf_flags = hw_perf_save_disable(); + list_del_counter(counter, ctx); + hw_perf_restore(perf_flags); + + if (!ctx->task) { + /* + * Allow more per task counters with respect to the + * reservation: + */ + cpuctx->max_pertask = + min(perf_max_counters - ctx->nr_counters, + perf_max_counters - perf_reserved_percpu); + } + + spin_unlock(&ctx->lock); + curr_rq_unlock_irq_restore(&flags); +} + + +/* + * Remove the counter from a task's (or a CPU's) list of counters. + * + * Must be called with counter->mutex and ctx->mutex held. + * + * CPU counters are removed with a smp call. For task counters we only + * call when the task is on a CPU. + */ +static void perf_counter_remove_from_context(struct perf_counter *counter) +{ + struct perf_counter_context *ctx = counter->ctx; + struct task_struct *task = ctx->task; + + if (!task) { + /* + * Per cpu counters are removed via an smp call and + * the removal is always sucessful. + */ + smp_call_function_single(counter->cpu, + __perf_counter_remove_from_context, + counter, 1); + return; + } + +retry: + task_oncpu_function_call(task, __perf_counter_remove_from_context, + counter); + + spin_lock_irq(&ctx->lock); + /* + * If the context is active we need to retry the smp call. + */ + if (ctx->nr_active && !list_empty(&counter->list_entry)) { + spin_unlock_irq(&ctx->lock); + goto retry; + } + + /* + * The lock prevents that this context is scheduled in so we + * can remove the counter safely, if the call above did not + * succeed. + */ + if (!list_empty(&counter->list_entry)) { + ctx->nr_counters--; + list_del_counter(counter, ctx); + counter->task = NULL; + } + spin_unlock_irq(&ctx->lock); +} + +/* + * Cross CPU call to disable a performance counter + */ +static void __perf_counter_disable(void *info) +{ + struct perf_counter *counter = info; + struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context); + struct perf_counter_context *ctx = counter->ctx; + unsigned long flags; + + /* + * If this is a per-task counter, need to check whether this + * counter's task is the current task on this cpu. + */ + if (ctx->task && cpuctx->task_ctx != ctx) + return; + + curr_rq_lock_irq_save(&flags); + spin_lock(&ctx->lock); + + /* + * If the counter is on, turn it off. + * If it is in error state, leave it in error state. + */ + if (counter->state >= PERF_COUNTER_STATE_INACTIVE) { + if (counter == counter->group_leader) + group_sched_out(counter, cpuctx, ctx); + else + counter_sched_out(counter, cpuctx, ctx); + counter->state = PERF_COUNTER_STATE_OFF; + } + + spin_unlock(&ctx->lock); + curr_rq_unlock_irq_restore(&flags); +} + +/* + * Disable a counter. + */ +static void perf_counter_disable(struct perf_counter *counter) +{ + struct perf_counter_context *ctx = counter->ctx; + struct task_struct *task = ctx->task; + + if (!task) { + /* + * Disable the counter on the cpu that it's on + */ + smp_call_function_single(counter->cpu, __perf_counter_disable, + counter, 1); + return; + } + + retry: + task_oncpu_function_call(task, __perf_counter_disable, counter); + + spin_lock_irq(&ctx->lock); + /* + * If the counter is still active, we need to retry the cross-call. + */ + if (counter->state == PERF_COUNTER_STATE_ACTIVE) { + spin_unlock_irq(&ctx->lock); + goto retry; + } + + /* + * Since we have the lock this context can't be scheduled + * in, so we can change the state safely. + */ + if (counter->state == PERF_COUNTER_STATE_INACTIVE) + counter->state = PERF_COUNTER_STATE_OFF; + + spin_unlock_irq(&ctx->lock); +} + +/* + * Disable a counter and all its children. + */ +static void perf_counter_disable_family(struct perf_counter *counter) +{ + struct perf_counter *child; + + perf_counter_disable(counter); + + /* + * Lock the mutex to protect the list of children + */ + mutex_lock(&counter->mutex); + list_for_each_entry(child, &counter->child_list, child_list) + perf_counter_disable(child); + mutex_unlock(&counter->mutex); +} + +static int +counter_sched_in(struct perf_counter *counter, + struct perf_cpu_context *cpuctx, + struct perf_counter_context *ctx, + int cpu) +{ + if (counter->state <= PERF_COUNTER_STATE_OFF) + return 0; + + counter->state = PERF_COUNTER_STATE_ACTIVE; + counter->oncpu = cpu; /* TODO: put 'cpu' into cpuctx->cpu */ + /* + * The new state must be visible before we turn it on in the hardware: + */ + smp_wmb(); + + if (counter->hw_ops->enable(counter)) { + counter->state = PERF_COUNTER_STATE_INACTIVE; + counter->oncpu = -1; + return -EAGAIN; + } + + if (!is_software_counter(counter)) + cpuctx->active_oncpu++; + ctx->nr_active++; + + if (counter->hw_event.exclusive) + cpuctx->exclusive = 1; + + return 0; +} + +/* + * Return 1 for a group consisting entirely of software counters, + * 0 if the group contains any hardware counters. + */ +static int is_software_only_group(struct perf_counter *leader) +{ + struct perf_counter *counter; + + if (!is_software_counter(leader)) + return 0; + list_for_each_entry(counter, &leader->sibling_list, list_entry) + if (!is_software_counter(counter)) + return 0; + return 1; +} + +/* + * Work out whether we can put this counter group on the CPU now. + */ +static int group_can_go_on(struct perf_counter *counter, + struct perf_cpu_context *cpuctx, + int can_add_hw) +{ + /* + * Groups consisting entirely of software counters can always go on. + */ + if (is_software_only_group(counter)) + return 1; + /* + * If an exclusive group is already on, no other hardware + * counters can go on. + */ + if (cpuctx->exclusive) + return 0; + /* + * If this group is exclusive and there are already + * counters on the CPU, it can't go on. + */ + if (counter->hw_event.exclusive && cpuctx->active_oncpu) + return 0; + /* + * Otherwise, try to add it if all previous groups were able + * to go on. + */ + return can_add_hw; +} + +/* + * Cross CPU call to install and enable a performance counter + */ +static void __perf_install_in_context(void *info) +{ + struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context); + struct perf_counter *counter = info; + struct perf_counter_context *ctx = counter->ctx; + struct perf_counter *leader = counter->group_leader; + int cpu = smp_processor_id(); + unsigned long flags; + u64 perf_flags; + int err; + + /* + * If this is a task context, we need to check whether it is + * the current task context of this cpu. If not it has been + * scheduled out before the smp call arrived. + */ + if (ctx->task && cpuctx->task_ctx != ctx) + return; + + curr_rq_lock_irq_save(&flags); + spin_lock(&ctx->lock); + + /* + * Protect the list operation against NMI by disabling the + * counters on a global level. NOP for non NMI based counters. + */ + perf_flags = hw_perf_save_disable(); + + list_add_counter(counter, ctx); + ctx->nr_counters++; + counter->prev_state = PERF_COUNTER_STATE_OFF; + + /* + * Don't put the counter on if it is disabled or if + * it is in a group and the group isn't on. + */ + if (counter->state != PERF_COUNTER_STATE_INACTIVE || + (leader != counter && leader->state != PERF_COUNTER_STATE_ACTIVE)) + goto unlock; + + /* + * An exclusive counter can't go on if there are already active + * hardware counters, and no hardware counter can go on if there + * is already an exclusive counter on. + */ + if (!group_can_go_on(counter, cpuctx, 1)) + err = -EEXIST; + else + err = counter_sched_in(counter, cpuctx, ctx, cpu); + + if (err) { + /* + * This counter couldn't go on. If it is in a group + * then we have to pull the whole group off. + * If the counter group is pinned then put it in error state. + */ + if (leader != counter) + group_sched_out(leader, cpuctx, ctx); + if (leader->hw_event.pinned) + leader->state = PERF_COUNTER_STATE_ERROR; + } + + if (!err && !ctx->task && cpuctx->max_pertask) + cpuctx->max_pertask--; + + unlock: + hw_perf_restore(perf_flags); + + spin_unlock(&ctx->lock); + curr_rq_unlock_irq_restore(&flags); +} + +/* + * Attach a performance counter to a context + * + * First we add the counter to the list with the hardware enable bit + * in counter->hw_config cleared. + * + * If the counter is attached to a task which is on a CPU we use a smp + * call to enable it in the task context. The task might have been + * scheduled away, but we check this in the smp call again. + * + * Must be called with ctx->mutex held. + */ +static void +perf_install_in_context(struct perf_counter_context *ctx, + struct perf_counter *counter, + int cpu) +{ + struct task_struct *task = ctx->task; + + if (!task) { + /* + * Per cpu counters are installed via an smp call and + * the install is always sucessful. + */ + smp_call_function_single(cpu, __perf_install_in_context, + counter, 1); + return; + } + + counter->task = task; +retry: + task_oncpu_function_call(task, __perf_install_in_context, + counter); + + spin_lock_irq(&ctx->lock); + /* + * we need to retry the smp call. + */ + if (ctx->is_active && list_empty(&counter->list_entry)) { + spin_unlock_irq(&ctx->lock); + goto retry; + } + + /* + * The lock prevents that this context is scheduled in so we + * can add the counter safely, if it the call above did not + * succeed. + */ + if (list_empty(&counter->list_entry)) { + list_add_counter(counter, ctx); + ctx->nr_counters++; + } + spin_unlock_irq(&ctx->lock); +} + +/* + * Cross CPU call to enable a performance counter + */ +static void __perf_counter_enable(void *info) +{ + struct perf_counter *counter = info; + struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context); + struct perf_counter_context *ctx = counter->ctx; + struct perf_counter *leader = counter->group_leader; + unsigned long flags; + int err; + + /* + * If this is a per-task counter, need to check whether this + * counter's task is the current task on this cpu. + */ + if (ctx->task && cpuctx->task_ctx != ctx) + return; + + curr_rq_lock_irq_save(&flags); + spin_lock(&ctx->lock); + + counter->prev_state = counter->state; + if (counter->state >= PERF_COUNTER_STATE_INACTIVE) + goto unlock; + counter->state = PERF_COUNTER_STATE_INACTIVE; + + /* + * If the counter is in a group and isn't the group leader, + * then don't put it on unless the group is on. + */ + if (leader != counter && leader->state != PERF_COUNTER_STATE_ACTIVE) + goto unlock; + + if (!group_can_go_on(counter, cpuctx, 1)) + err = -EEXIST; + else + err = counter_sched_in(counter, cpuctx, ctx, + smp_processor_id()); + + if (err) { + /* + * If this counter can't go on and it's part of a + * group, then the whole group has to come off. + */ + if (leader != counter) + group_sched_out(leader, cpuctx, ctx); + if (leader->hw_event.pinned) + leader->state = PERF_COUNTER_STATE_ERROR; + } + + unlock: + spin_unlock(&ctx->lock); + curr_rq_unlock_irq_restore(&flags); +} + +/* + * Enable a counter. + */ +static void perf_counter_enable(struct perf_counter *counter) +{ + struct perf_counter_context *ctx = counter->ctx; + struct task_struct *task = ctx->task; + + if (!task) { + /* + * Enable the counter on the cpu that it's on + */ + smp_call_function_single(counter->cpu, __perf_counter_enable, + counter, 1); + return; + } + + spin_lock_irq(&ctx->lock); + if (counter->state >= PERF_COUNTER_STATE_INACTIVE) + goto out; + + /* + * If the counter is in error state, clear that first. + * That way, if we see the counter in error state below, we + * know that it has gone back into error state, as distinct + * from the task having been scheduled away before the + * cross-call arrived. + */ + if (counter->state == PERF_COUNTER_STATE_ERROR) + counter->state = PERF_COUNTER_STATE_OFF; + + retry: + spin_unlock_irq(&ctx->lock); + task_oncpu_function_call(task, __perf_counter_enable, counter); + + spin_lock_irq(&ctx->lock); + + /* + * If the context is active and the counter is still off, + * we need to retry the cross-call. + */ + if (ctx->is_active && counter->state == PERF_COUNTER_STATE_OFF) + goto retry; + + /* + * Since we have the lock this context can't be scheduled + * in, so we can change the state safely. + */ + if (counter->state == PERF_COUNTER_STATE_OFF) + counter->state = PERF_COUNTER_STATE_INACTIVE; + out: + spin_unlock_irq(&ctx->lock); +} + +/* + * Enable a counter and all its children. + */ +static void perf_counter_enable_family(struct perf_counter *counter) +{ + struct perf_counter *child; + + perf_counter_enable(counter); + + /* + * Lock the mutex to protect the list of children + */ + mutex_lock(&counter->mutex); + list_for_each_entry(child, &counter->child_list, child_list) + perf_counter_enable(child); + mutex_unlock(&counter->mutex); +} + +void __perf_counter_sched_out(struct perf_counter_context *ctx, + struct perf_cpu_context *cpuctx) +{ + struct perf_counter *counter; + u64 flags; + + spin_lock(&ctx->lock); + ctx->is_active = 0; + if (likely(!ctx->nr_counters)) + goto out; + + flags = hw_perf_save_disable(); + if (ctx->nr_active) { + list_for_each_entry(counter, &ctx->counter_list, list_entry) + group_sched_out(counter, cpuctx, ctx); + } + hw_perf_restore(flags); + out: + spin_unlock(&ctx->lock); +} + +/* + * Called from scheduler to remove the counters of the current task, + * with interrupts disabled. + * + * We stop each counter and update the counter value in counter->count. + * + * This does not protect us against NMI, but disable() + * sets the disabled bit in the control field of counter _before_ + * accessing the counter control register. If a NMI hits, then it will + * not restart the counter. + */ +void perf_counter_task_sched_out(struct task_struct *task, int cpu) +{ + struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu); + struct perf_counter_context *ctx = &task->perf_counter_ctx; + + if (likely(!cpuctx->task_ctx)) + return; + + __perf_counter_sched_out(ctx, cpuctx); + + cpuctx->task_ctx = NULL; +} + +static void perf_counter_cpu_sched_out(struct perf_cpu_context *cpuctx) +{ + __perf_counter_sched_out(&cpuctx->ctx, cpuctx); +} + +static int +group_sched_in(struct perf_counter *group_counter, + struct perf_cpu_context *cpuctx, + struct perf_counter_context *ctx, + int cpu) +{ + struct perf_counter *counter, *partial_group; + int ret; + + if (group_counter->state == PERF_COUNTER_STATE_OFF) + return 0; + + ret = hw_perf_group_sched_in(group_counter, cpuctx, ctx, cpu); + if (ret) + return ret < 0 ? ret : 0; + + group_counter->prev_state = group_counter->state; + if (counter_sched_in(group_counter, cpuctx, ctx, cpu)) + return -EAGAIN; + + /* + * Schedule in siblings as one group (if any): + */ + list_for_each_entry(counter, &group_counter->sibling_list, list_entry) { + counter->prev_state = counter->state; + if (counter_sched_in(counter, cpuctx, ctx, cpu)) { + partial_group = counter; + goto group_error; + } + } + + return 0; + +group_error: + /* + * Groups can be scheduled in as one unit only, so undo any + * partial group before returning: + */ + list_for_each_entry(counter, &group_counter->sibling_list, list_entry) { + if (counter == partial_group) + break; + counter_sched_out(counter, cpuctx, ctx); + } + counter_sched_out(group_counter, cpuctx, ctx); + + return -EAGAIN; +} + +static void +__perf_counter_sched_in(struct perf_counter_context *ctx, + struct perf_cpu_context *cpuctx, int cpu) +{ + struct perf_counter *counter; + u64 flags; + int can_add_hw = 1; + + spin_lock(&ctx->lock); + ctx->is_active = 1; + if (likely(!ctx->nr_counters)) + goto out; + + flags = hw_perf_save_disable(); + + /* + * First go through the list and put on any pinned groups + * in order to give them the best chance of going on. + */ + list_for_each_entry(counter, &ctx->counter_list, list_entry) { + if (counter->state <= PERF_COUNTER_STATE_OFF || + !counter->hw_event.pinned) + continue; + if (counter->cpu != -1 && counter->cpu != cpu) + continue; + + if (group_can_go_on(counter, cpuctx, 1)) + group_sched_in(counter, cpuctx, ctx, cpu); + + /* + * If this pinned group hasn't been scheduled, + * put it in error state. + */ + if (counter->state == PERF_COUNTER_STATE_INACTIVE) + counter->state = PERF_COUNTER_STATE_ERROR; + } + + list_for_each_entry(counter, &ctx->counter_list, list_entry) { + /* + * Ignore counters in OFF or ERROR state, and + * ignore pinned counters since we did them already. + */ + if (counter->state <= PERF_COUNTER_STATE_OFF || + counter->hw_event.pinned) + continue; + + /* + * Listen to the 'cpu' scheduling filter constraint + * of counters: + */ + if (counter->cpu != -1 && counter->cpu != cpu) + continue; + + if (group_can_go_on(counter, cpuctx, can_add_hw)) { + if (group_sched_in(counter, cpuctx, ctx, cpu)) + can_add_hw = 0; + } + } + hw_perf_restore(flags); + out: + spin_unlock(&ctx->lock); +} + +/* + * Called from scheduler to add the counters of the current task + * with interrupts disabled. + * + * We restore the counter value and then enable it. + * + * This does not protect us against NMI, but enable() + * sets the enabled bit in the control field of counter _before_ + * accessing the counter control register. If a NMI hits, then it will + * keep the counter running. + */ +void perf_counter_task_sched_in(struct task_struct *task, int cpu) +{ + struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu); + struct perf_counter_context *ctx = &task->perf_counter_ctx; + + __perf_counter_sched_in(ctx, cpuctx, cpu); + cpuctx->task_ctx = ctx; +} + +static void perf_counter_cpu_sched_in(struct perf_cpu_context *cpuctx, int cpu) +{ + struct perf_counter_context *ctx = &cpuctx->ctx; + + __perf_counter_sched_in(ctx, cpuctx, cpu); +} + +int perf_counter_task_disable(void) +{ + struct task_struct *curr = current; + struct perf_counter_context *ctx = &curr->perf_counter_ctx; + struct perf_counter *counter; + unsigned long flags; + u64 perf_flags; + int cpu; + + if (likely(!ctx->nr_counters)) + return 0; + + curr_rq_lock_irq_save(&flags); + cpu = smp_processor_id(); + + /* force the update of the task clock: */ + __task_delta_exec(curr, 1); + + perf_counter_task_sched_out(curr, cpu); + + spin_lock(&ctx->lock); + + /* + * Disable all the counters: + */ + perf_flags = hw_perf_save_disable(); + + list_for_each_entry(counter, &ctx->counter_list, list_entry) { + if (counter->state != PERF_COUNTER_STATE_ERROR) + counter->state = PERF_COUNTER_STATE_OFF; + } + + hw_perf_restore(perf_flags); + + spin_unlock(&ctx->lock); + + curr_rq_unlock_irq_restore(&flags); + + return 0; +} + +int perf_counter_task_enable(void) +{ + struct task_struct *curr = current; + struct perf_counter_context *ctx = &curr->perf_counter_ctx; + struct perf_counter *counter; + unsigned long flags; + u64 perf_flags; + int cpu; + + if (likely(!ctx->nr_counters)) + return 0; + + curr_rq_lock_irq_save(&flags); + cpu = smp_processor_id(); + + /* force the update of the task clock: */ + __task_delta_exec(curr, 1); + + perf_counter_task_sched_out(curr, cpu); + + spin_lock(&ctx->lock); + + /* + * Disable all the counters: + */ + perf_flags = hw_perf_save_disable(); + + list_for_each_entry(counter, &ctx->counter_list, list_entry) { + if (counter->state > PERF_COUNTER_STATE_OFF) + continue; + counter->state = PERF_COUNTER_STATE_INACTIVE; + counter->hw_event.disabled = 0; + } + hw_perf_restore(perf_flags); + + spin_unlock(&ctx->lock); + + perf_counter_task_sched_in(curr, cpu); + + curr_rq_unlock_irq_restore(&flags); + + return 0; +} + +/* + * Round-robin a context's counters: + */ +static void rotate_ctx(struct perf_counter_context *ctx) +{ + struct perf_counter *counter; + u64 perf_flags; + + if (!ctx->nr_counters) + return; + + spin_lock(&ctx->lock); + /* + * Rotate the first entry last (works just fine for group counters too): + */ + perf_flags = hw_perf_save_disable(); + list_for_each_entry(counter, &ctx->counter_list, list_entry) { + list_del(&counter->list_entry); + list_add_tail(&counter->list_entry, &ctx->counter_list); + break; + } + hw_perf_restore(perf_flags); + + spin_unlock(&ctx->lock); +} + +void perf_counter_task_tick(struct task_struct *curr, int cpu) +{ + struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu); + struct perf_counter_context *ctx = &curr->perf_counter_ctx; + const int rotate_percpu = 0; + + if (rotate_percpu) + perf_counter_cpu_sched_out(cpuctx); + perf_counter_task_sched_out(curr, cpu); + + if (rotate_percpu) + rotate_ctx(&cpuctx->ctx); + rotate_ctx(ctx); + + if (rotate_percpu) + perf_counter_cpu_sched_in(cpuctx, cpu); + perf_counter_task_sched_in(curr, cpu); +} + +/* + * Cross CPU call to read the hardware counter + */ +static void __read(void *info) +{ + struct perf_counter *counter = info; + unsigned long flags; + + curr_rq_lock_irq_save(&flags); + counter->hw_ops->read(counter); + curr_rq_unlock_irq_restore(&flags); +} + +static u64 perf_counter_read(struct perf_counter *counter) +{ + /* + * If counter is enabled and currently active on a CPU, update the + * value in the counter structure: + */ + if (counter->state == PERF_COUNTER_STATE_ACTIVE) { + smp_call_function_single(counter->oncpu, + __read, counter, 1); + } + + return atomic64_read(&counter->count); +} + +/* + * Cross CPU call to switch performance data pointers + */ +static void __perf_switch_irq_data(void *info) +{ + struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context); + struct perf_counter *counter = info; + struct perf_counter_context *ctx = counter->ctx; + struct perf_data *oldirqdata = counter->irqdata; + + /* + * If this is a task context, we need to check whether it is + * the current task context of this cpu. If not it has been + * scheduled out before the smp call arrived. + */ + if (ctx->task) { + if (cpuctx->task_ctx != ctx) + return; + spin_lock(&ctx->lock); + } + + /* Change the pointer NMI safe */ + atomic_long_set((atomic_long_t *)&counter->irqdata, + (unsigned long) counter->usrdata); + counter->usrdata = oldirqdata; + + if (ctx->task) + spin_unlock(&ctx->lock); +} + +static struct perf_data *perf_switch_irq_data(struct perf_counter *counter) +{ + struct perf_counter_context *ctx = counter->ctx; + struct perf_data *oldirqdata = counter->irqdata; + struct task_struct *task = ctx->task; + + if (!task) { + smp_call_function_single(counter->cpu, + __perf_switch_irq_data, + counter, 1); + return counter->usrdata; + } + +retry: + spin_lock_irq(&ctx->lock); + if (counter->state != PERF_COUNTER_STATE_ACTIVE) { + counter->irqdata = counter->usrdata; + counter->usrdata = oldirqdata; + spin_unlock_irq(&ctx->lock); + return oldirqdata; + } + spin_unlock_irq(&ctx->lock); + task_oncpu_function_call(task, __perf_switch_irq_data, counter); + /* Might have failed, because task was scheduled out */ + if (counter->irqdata == oldirqdata) + goto retry; + + return counter->usrdata; +} + +static void put_context(struct perf_counter_context *ctx) +{ + if (ctx->task) + put_task_struct(ctx->task); +} + +static struct perf_counter_context *find_get_context(pid_t pid, int cpu) +{ + struct perf_cpu_context *cpuctx; + struct perf_counter_context *ctx; + struct task_struct *task; + + /* + * If cpu is not a wildcard then this is a percpu counter: + */ + if (cpu != -1) { + /* Must be root to operate on a CPU counter: */ + if (!capable(CAP_SYS_ADMIN)) + return ERR_PTR(-EACCES); + + if (cpu < 0 || cpu > num_possible_cpus()) + return ERR_PTR(-EINVAL); + + /* + * We could be clever and allow to attach a counter to an + * offline CPU and activate it when the CPU comes up, but + * that's for later. + */ + if (!cpu_isset(cpu, cpu_online_map)) + return ERR_PTR(-ENODEV); + + cpuctx = &per_cpu(perf_cpu_context, cpu); + ctx = &cpuctx->ctx; + + return ctx; + } + + rcu_read_lock(); + if (!pid) + task = current; + else + task = find_task_by_vpid(pid); + if (task) + get_task_struct(task); + rcu_read_unlock(); + + if (!task) + return ERR_PTR(-ESRCH); + + ctx = &task->perf_counter_ctx; + ctx->task = task; + + /* Reuse ptrace permission checks for now. */ + if (!ptrace_may_access(task, PTRACE_MODE_READ)) { + put_context(ctx); + return ERR_PTR(-EACCES); + } + + return ctx; +} + +/* + * Called when the last reference to the file is gone. + */ +static int perf_release(struct inode *inode, struct file *file) +{ + struct perf_counter *counter = file->private_data; + struct perf_counter_context *ctx = counter->ctx; + + file->private_data = NULL; + + mutex_lock(&ctx->mutex); + mutex_lock(&counter->mutex); + + perf_counter_remove_from_context(counter); + + mutex_unlock(&counter->mutex); + mutex_unlock(&ctx->mutex); + + kfree(counter); + put_context(ctx); + + return 0; +} + +/* + * Read the performance counter - simple non blocking version for now + */ +static ssize_t +perf_read_hw(struct perf_counter *counter, char __user *buf, size_t count) +{ + u64 cntval; + + if (count != sizeof(cntval)) + return -EINVAL; + + /* + * Return end-of-file for a read on a counter that is in + * error state (i.e. because it was pinned but it couldn't be + * scheduled on to the CPU at some point). + */ + if (counter->state == PERF_COUNTER_STATE_ERROR) + return 0; + + mutex_lock(&counter->mutex); + cntval = perf_counter_read(counter); + mutex_unlock(&counter->mutex); + + return put_user(cntval, (u64 __user *) buf) ? -EFAULT : sizeof(cntval); +} + +static ssize_t +perf_copy_usrdata(struct perf_data *usrdata, char __user *buf, size_t count) +{ + if (!usrdata->len) + return 0; + + count = min(count, (size_t)usrdata->len); + if (copy_to_user(buf, usrdata->data + usrdata->rd_idx, count)) + return -EFAULT; + + /* Adjust the counters */ + usrdata->len -= count; + if (!usrdata->len) + usrdata->rd_idx = 0; + else + usrdata->rd_idx += count; + + return count; +} + +static ssize_t +perf_read_irq_data(struct perf_counter *counter, + char __user *buf, + size_t count, + int nonblocking) +{ + struct perf_data *irqdata, *usrdata; + DECLARE_WAITQUEUE(wait, current); + ssize_t res, res2; + + irqdata = counter->irqdata; + usrdata = counter->usrdata; + + if (usrdata->len + irqdata->len >= count) + goto read_pending; + + if (nonblocking) + return -EAGAIN; + + spin_lock_irq(&counter->waitq.lock); + __add_wait_queue(&counter->waitq, &wait); + for (;;) { + set_current_state(TASK_INTERRUPTIBLE); + if (usrdata->len + irqdata->len >= count) + break; + + if (signal_pending(current)) + break; + + if (counter->state == PERF_COUNTER_STATE_ERROR) + break; + + spin_unlock_irq(&counter->waitq.lock); + schedule(); + spin_lock_irq(&counter->waitq.lock); + } + __remove_wait_queue(&counter->waitq, &wait); + __set_current_state(TASK_RUNNING); + spin_unlock_irq(&counter->waitq.lock); + + if (usrdata->len + irqdata->len < count && + counter->state != PERF_COUNTER_STATE_ERROR) + return -ERESTARTSYS; +read_pending: + mutex_lock(&counter->mutex); + + /* Drain pending data first: */ + res = perf_copy_usrdata(usrdata, buf, count); + if (res < 0 || res == count) + goto out; + + /* Switch irq buffer: */ + usrdata = perf_switch_irq_data(counter); + res2 = perf_copy_usrdata(usrdata, buf + res, count - res); + if (res2 < 0) { + if (!res) + res = -EFAULT; + } else { + res += res2; + } +out: + mutex_unlock(&counter->mutex); + + return res; +} + +static ssize_t +perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) +{ + struct perf_counter *counter = file->private_data; + + switch (counter->hw_event.record_type) { + case PERF_RECORD_SIMPLE: + return perf_read_hw(counter, buf, count); + + case PERF_RECORD_IRQ: + case PERF_RECORD_GROUP: + return perf_read_irq_data(counter, buf, count, + file->f_flags & O_NONBLOCK); + } + return -EINVAL; +} + +static unsigned int perf_poll(struct file *file, poll_table *wait) +{ + struct perf_counter *counter = file->private_data; + unsigned int events = 0; + unsigned long flags; + + poll_wait(file, &counter->waitq, wait); + + spin_lock_irqsave(&counter->waitq.lock, flags); + if (counter->usrdata->len || counter->irqdata->len) + events |= POLLIN; + spin_unlock_irqrestore(&counter->waitq.lock, flags); + + return events; +} + +static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg) +{ + struct perf_counter *counter = file->private_data; + int err = 0; + + switch (cmd) { + case PERF_COUNTER_IOC_ENABLE: + perf_counter_enable_family(counter); + break; + case PERF_COUNTER_IOC_DISABLE: + perf_counter_disable_family(counter); + break; + default: + err = -ENOTTY; + } + return err; +} + +static const struct file_operations perf_fops = { + .release = perf_release, + .read = perf_read, + .poll = perf_poll, + .unlocked_ioctl = perf_ioctl, + .compat_ioctl = perf_ioctl, +}; + +static int cpu_clock_perf_counter_enable(struct perf_counter *counter) +{ + int cpu = raw_smp_processor_id(); + + atomic64_set(&counter->hw.prev_count, cpu_clock(cpu)); + return 0; +} + +static void cpu_clock_perf_counter_update(struct perf_counter *counter) +{ + int cpu = raw_smp_processor_id(); + s64 prev; + u64 now; + + now = cpu_clock(cpu); + prev = atomic64_read(&counter->hw.prev_count); + atomic64_set(&counter->hw.prev_count, now); + atomic64_add(now - prev, &counter->count); +} + +static void cpu_clock_perf_counter_disable(struct perf_counter *counter) +{ + cpu_clock_perf_counter_update(counter); +} + +static void cpu_clock_perf_counter_read(struct perf_counter *counter) +{ + cpu_clock_perf_counter_update(counter); +} + +static const struct hw_perf_counter_ops perf_ops_cpu_clock = { + .enable = cpu_clock_perf_counter_enable, + .disable = cpu_clock_perf_counter_disable, + .read = cpu_clock_perf_counter_read, +}; + +/* + * Called from within the scheduler: + */ +static u64 task_clock_perf_counter_val(struct perf_counter *counter, int update) +{ + struct task_struct *curr = counter->task; + u64 delta; + + delta = __task_delta_exec(curr, update); + + return curr->se.sum_exec_runtime + delta; +} + +static void task_clock_perf_counter_update(struct perf_counter *counter, u64 now) +{ + u64 prev; + s64 delta; + + prev = atomic64_read(&counter->hw.prev_count); + + atomic64_set(&counter->hw.prev_count, now); + + delta = now - prev; + + atomic64_add(delta, &counter->count); +} + +static void task_clock_perf_counter_read(struct perf_counter *counter) +{ + u64 now = task_clock_perf_counter_val(counter, 1); + + task_clock_perf_counter_update(counter, now); +} + +static int task_clock_perf_counter_enable(struct perf_counter *counter) +{ + if (counter->prev_state <= PERF_COUNTER_STATE_OFF) + atomic64_set(&counter->hw.prev_count, + task_clock_perf_counter_val(counter, 0)); + + return 0; +} + +static void task_clock_perf_counter_disable(struct perf_counter *counter) +{ + u64 now = task_clock_perf_counter_val(counter, 0); + + task_clock_perf_counter_update(counter, now); +} + +static const struct hw_perf_counter_ops perf_ops_task_clock = { + .enable = task_clock_perf_counter_enable, + .disable = task_clock_perf_counter_disable, + .read = task_clock_perf_counter_read, +}; + +#ifdef CONFIG_VM_EVENT_COUNTERS +#define cpu_page_faults() __get_cpu_var(vm_event_states).event[PGFAULT] +#else +#define cpu_page_faults() 0 +#endif + +static u64 get_page_faults(struct perf_counter *counter) +{ + struct task_struct *curr = counter->ctx->task; + + if (curr) + return curr->maj_flt + curr->min_flt; + return cpu_page_faults(); +} + +static void page_faults_perf_counter_update(struct perf_counter *counter) +{ + u64 prev, now; + s64 delta; + + prev = atomic64_read(&counter->hw.prev_count); + now = get_page_faults(counter); + + atomic64_set(&counter->hw.prev_count, now); + + delta = now - prev; + + atomic64_add(delta, &counter->count); +} + +static void page_faults_perf_counter_read(struct perf_counter *counter) +{ + page_faults_perf_counter_update(counter); +} + +static int page_faults_perf_counter_enable(struct perf_counter *counter) +{ + if (counter->prev_state <= PERF_COUNTER_STATE_OFF) + atomic64_set(&counter->hw.prev_count, get_page_faults(counter)); + return 0; +} + +static void page_faults_perf_counter_disable(struct perf_counter *counter) +{ + page_faults_perf_counter_update(counter); +} + +static const struct hw_perf_counter_ops perf_ops_page_faults = { + .enable = page_faults_perf_counter_enable, + .disable = page_faults_perf_counter_disable, + .read = page_faults_perf_counter_read, +}; + +static u64 get_context_switches(struct perf_counter *counter) +{ + struct task_struct *curr = counter->ctx->task; + + if (curr) + return curr->nvcsw + curr->nivcsw; + return cpu_nr_switches(smp_processor_id()); +} + +static void context_switches_perf_counter_update(struct perf_counter *counter) +{ + u64 prev, now; + s64 delta; + + prev = atomic64_read(&counter->hw.prev_count); + now = get_context_switches(counter); + + atomic64_set(&counter->hw.prev_count, now); + + delta = now - prev; + + atomic64_add(delta, &counter->count); +} + +static void context_switches_perf_counter_read(struct perf_counter *counter) +{ + context_switches_perf_counter_update(counter); +} + +static int context_switches_perf_counter_enable(struct perf_counter *counter) +{ + if (counter->prev_state <= PERF_COUNTER_STATE_OFF) + atomic64_set(&counter->hw.prev_count, + get_context_switches(counter)); + return 0; +} + +static void context_switches_perf_counter_disable(struct perf_counter *counter) +{ + context_switches_perf_counter_update(counter); +} + +static const struct hw_perf_counter_ops perf_ops_context_switches = { + .enable = context_switches_perf_counter_enable, + .disable = context_switches_perf_counter_disable, + .read = context_switches_perf_counter_read, +}; + +static inline u64 get_cpu_migrations(struct perf_counter *counter) +{ + struct task_struct *curr = counter->ctx->task; + + if (curr) + return curr->se.nr_migrations; + return cpu_nr_migrations(smp_processor_id()); +} + +static void cpu_migrations_perf_counter_update(struct perf_counter *counter) +{ + u64 prev, now; + s64 delta; + + prev = atomic64_read(&counter->hw.prev_count); + now = get_cpu_migrations(counter); + + atomic64_set(&counter->hw.prev_count, now); + + delta = now - prev; + + atomic64_add(delta, &counter->count); +} + +static void cpu_migrations_perf_counter_read(struct perf_counter *counter) +{ + cpu_migrations_perf_counter_update(counter); +} + +static int cpu_migrations_perf_counter_enable(struct perf_counter *counter) +{ + if (counter->prev_state <= PERF_COUNTER_STATE_OFF) + atomic64_set(&counter->hw.prev_count, + get_cpu_migrations(counter)); + return 0; +} + +static void cpu_migrations_perf_counter_disable(struct perf_counter *counter) +{ + cpu_migrations_perf_counter_update(counter); +} + +static const struct hw_perf_counter_ops perf_ops_cpu_migrations = { + .enable = cpu_migrations_perf_counter_enable, + .disable = cpu_migrations_perf_counter_disable, + .read = cpu_migrations_perf_counter_read, +}; + +static const struct hw_perf_counter_ops * +sw_perf_counter_init(struct perf_counter *counter) +{ + const struct hw_perf_counter_ops *hw_ops = NULL; + + /* + * Software counters (currently) can't in general distinguish + * between user, kernel and hypervisor events. + * However, context switches and cpu migrations are considered + * to be kernel events, and page faults are never hypervisor + * events. + */ + switch (counter->hw_event.type) { + case PERF_COUNT_CPU_CLOCK: + if (!(counter->hw_event.exclude_user || + counter->hw_event.exclude_kernel || + counter->hw_event.exclude_hv)) + hw_ops = &perf_ops_cpu_clock; + break; + case PERF_COUNT_TASK_CLOCK: + if (counter->hw_event.exclude_user || + counter->hw_event.exclude_kernel || + counter->hw_event.exclude_hv) + break; + /* + * If the user instantiates this as a per-cpu counter, + * use the cpu_clock counter instead. + */ + if (counter->ctx->task) + hw_ops = &perf_ops_task_clock; + else + hw_ops = &perf_ops_cpu_clock; + break; + case PERF_COUNT_PAGE_FAULTS: + if (!(counter->hw_event.exclude_user || + counter->hw_event.exclude_kernel)) + hw_ops = &perf_ops_page_faults; + break; + case PERF_COUNT_CONTEXT_SWITCHES: + if (!counter->hw_event.exclude_kernel) + hw_ops = &perf_ops_context_switches; + break; + case PERF_COUNT_CPU_MIGRATIONS: + if (!counter->hw_event.exclude_kernel) + hw_ops = &perf_ops_cpu_migrations; + break; + default: + break; + } + return hw_ops; +} + +/* + * Allocate and initialize a counter structure + */ +static struct perf_counter * +perf_counter_alloc(struct perf_counter_hw_event *hw_event, + int cpu, + struct perf_counter_context *ctx, + struct perf_counter *group_leader, + gfp_t gfpflags) +{ + const struct hw_perf_counter_ops *hw_ops; + struct perf_counter *counter; + + counter = kzalloc(sizeof(*counter), gfpflags); + if (!counter) + return NULL; + + /* + * Single counters are their own group leaders, with an + * empty sibling list: + */ + if (!group_leader) + group_leader = counter; + + mutex_init(&counter->mutex); + INIT_LIST_HEAD(&counter->list_entry); + INIT_LIST_HEAD(&counter->sibling_list); + init_waitqueue_head(&counter->waitq); + + INIT_LIST_HEAD(&counter->child_list); + + counter->irqdata = &counter->data[0]; + counter->usrdata = &counter->data[1]; + counter->cpu = cpu; + counter->hw_event = *hw_event; + counter->wakeup_pending = 0; + counter->group_leader = group_leader; + counter->hw_ops = NULL; + counter->ctx = ctx; + + counter->state = PERF_COUNTER_STATE_INACTIVE; + if (hw_event->disabled) + counter->state = PERF_COUNTER_STATE_OFF; + + hw_ops = NULL; + if (!hw_event->raw && hw_event->type < 0) + hw_ops = sw_perf_counter_init(counter); + else + hw_ops = hw_perf_counter_init(counter); + + if (!hw_ops) { + kfree(counter); + return NULL; + } + counter->hw_ops = hw_ops; + + return counter; +} + +/** + * sys_perf_task_open - open a performance counter, associate it to a task/cpu + * + * @hw_event_uptr: event type attributes for monitoring/sampling + * @pid: target pid + * @cpu: target cpu + * @group_fd: group leader counter fd + */ +asmlinkage int +sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr __user, + pid_t pid, int cpu, int group_fd) +{ + struct perf_counter *counter, *group_leader; + struct perf_counter_hw_event hw_event; + struct perf_counter_context *ctx; + struct file *counter_file = NULL; + struct file *group_file = NULL; + int fput_needed = 0; + int fput_needed2 = 0; + int ret; + + if (copy_from_user(&hw_event, hw_event_uptr, sizeof(hw_event)) != 0) + return -EFAULT; + + /* + * Get the target context (task or percpu): + */ + ctx = find_get_context(pid, cpu); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + /* + * Look up the group leader (we will attach this counter to it): + */ + group_leader = NULL; + if (group_fd != -1) { + ret = -EINVAL; + group_file = fget_light(group_fd, &fput_needed); + if (!group_file) + goto err_put_context; + if (group_file->f_op != &perf_fops) + goto err_put_context; + + group_leader = group_file->private_data; + /* + * Do not allow a recursive hierarchy (this new sibling + * becoming part of another group-sibling): + */ + if (group_leader->group_leader != group_leader) + goto err_put_context; + /* + * Do not allow to attach to a group in a different + * task or CPU context: + */ + if (group_leader->ctx != ctx) + goto err_put_context; + /* + * Only a group leader can be exclusive or pinned + */ + if (hw_event.exclusive || hw_event.pinned) + goto err_put_context; + } + + ret = -EINVAL; + counter = perf_counter_alloc(&hw_event, cpu, ctx, group_leader, + GFP_KERNEL); + if (!counter) + goto err_put_context; + + ret = anon_inode_getfd("[perf_counter]", &perf_fops, counter, 0); + if (ret < 0) + goto err_free_put_context; + + counter_file = fget_light(ret, &fput_needed2); + if (!counter_file) + goto err_free_put_context; + + counter->filp = counter_file; + mutex_lock(&ctx->mutex); + perf_install_in_context(ctx, counter, cpu); + mutex_unlock(&ctx->mutex); + + fput_light(counter_file, fput_needed2); + +out_fput: + fput_light(group_file, fput_needed); + + return ret; + +err_free_put_context: + kfree(counter); + +err_put_context: + put_context(ctx); + + goto out_fput; +} + +/* + * Initialize the perf_counter context in a task_struct: + */ +static void +__perf_counter_init_context(struct perf_counter_context *ctx, + struct task_struct *task) +{ + memset(ctx, 0, sizeof(*ctx)); + spin_lock_init(&ctx->lock); + mutex_init(&ctx->mutex); + INIT_LIST_HEAD(&ctx->counter_list); + ctx->task = task; +} + +/* + * inherit a counter from parent task to child task: + */ +static struct perf_counter * +inherit_counter(struct perf_counter *parent_counter, + struct task_struct *parent, + struct perf_counter_context *parent_ctx, + struct task_struct *child, + struct perf_counter *group_leader, + struct perf_counter_context *child_ctx) +{ + struct perf_counter *child_counter; + + /* + * Instead of creating recursive hierarchies of counters, + * we link inherited counters back to the original parent, + * which has a filp for sure, which we use as the reference + * count: + */ + if (parent_counter->parent) + parent_counter = parent_counter->parent; + + child_counter = perf_counter_alloc(&parent_counter->hw_event, + parent_counter->cpu, child_ctx, + group_leader, GFP_KERNEL); + if (!child_counter) + return NULL; + + /* + * Link it up in the child's context: + */ + child_counter->task = child; + list_add_counter(child_counter, child_ctx); + child_ctx->nr_counters++; + + child_counter->parent = parent_counter; + /* + * inherit into child's child as well: + */ + child_counter->hw_event.inherit = 1; + + /* + * Get a reference to the parent filp - we will fput it + * when the child counter exits. This is safe to do because + * we are in the parent and we know that the filp still + * exists and has a nonzero count: + */ + atomic_long_inc(&parent_counter->filp->f_count); + + /* + * Link this into the parent counter's child list + */ + mutex_lock(&parent_counter->mutex); + list_add_tail(&child_counter->child_list, &parent_counter->child_list); + + /* + * Make the child state follow the state of the parent counter, + * not its hw_event.disabled bit. We hold the parent's mutex, + * so we won't race with perf_counter_{en,dis}able_family. + */ + if (parent_counter->state >= PERF_COUNTER_STATE_INACTIVE) + child_counter->state = PERF_COUNTER_STATE_INACTIVE; + else + child_counter->state = PERF_COUNTER_STATE_OFF; + + mutex_unlock(&parent_counter->mutex); + + return child_counter; +} + +static int inherit_group(struct perf_counter *parent_counter, + struct task_struct *parent, + struct perf_counter_context *parent_ctx, + struct task_struct *child, + struct perf_counter_context *child_ctx) +{ + struct perf_counter *leader; + struct perf_counter *sub; + + leader = inherit_counter(parent_counter, parent, parent_ctx, + child, NULL, child_ctx); + if (!leader) + return -ENOMEM; + list_for_each_entry(sub, &parent_counter->sibling_list, list_entry) { + if (!inherit_counter(sub, parent, parent_ctx, + child, leader, child_ctx)) + return -ENOMEM; + } + return 0; +} + +static void sync_child_counter(struct perf_counter *child_counter, + struct perf_counter *parent_counter) +{ + u64 parent_val, child_val; + + parent_val = atomic64_read(&parent_counter->count); + child_val = atomic64_read(&child_counter->count); + + /* + * Add back the child's count to the parent's count: + */ + atomic64_add(child_val, &parent_counter->count); + + /* + * Remove this counter from the parent's list + */ + mutex_lock(&parent_counter->mutex); + list_del_init(&child_counter->child_list); + mutex_unlock(&parent_counter->mutex); + + /* + * Release the parent counter, if this was the last + * reference to it. + */ + fput(parent_counter->filp); +} + +static void +__perf_counter_exit_task(struct task_struct *child, + struct perf_counter *child_counter, + struct perf_counter_context *child_ctx) +{ + struct perf_counter *parent_counter; + struct perf_counter *sub, *tmp; + + /* + * If we do not self-reap then we have to wait for the + * child task to unschedule (it will happen for sure), + * so that its counter is at its final count. (This + * condition triggers rarely - child tasks usually get + * off their CPU before the parent has a chance to + * get this far into the reaping action) + */ + if (child != current) { + wait_task_inactive(child, 0); + list_del_init(&child_counter->list_entry); + } else { + struct perf_cpu_context *cpuctx; + unsigned long flags; + u64 perf_flags; + + /* + * Disable and unlink this counter. + * + * Be careful about zapping the list - IRQ/NMI context + * could still be processing it: + */ + curr_rq_lock_irq_save(&flags); + perf_flags = hw_perf_save_disable(); + + cpuctx = &__get_cpu_var(perf_cpu_context); + + group_sched_out(child_counter, cpuctx, child_ctx); + + list_del_init(&child_counter->list_entry); + + child_ctx->nr_counters--; + + hw_perf_restore(perf_flags); + curr_rq_unlock_irq_restore(&flags); + } + + parent_counter = child_counter->parent; + /* + * It can happen that parent exits first, and has counters + * that are still around due to the child reference. These + * counters need to be zapped - but otherwise linger. + */ + if (parent_counter) { + sync_child_counter(child_counter, parent_counter); + list_for_each_entry_safe(sub, tmp, &child_counter->sibling_list, + list_entry) { + if (sub->parent) { + sync_child_counter(sub, sub->parent); + kfree(sub); + } + } + kfree(child_counter); + } +} + +/* + * When a child task exits, feed back counter values to parent counters. + * + * Note: we may be running in child context, but the PID is not hashed + * anymore so new counters will not be added. + */ +void perf_counter_exit_task(struct task_struct *child) +{ + struct perf_counter *child_counter, *tmp; + struct perf_counter_context *child_ctx; + + child_ctx = &child->perf_counter_ctx; + + if (likely(!child_ctx->nr_counters)) + return; + + list_for_each_entry_safe(child_counter, tmp, &child_ctx->counter_list, + list_entry) + __perf_counter_exit_task(child, child_counter, child_ctx); +} + +/* + * Initialize the perf_counter context in task_struct + */ +void perf_counter_init_task(struct task_struct *child) +{ + struct perf_counter_context *child_ctx, *parent_ctx; + struct perf_counter *counter; + struct task_struct *parent = current; + + child_ctx = &child->perf_counter_ctx; + parent_ctx = &parent->perf_counter_ctx; + + __perf_counter_init_context(child_ctx, child); + + /* + * This is executed from the parent task context, so inherit + * counters that have been marked for cloning: + */ + + if (likely(!parent_ctx->nr_counters)) + return; + + /* + * Lock the parent list. No need to lock the child - not PID + * hashed yet and not running, so nobody can access it. + */ + mutex_lock(&parent_ctx->mutex); + + /* + * We dont have to disable NMIs - we are only looking at + * the list, not manipulating it: + */ + list_for_each_entry(counter, &parent_ctx->counter_list, list_entry) { + if (!counter->hw_event.inherit) + continue; + + if (inherit_group(counter, parent, + parent_ctx, child, child_ctx)) + break; + } + + mutex_unlock(&parent_ctx->mutex); +} + +static void __cpuinit perf_counter_init_cpu(int cpu) +{ + struct perf_cpu_context *cpuctx; + + cpuctx = &per_cpu(perf_cpu_context, cpu); + __perf_counter_init_context(&cpuctx->ctx, NULL); + + mutex_lock(&perf_resource_mutex); + cpuctx->max_pertask = perf_max_counters - perf_reserved_percpu; + mutex_unlock(&perf_resource_mutex); + + hw_perf_counter_setup(cpu); +} + +#ifdef CONFIG_HOTPLUG_CPU +static void __perf_counter_exit_cpu(void *info) +{ + struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context); + struct perf_counter_context *ctx = &cpuctx->ctx; + struct perf_counter *counter, *tmp; + + list_for_each_entry_safe(counter, tmp, &ctx->counter_list, list_entry) + __perf_counter_remove_from_context(counter); +} +static void perf_counter_exit_cpu(int cpu) +{ + struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu); + struct perf_counter_context *ctx = &cpuctx->ctx; + + mutex_lock(&ctx->mutex); + smp_call_function_single(cpu, __perf_counter_exit_cpu, NULL, 1); + mutex_unlock(&ctx->mutex); +} +#else +static inline void perf_counter_exit_cpu(int cpu) { } +#endif + +static int __cpuinit +perf_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu) +{ + unsigned int cpu = (long)hcpu; + + switch (action) { + + case CPU_UP_PREPARE: + case CPU_UP_PREPARE_FROZEN: + perf_counter_init_cpu(cpu); + break; + + case CPU_DOWN_PREPARE: + case CPU_DOWN_PREPARE_FROZEN: + perf_counter_exit_cpu(cpu); + break; + + default: + break; + } + + return NOTIFY_OK; +} + +static struct notifier_block __cpuinitdata perf_cpu_nb = { + .notifier_call = perf_cpu_notify, +}; + +static int __init perf_counter_init(void) +{ + perf_cpu_notify(&perf_cpu_nb, (unsigned long)CPU_UP_PREPARE, + (void *)(long)smp_processor_id()); + register_cpu_notifier(&perf_cpu_nb); + + return 0; +} +early_initcall(perf_counter_init); + +static ssize_t perf_show_reserve_percpu(struct sysdev_class *class, char *buf) +{ + return sprintf(buf, "%d\n", perf_reserved_percpu); +} + +static ssize_t +perf_set_reserve_percpu(struct sysdev_class *class, + const char *buf, + size_t count) +{ + struct perf_cpu_context *cpuctx; + unsigned long val; + int err, cpu, mpt; + + err = strict_strtoul(buf, 10, &val); + if (err) + return err; + if (val > perf_max_counters) + return -EINVAL; + + mutex_lock(&perf_resource_mutex); + perf_reserved_percpu = val; + for_each_online_cpu(cpu) { + cpuctx = &per_cpu(perf_cpu_context, cpu); + spin_lock_irq(&cpuctx->ctx.lock); + mpt = min(perf_max_counters - cpuctx->ctx.nr_counters, + perf_max_counters - perf_reserved_percpu); + cpuctx->max_pertask = mpt; + spin_unlock_irq(&cpuctx->ctx.lock); + } + mutex_unlock(&perf_resource_mutex); + + return count; +} + +static ssize_t perf_show_overcommit(struct sysdev_class *class, char *buf) +{ + return sprintf(buf, "%d\n", perf_overcommit); +} + +static ssize_t +perf_set_overcommit(struct sysdev_class *class, const char *buf, size_t count) +{ + unsigned long val; + int err; + + err = strict_strtoul(buf, 10, &val); + if (err) + return err; + if (val > 1) + return -EINVAL; + + mutex_lock(&perf_resource_mutex); + perf_overcommit = val; + mutex_unlock(&perf_resource_mutex); + + return count; +} + +static SYSDEV_CLASS_ATTR( + reserve_percpu, + 0644, + perf_show_reserve_percpu, + perf_set_reserve_percpu + ); + +static SYSDEV_CLASS_ATTR( + overcommit, + 0644, + perf_show_overcommit, + perf_set_overcommit + ); + +static struct attribute *perfclass_attrs[] = { + &attr_reserve_percpu.attr, + &attr_overcommit.attr, + NULL +}; + +static struct attribute_group perfclass_attr_group = { + .attrs = perfclass_attrs, + .name = "perf_counters", +}; + +static int __init perf_counter_sysfs_init(void) +{ + return sysfs_create_group(&cpu_sysdev_class.kset.kobj, + &perfclass_attr_group); +} +device_initcall(perf_counter_sysfs_init); Index: linux-2.6-tip/kernel/power/snapshot.c =================================================================== --- linux-2.6-tip.orig/kernel/power/snapshot.c +++ linux-2.6-tip/kernel/power/snapshot.c @@ -486,8 +486,8 @@ static int memory_bm_find_bit(struct mem static void memory_bm_set_bit(struct memory_bitmap *bm, unsigned long pfn) { - void *addr; - unsigned int bit; + unsigned int bit = 0; + void *addr = NULL; int error; error = memory_bm_find_bit(bm, pfn, &addr, &bit); @@ -520,8 +520,8 @@ static void memory_bm_clear_bit(struct m static int memory_bm_test_bit(struct memory_bitmap *bm, unsigned long pfn) { - void *addr; - unsigned int bit; + unsigned int bit = 0; + void *addr = NULL; int error; error = memory_bm_find_bit(bm, pfn, &addr, &bit); Index: linux-2.6-tip/kernel/profile.c =================================================================== --- linux-2.6-tip.orig/kernel/profile.c +++ linux-2.6-tip/kernel/profile.c @@ -263,6 +263,7 @@ EXPORT_SYMBOL_GPL(unregister_timer_hook) * * -- wli */ +#ifdef CONFIG_PROC_FS static void __profile_flip_buffers(void *unused) { int cpu = smp_processor_id(); @@ -308,57 +309,6 @@ static void profile_discard_flip_buffers mutex_unlock(&profile_flip_mutex); } -void profile_hits(int type, void *__pc, unsigned int nr_hits) -{ - unsigned long primary, secondary, flags, pc = (unsigned long)__pc; - int i, j, cpu; - struct profile_hit *hits; - - if (prof_on != type || !prof_buffer) - return; - pc = min((pc - (unsigned long)_stext) >> prof_shift, prof_len - 1); - i = primary = (pc & (NR_PROFILE_GRP - 1)) << PROFILE_GRPSHIFT; - secondary = (~(pc << 1) & (NR_PROFILE_GRP - 1)) << PROFILE_GRPSHIFT; - cpu = get_cpu(); - hits = per_cpu(cpu_profile_hits, cpu)[per_cpu(cpu_profile_flip, cpu)]; - if (!hits) { - put_cpu(); - return; - } - /* - * We buffer the global profiler buffer into a per-CPU - * queue and thus reduce the number of global (and possibly - * NUMA-alien) accesses. The write-queue is self-coalescing: - */ - local_irq_save(flags); - do { - for (j = 0; j < PROFILE_GRPSZ; ++j) { - if (hits[i + j].pc == pc) { - hits[i + j].hits += nr_hits; - goto out; - } else if (!hits[i + j].hits) { - hits[i + j].pc = pc; - hits[i + j].hits = nr_hits; - goto out; - } - } - i = (i + secondary) & (NR_PROFILE_HIT - 1); - } while (i != primary); - - /* - * Add the current hit(s) and flush the write-queue out - * to the global buffer: - */ - atomic_add(nr_hits, &prof_buffer[pc]); - for (i = 0; i < NR_PROFILE_HIT; ++i) { - atomic_add(hits[i].hits, &prof_buffer[hits[i].pc]); - hits[i].pc = hits[i].hits = 0; - } -out: - local_irq_restore(flags); - put_cpu(); -} - static int __cpuinit profile_cpu_callback(struct notifier_block *info, unsigned long action, void *__cpu) { @@ -417,6 +367,60 @@ out_free: } return NOTIFY_OK; } +#endif /* CONFIG_PROC_FS */ + +void profile_hits(int type, void *__pc, unsigned int nr_hits) +{ + unsigned long primary, secondary, flags, pc = (unsigned long)__pc; + int i, j, cpu; + struct profile_hit *hits; + + if (prof_on != type || !prof_buffer) + return; + pc = min((pc - (unsigned long)_stext) >> prof_shift, prof_len - 1); + i = primary = (pc & (NR_PROFILE_GRP - 1)) << PROFILE_GRPSHIFT; + secondary = (~(pc << 1) & (NR_PROFILE_GRP - 1)) << PROFILE_GRPSHIFT; + cpu = get_cpu(); + hits = per_cpu(cpu_profile_hits, cpu)[per_cpu(cpu_profile_flip, cpu)]; + if (!hits) { + put_cpu(); + return; + } + /* + * We buffer the global profiler buffer into a per-CPU + * queue and thus reduce the number of global (and possibly + * NUMA-alien) accesses. The write-queue is self-coalescing: + */ + local_irq_save(flags); + do { + for (j = 0; j < PROFILE_GRPSZ; ++j) { + if (hits[i + j].pc == pc) { + hits[i + j].hits += nr_hits; + goto out; + } else if (!hits[i + j].hits) { + hits[i + j].pc = pc; + hits[i + j].hits = nr_hits; + goto out; + } + } + i = (i + secondary) & (NR_PROFILE_HIT - 1); + } while (i != primary); + + /* + * Add the current hit(s) and flush the write-queue out + * to the global buffer: + */ + atomic_add(nr_hits, &prof_buffer[pc]); + for (i = 0; i < NR_PROFILE_HIT; ++i) { + atomic_add(hits[i].hits, &prof_buffer[hits[i].pc]); + hits[i].pc = hits[i].hits = 0; + } +out: + local_irq_restore(flags); + put_cpu(); +} + + #else /* !CONFIG_SMP */ #define profile_flip_buffers() do { } while (0) #define profile_discard_flip_buffers() do { } while (0) @@ -610,7 +614,7 @@ out_cleanup: #define create_hash_tables() ({ 0; }) #endif -int __ref create_proc_profile(void) /* false positive from hotcpu_notifier */ +int create_proc_profile(void) { struct proc_dir_entry *entry; Index: linux-2.6-tip/kernel/rcuclassic.c =================================================================== --- linux-2.6-tip.orig/kernel/rcuclassic.c +++ linux-2.6-tip/kernel/rcuclassic.c @@ -679,8 +679,8 @@ int rcu_needs_cpu(int cpu) void rcu_check_callbacks(int cpu, int user) { if (user || - (idle_cpu(cpu) && !in_softirq() && - hardirq_count() <= (1 << HARDIRQ_SHIFT))) { + (idle_cpu(cpu) && (system_state != SYSTEM_BOOTING) && + !in_softirq() && hardirq_count() <= (1 << HARDIRQ_SHIFT))) { /* * Get here if this CPU took its interrupt from user Index: linux-2.6-tip/kernel/rcutree.c =================================================================== --- linux-2.6-tip.orig/kernel/rcutree.c +++ linux-2.6-tip/kernel/rcutree.c @@ -948,8 +948,8 @@ static void rcu_do_batch(struct rcu_data void rcu_check_callbacks(int cpu, int user) { if (user || - (idle_cpu(cpu) && !in_softirq() && - hardirq_count() <= (1 << HARDIRQ_SHIFT))) { + (idle_cpu(cpu) && (system_state != SYSTEM_BOOTING) && + !in_softirq() && hardirq_count() <= (1 << HARDIRQ_SHIFT))) { /* * Get here if this CPU took its interrupt from user Index: linux-2.6-tip/kernel/relay.c =================================================================== --- linux-2.6-tip.orig/kernel/relay.c +++ linux-2.6-tip/kernel/relay.c @@ -343,6 +343,10 @@ static void wakeup_readers(unsigned long { struct rchan_buf *buf = (struct rchan_buf *)data; wake_up_interruptible(&buf->read_wait); + /* + * Stupid polling for now: + */ + mod_timer(&buf->timer, jiffies + 1); } /** @@ -360,6 +364,7 @@ static void __relay_reset(struct rchan_b init_waitqueue_head(&buf->read_wait); kref_init(&buf->kref); setup_timer(&buf->timer, wakeup_readers, (unsigned long)buf); + mod_timer(&buf->timer, jiffies + 1); } else del_timer_sync(&buf->timer); @@ -677,9 +682,7 @@ int relay_late_setup_files(struct rchan */ for_each_online_cpu(i) { if (unlikely(!chan->buf[i])) { - printk(KERN_ERR "relay_late_setup_files: CPU %u " - "has no buffer, it must have!\n", i); - BUG(); + WARN_ONCE(1, KERN_ERR "CPU has no buffer!\n"); err = -EINVAL; break; } @@ -742,15 +745,6 @@ size_t relay_switch_subbuf(struct rchan_ else buf->early_bytes += buf->chan->subbuf_size - buf->padding[old_subbuf]; - smp_mb(); - if (waitqueue_active(&buf->read_wait)) - /* - * Calling wake_up_interruptible() from here - * will deadlock if we happen to be logging - * from the scheduler (trying to re-grab - * rq->lock), so defer it. - */ - __mod_timer(&buf->timer, jiffies + 1); } old = buf->data; Index: linux-2.6-tip/kernel/sched.c =================================================================== --- linux-2.6-tip.orig/kernel/sched.c +++ linux-2.6-tip/kernel/sched.c @@ -4,6 +4,7 @@ * Kernel scheduler and related syscalls * * Copyright (C) 1991-2002 Linus Torvalds + * Copyright (C) 2004 Red Hat, Inc., Ingo Molnar * * 1996-12-23 Modified by Dave Grothe to fix bugs in semaphores and * make semaphores SMP safe @@ -16,6 +17,7 @@ * by Davide Libenzi, preemptible kernel bits by Robert Love. * 2003-09-03 Interactivity tuning by Con Kolivas. * 2004-04-02 Scheduler domains code by Nick Piggin + * 2004-10-13 Real-Time Preemption support by Ingo Molnar * 2007-04-15 Work begun on replacing all interactivity tuning with a * fair scheduling design by Con Kolivas. * 2007-05-05 Load balancing (smp-nice) and other improvements @@ -60,6 +62,7 @@ #include #include #include +#include #include #include #include @@ -105,6 +108,20 @@ #define NICE_0_LOAD SCHED_LOAD_SCALE #define NICE_0_SHIFT SCHED_LOAD_SHIFT +#if (BITS_PER_LONG < 64) +#define JIFFIES_TO_NS64(TIME) \ + ((unsigned long long)(TIME) * ((unsigned long) (1000000000 / HZ))) + +#define NS64_TO_JIFFIES(TIME) \ + ((((unsigned long long)((TIME)) >> BITS_PER_LONG) * \ + (1 + NS_TO_JIFFIES(~0UL))) + NS_TO_JIFFIES((unsigned long)(TIME))) +#else /* BITS_PER_LONG < 64 */ + +#define NS64_TO_JIFFIES(TIME) NS_TO_JIFFIES(TIME) +#define JIFFIES_TO_NS64(TIME) JIFFIES_TO_NS(TIME) + +#endif /* BITS_PER_LONG < 64 */ + /* * These are the 'tuning knobs' of the scheduler: * @@ -148,6 +165,32 @@ static inline void sg_inc_cpu_power(stru } #endif +#define TASK_PREEMPTS_CURR(p, rq) \ + ((p)->prio < (rq)->curr->prio) + +/* + * Tweaks for current + */ + +#ifdef CURRENT_PTR +struct task_struct * const ___current = &init_task; +struct task_struct ** const current_ptr = (struct task_struct ** const)&___current; +struct thread_info * const current_ti = &init_thread_union.thread_info; +struct thread_info ** const current_ti_ptr = (struct thread_info ** const)¤t_ti; + +EXPORT_SYMBOL(___current); +EXPORT_SYMBOL(current_ti); + +/* + * The scheduler itself doesnt want 'current' to be cached + * during context-switches: + */ +# undef current +# define current __current() +# undef current_thread_info +# define current_thread_info() __current_thread_info() +#endif + static inline int rt_policy(int policy) { if (unlikely(policy == SCHED_FIFO || policy == SCHED_RR)) @@ -170,7 +213,7 @@ struct rt_prio_array { struct rt_bandwidth { /* nests inside the rq lock: */ - spinlock_t rt_runtime_lock; + raw_spinlock_t rt_runtime_lock; ktime_t rt_period; u64 rt_runtime; struct hrtimer rt_period_timer; @@ -211,6 +254,7 @@ void init_rt_bandwidth(struct rt_bandwid hrtimer_init(&rt_b->rt_period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); + rt_b->rt_period_timer.irqsafe = 1; rt_b->rt_period_timer.function = sched_rt_period_timer; } @@ -467,17 +511,24 @@ struct rt_rq { struct rt_prio_array active; unsigned long rt_nr_running; #if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED - int highest_prio; /* highest queued rt task prio */ + struct { + int curr; /* highest queued rt task prio */ +#ifdef CONFIG_SMP + int next; /* next highest */ +#endif + } highest_prio; #endif #ifdef CONFIG_SMP unsigned long rt_nr_migratory; int overloaded; + struct plist_head pushable_tasks; #endif + unsigned long rt_nr_uninterruptible; int rt_throttled; u64 rt_time; u64 rt_runtime; /* Nests inside the rq lock: */ - spinlock_t rt_runtime_lock; + raw_spinlock_t rt_runtime_lock; #ifdef CONFIG_RT_GROUP_SCHED unsigned long rt_nr_boosted; @@ -540,7 +591,7 @@ static struct root_domain def_root_domai */ struct rq { /* runqueue lock: */ - spinlock_t lock; + raw_spinlock_t lock; /* * nr_running and cpu_load should be in the same cacheline because @@ -549,7 +600,6 @@ struct rq { unsigned long nr_running; #define CPU_LOAD_IDX_MAX 5 unsigned long cpu_load[CPU_LOAD_IDX_MAX]; - unsigned char idle_at_tick; #ifdef CONFIG_NO_HZ unsigned long last_tick_seen; unsigned char in_nohz_recently; @@ -558,6 +608,7 @@ struct rq { struct load_weight load; unsigned long nr_load_updates; u64 nr_switches; + u64 nr_migrations_in; struct cfs_rq cfs; struct rt_rq rt; @@ -578,6 +629,8 @@ struct rq { */ unsigned long nr_uninterruptible; + unsigned long switch_timestamp; + unsigned long slice_avg; struct task_struct *curr, *idle; unsigned long next_balance; struct mm_struct *prev_mm; @@ -590,6 +643,7 @@ struct rq { struct root_domain *rd; struct sched_domain *sd; + unsigned char idle_at_tick; /* For active balancing */ int active_balance; int push_cpu; @@ -634,6 +688,13 @@ struct rq { /* BKL stats */ unsigned int bkl_count; + + /* RT-overload stats: */ + unsigned long rto_schedule; + unsigned long rto_schedule_tail; + unsigned long rto_wakeup; + unsigned long rto_pulled; + unsigned long rto_pushed; #endif }; @@ -668,11 +729,18 @@ static inline int cpu_of(struct rq *rq) #define task_rq(p) cpu_rq(task_cpu(p)) #define cpu_curr(cpu) (cpu_rq(cpu)->curr) -static inline void update_rq_clock(struct rq *rq) +inline void update_rq_clock(struct rq *rq) { rq->clock = sched_clock_cpu(cpu_of(rq)); } +#ifndef CONFIG_SMP +int task_is_current(struct task_struct *task) +{ + return task_rq(task)->curr == task; +} +#endif + /* * Tunables that become constants when CONFIG_SCHED_DEBUG is off: */ @@ -861,11 +929,23 @@ static inline u64 global_rt_runtime(void return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC; } +/* + * We really dont want to do anything complex within switch_to() + * on PREEMPT_RT - this check enforces this. + */ +#ifdef prepare_arch_switch +# ifdef CONFIG_PREEMPT_RT +# error FIXME +# else +# define _finish_arch_switch finish_arch_switch +# endif +#endif + #ifndef prepare_arch_switch # define prepare_arch_switch(next) do { } while (0) #endif #ifndef finish_arch_switch -# define finish_arch_switch(prev) do { } while (0) +# define _finish_arch_switch(prev) do { } while (0) #endif static inline int task_current(struct rq *rq, struct task_struct *p) @@ -873,18 +953,39 @@ static inline int task_current(struct rq return rq->curr == p; } -#ifndef __ARCH_WANT_UNLOCKED_CTXSW static inline int task_running(struct rq *rq, struct task_struct *p) { +#ifdef CONFIG_SMP + return p->oncpu; +#else return task_current(rq, p); +#endif } +#ifndef __ARCH_WANT_UNLOCKED_CTXSW static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next) { +#ifdef CONFIG_SMP + /* + * We can optimise this out completely for !SMP, because the + * SMP rebalancing from interrupt is the only thing that cares + * here. + */ + next->oncpu = 1; +#endif } static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev) { +#ifdef CONFIG_SMP + /* + * After ->oncpu is cleared, the task can be moved to a different CPU. + * We must ensure this doesn't happen until the switch is completely + * finished. + */ + smp_wmb(); + prev->oncpu = 0; +#endif #ifdef CONFIG_DEBUG_SPINLOCK /* this is a valid case when another task releases the spinlock */ rq->lock.owner = current; @@ -896,18 +997,10 @@ static inline void finish_lock_switch(st */ spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_); - spin_unlock_irq(&rq->lock); + spin_unlock(&rq->lock); } #else /* __ARCH_WANT_UNLOCKED_CTXSW */ -static inline int task_running(struct rq *rq, struct task_struct *p) -{ -#ifdef CONFIG_SMP - return p->oncpu; -#else - return task_current(rq, p); -#endif -} static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next) { @@ -937,8 +1030,8 @@ static inline void finish_lock_switch(st smp_wmb(); prev->oncpu = 0; #endif -#ifndef __ARCH_WANT_INTERRUPTS_ON_CTXSW - local_irq_enable(); +#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW + local_irq_disable(); #endif } #endif /* __ARCH_WANT_UNLOCKED_CTXSW */ @@ -979,6 +1072,26 @@ static struct rq *task_rq_lock(struct ta } } +void curr_rq_lock_irq_save(unsigned long *flags) + __acquires(rq->lock) +{ + struct rq *rq; + + local_irq_save(*flags); + rq = cpu_rq(smp_processor_id()); + spin_lock(&rq->lock); +} + +void curr_rq_unlock_irq_restore(unsigned long *flags) + __releases(rq->lock) +{ + struct rq *rq; + + rq = cpu_rq(smp_processor_id()); + spin_unlock(&rq->lock); + local_irq_restore(*flags); +} + void task_rq_unlock_wait(struct task_struct *p) { struct rq *rq = task_rq(p); @@ -1093,7 +1206,7 @@ static void hrtick_start(struct rq *rq, if (rq == this_rq()) { hrtimer_restart(timer); } else if (!rq->hrtick_csd_pending) { - __smp_call_function_single(cpu_of(rq), &rq->hrtick_csd); + __smp_call_function_single(cpu_of(rq), &rq->hrtick_csd, 0); rq->hrtick_csd_pending = 1; } } @@ -1149,6 +1262,7 @@ static void init_rq_hrtick(struct rq *rq hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); rq->hrtick_timer.function = hrtick; + rq->hrtick_timer.irqsafe = 1; } #else /* CONFIG_SCHED_HRTICK */ static inline void hrtick_clear(struct rq *rq) @@ -1224,7 +1338,7 @@ void wake_up_idle_cpu(int cpu) { struct rq *rq = cpu_rq(cpu); - if (cpu == smp_processor_id()) + if (cpu == raw_smp_processor_id()) return; /* @@ -1610,21 +1724,42 @@ static inline void update_shares_locked( #endif +#ifdef CONFIG_PREEMPT + /* - * double_lock_balance - lock the busiest runqueue, this_rq is locked already. + * fair double_lock_balance: Safely acquires both rq->locks in a fair + * way at the expense of forcing extra atomic operations in all + * invocations. This assures that the double_lock is acquired using the + * same underlying policy as the spinlock_t on this architecture, which + * reduces latency compared to the unfair variant below. However, it + * also adds more overhead and therefore may reduce throughput. */ -static int double_lock_balance(struct rq *this_rq, struct rq *busiest) +static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest) + __releases(this_rq->lock) + __acquires(busiest->lock) + __acquires(this_rq->lock) +{ + spin_unlock(&this_rq->lock); + double_rq_lock(this_rq, busiest); + + return 1; +} + +#else +/* + * Unfair double_lock_balance: Optimizes throughput at the expense of + * latency by eliminating extra atomic operations when the locks are + * already in proper order on entry. This favors lower cpu-ids and will + * grant the double lock to lower cpus over higher ids under contention, + * regardless of entry order into the function. + */ +static int _double_lock_balance(struct rq *this_rq, struct rq *busiest) __releases(this_rq->lock) __acquires(busiest->lock) __acquires(this_rq->lock) { int ret = 0; - if (unlikely(!irqs_disabled())) { - /* printk() doesn't work good under rq->lock */ - spin_unlock(&this_rq->lock); - BUG_ON(1); - } if (unlikely(!spin_trylock(&busiest->lock))) { if (busiest < this_rq) { spin_unlock(&this_rq->lock); @@ -1637,6 +1772,22 @@ static int double_lock_balance(struct rq return ret; } +#endif /* CONFIG_PREEMPT */ + +/* + * double_lock_balance - lock the busiest runqueue, this_rq is locked already. + */ +static int double_lock_balance(struct rq *this_rq, struct rq *busiest) +{ + if (unlikely(!irqs_disabled())) { + /* printk() doesn't work good under rq->lock */ + spin_unlock(&this_rq->lock); + BUG_ON(1); + } + + return _double_lock_balance(this_rq, busiest); +} + static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest) __releases(busiest->lock) { @@ -1705,6 +1856,9 @@ static void update_avg(u64 *avg, u64 sam static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup) { + if (wakeup) + p->se.start_runtime = p->se.sum_exec_runtime; + sched_info_queued(p); p->sched_class->enqueue_task(rq, p, wakeup); p->se.on_rq = 1; @@ -1712,10 +1866,15 @@ static void enqueue_task(struct rq *rq, static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep) { - if (sleep && p->se.last_wakeup) { - update_avg(&p->se.avg_overlap, - p->se.sum_exec_runtime - p->se.last_wakeup); - p->se.last_wakeup = 0; + if (sleep) { + if (p->se.last_wakeup) { + update_avg(&p->se.avg_overlap, + p->se.sum_exec_runtime - p->se.last_wakeup); + p->se.last_wakeup = 0; + } else { + update_avg(&p->se.avg_wakeup, + sysctl_sched_wakeup_granularity); + } } sched_info_dequeued(p); @@ -1746,6 +1905,8 @@ static inline int normal_prio(struct tas prio = MAX_RT_PRIO-1 - p->rt_priority; else prio = __normal_prio(p); + +// trace_special_pid(p->pid, PRIO(p), __PRIO(prio)); return prio; } @@ -1885,12 +2046,15 @@ void set_task_cpu(struct task_struct *p, p->se.sleep_start -= clock_offset; if (p->se.block_start) p->se.block_start -= clock_offset; +#endif if (old_cpu != new_cpu) { - schedstat_inc(p, se.nr_migrations); + p->se.nr_migrations++; + new_rq->nr_migrations_in++; +#ifdef CONFIG_SCHEDSTATS if (task_hot(p, old_rq->clock, NULL)) schedstat_inc(p, se.nr_forced2_migrations); - } #endif + } p->se.vruntime -= old_cfsrq->min_vruntime - new_cfsrq->min_vruntime; @@ -2242,6 +2406,47 @@ static int sched_balance_self(int cpu, i #endif /* CONFIG_SMP */ +#ifdef CONFIG_DEBUG_PREEMPT +void notrace preempt_enable_no_resched(void) +{ + static int once = 1; + + barrier(); + dec_preempt_count(); + + if (once && !preempt_count()) { + once = 0; + printk(KERN_ERR "BUG: %s:%d task might have lost a preemption check!\n", + current->comm, current->pid); + dump_stack(); + } +} + +EXPORT_SYMBOL(preempt_enable_no_resched); +#endif + + +/** + * task_oncpu_function_call - call a function on the cpu on which a task runs + * @p: the task to evaluate + * @func: the function to be called + * @info: the function call argument + * + * Calls the function @func when the task is currently running. This might + * be on the current CPU, which just calls the function directly + */ +void task_oncpu_function_call(struct task_struct *p, + void (*func) (void *info), void *info) +{ + int cpu; + + preempt_disable(); + cpu = task_cpu(p); + if (task_curr(p)) + smp_call_function_single(cpu, func, info, 1); + preempt_enable(); +} + /*** * try_to_wake_up - wake up a thread * @p: the to-be-woken-up thread @@ -2256,7 +2461,8 @@ static int sched_balance_self(int cpu, i * * returns failure only if the task is already active. */ -static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync) +static int +try_to_wake_up(struct task_struct *p, unsigned int state, int sync, int mutex) { int cpu, orig_cpu, this_cpu, success = 0; unsigned long flags; @@ -2282,6 +2488,13 @@ static int try_to_wake_up(struct task_st } #endif +#ifdef CONFIG_PREEMPT_RT + /* + * sync wakeups can increase wakeup latencies: + */ + if (rt_task(p)) + sync = 0; +#endif smp_wmb(); rq = task_rq_lock(p, &flags); update_rq_clock(rq); @@ -2345,18 +2558,43 @@ out_activate: activate_task(rq, p, 1); success = 1; + /* + * Only attribute actual wakeups done by this task. + */ + if (!in_interrupt()) { + struct sched_entity *se = ¤t->se; + u64 sample = se->sum_exec_runtime; + + if (se->last_wakeup) + sample -= se->last_wakeup; + else + sample -= se->start_runtime; + update_avg(&se->avg_wakeup, sample); + + se->last_wakeup = se->sum_exec_runtime; + } + out_running: trace_sched_wakeup(rq, p, success); check_preempt_curr(rq, p, sync); - p->state = TASK_RUNNING; + /* + * For a mutex wakeup we or TASK_RUNNING_MUTEX to the task + * state to preserve the original state, so a real wakeup + * still can see the (UN)INTERRUPTIBLE bits in the state check + * above. We dont have to worry about the | TASK_RUNNING_MUTEX + * here. The waiter is serialized by the mutex lock and nobody + * else can fiddle with p->state as we hold rq lock. + */ + if (mutex) + p->state |= TASK_RUNNING_MUTEX; + else + p->state = TASK_RUNNING; #ifdef CONFIG_SMP if (p->sched_class->task_wake_up) p->sched_class->task_wake_up(rq, p); #endif out: - current->se.last_wakeup = current->se.sum_exec_runtime; - task_rq_unlock(rq, &flags); return success; @@ -2364,13 +2602,31 @@ out: int wake_up_process(struct task_struct *p) { - return try_to_wake_up(p, TASK_ALL, 0); + return try_to_wake_up(p, TASK_ALL, 0, 0); } EXPORT_SYMBOL(wake_up_process); +int wake_up_process_sync(struct task_struct * p) +{ + return try_to_wake_up(p, TASK_ALL, 1, 0); +} +EXPORT_SYMBOL(wake_up_process_sync); + +int wake_up_process_mutex(struct task_struct * p) +{ + return try_to_wake_up(p, TASK_ALL, 0, 1); +} +EXPORT_SYMBOL(wake_up_process_mutex); + +int wake_up_process_mutex_sync(struct task_struct * p) +{ + return try_to_wake_up(p, TASK_ALL, 1, 1); +} +EXPORT_SYMBOL(wake_up_process_mutex_sync); + int wake_up_state(struct task_struct *p, unsigned int state) { - return try_to_wake_up(p, state, 0); + return try_to_wake_up(p, state, 0, 0); } /* @@ -2384,8 +2640,11 @@ static void __sched_fork(struct task_str p->se.exec_start = 0; p->se.sum_exec_runtime = 0; p->se.prev_sum_exec_runtime = 0; + p->se.nr_migrations = 0; p->se.last_wakeup = 0; p->se.avg_overlap = 0; + p->se.start_runtime = 0; + p->se.avg_wakeup = sysctl_sched_wakeup_granularity; #ifdef CONFIG_SCHEDSTATS p->se.wait_start = 0; @@ -2441,13 +2700,15 @@ void sched_fork(struct task_struct *p, i if (likely(sched_info_on())) memset(&p->sched_info, 0, sizeof(p->sched_info)); #endif -#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW) +#if defined(CONFIG_SMP) p->oncpu = 0; #endif #ifdef CONFIG_PREEMPT /* Want to start with kernel preemption disabled. */ task_thread_info(p)->preempt_count = 1; #endif + plist_node_init(&p->pushable_tasks, MAX_PRIO); + put_cpu(); } @@ -2588,6 +2849,12 @@ static void finish_task_switch(struct rq { struct mm_struct *mm = rq->prev_mm; long prev_state; +#ifdef CONFIG_SMP + int post_schedule = 0; + + if (current->sched_class->needs_post_schedule) + post_schedule = current->sched_class->needs_post_schedule(rq); +#endif rq->prev_mm = NULL; @@ -2603,16 +2870,21 @@ static void finish_task_switch(struct rq * Manfred Spraul */ prev_state = prev->state; - finish_arch_switch(prev); + _finish_arch_switch(prev); + perf_counter_task_sched_in(current, cpu_of(rq)); finish_lock_switch(rq, prev); #ifdef CONFIG_SMP - if (current->sched_class->post_schedule) + if (post_schedule) current->sched_class->post_schedule(rq); #endif fire_sched_in_preempt_notifiers(current); + /* + * Delay the final freeing of the mm or task, so that we dont have + * to do complex work from within the scheduler: + */ if (mm) - mmdrop(mm); + mmdrop_delayed(mm); if (unlikely(prev_state == TASK_DEAD)) { /* * Remove function-return probe instances associated with this @@ -2630,12 +2902,15 @@ static void finish_task_switch(struct rq asmlinkage void schedule_tail(struct task_struct *prev) __releases(rq->lock) { - struct rq *rq = this_rq(); - - finish_task_switch(rq, prev); + preempt_disable(); + finish_task_switch(this_rq(), prev); + __preempt_enable_no_resched(); + local_irq_enable(); #ifdef __ARCH_WANT_UNLOCKED_CTXSW /* In this case, finish_task_switch does not reenable preemption */ preempt_enable(); +#else + preempt_check_resched(); #endif if (current->set_child_tid) put_user(task_pid_vnr(current), current->set_child_tid); @@ -2683,6 +2958,11 @@ context_switch(struct rq *rq, struct tas spin_release(&rq->lock.dep_map, 1, _THIS_IP_); #endif +#ifdef CURRENT_PTR + barrier(); + *current_ptr = next; + *current_ti_ptr = next->thread_info; +#endif /* Here we just switch the register state and the stack. */ switch_to(prev, next, prev); @@ -2729,6 +3009,11 @@ unsigned long nr_uninterruptible(void) return sum; } +unsigned long nr_uninterruptible_cpu(int cpu) +{ + return cpu_rq(cpu)->nr_uninterruptible; +} + unsigned long long nr_context_switches(void) { int i; @@ -2747,6 +3032,13 @@ unsigned long nr_iowait(void) for_each_possible_cpu(i) sum += atomic_read(&cpu_rq(i)->nr_iowait); + /* + * Since we read the counters lockless, it might be slightly + * inaccurate. Do not allow it to go below zero though: + */ + if (unlikely((long)sum < 0)) + sum = 0; + return sum; } @@ -2766,6 +3058,21 @@ unsigned long nr_active(void) } /* + * Externally visible per-cpu scheduler statistics: + * cpu_nr_switches(cpu) - number of context switches on that cpu + * cpu_nr_migrations(cpu) - number of migrations into that cpu + */ +u64 cpu_nr_switches(int cpu) +{ + return cpu_rq(cpu)->nr_switches; +} + +u64 cpu_nr_migrations(int cpu) +{ + return cpu_rq(cpu)->nr_migrations_in; +} + +/* * Update rq->cpu_load[] statistics. This function is usually called every * scheduler tick (TICK_NSEC). */ @@ -2987,6 +3294,16 @@ next: pulled++; rem_load_move -= p->se.load.weight; +#ifdef CONFIG_PREEMPT + /* + * NEWIDLE balancing is a source of latency, so preemptible kernels + * will stop after the first task is pulled to minimize the critical + * section. + */ + if (idle == CPU_NEWLY_IDLE) + goto out; +#endif + /* * We only want to steal up to the prescribed amount of weighted load. */ @@ -3033,9 +3350,15 @@ static int move_tasks(struct rq *this_rq sd, idle, all_pinned, &this_best_prio); class = class->next; +#ifdef CONFIG_PREEMPT + /* + * NEWIDLE balancing is a source of latency, so preemptible + * kernels will stop after the first task is pulled to minimize + * the critical section. + */ if (idle == CPU_NEWLY_IDLE && this_rq->nr_running) break; - +#endif } while (class && max_load_move > total_load_moved); return total_load_moved > 0; @@ -4017,7 +4340,7 @@ out: */ static void run_rebalance_domains(struct softirq_action *h) { - int this_cpu = smp_processor_id(); + int this_cpu = raw_smp_processor_id(); struct rq *this_rq = cpu_rq(this_cpu); enum cpu_idle_type idle = this_rq->idle_at_tick ? CPU_IDLE : CPU_NOT_IDLE; @@ -4137,6 +4460,29 @@ EXPORT_PER_CPU_SYMBOL(kstat); * Return any ns on the sched_clock that have not yet been banked in * @p in case that task is currently running. */ +unsigned long long __task_delta_exec(struct task_struct *p, int update) +{ + s64 delta_exec; + struct rq *rq; + + rq = task_rq(p); + WARN_ON_ONCE(!runqueue_is_locked()); + WARN_ON_ONCE(!task_current(rq, p)); + + if (update) + update_rq_clock(rq); + + delta_exec = rq->clock - p->se.exec_start; + + WARN_ON_ONCE(delta_exec < 0); + + return delta_exec; +} + +/* + * Return any ns on the sched_clock that have not yet been banked in + * @p in case that task is currently running. + */ unsigned long long task_delta_exec(struct task_struct *p) { unsigned long flags; @@ -4178,7 +4524,9 @@ void account_user_time(struct task_struc /* Add user time to cpustat. */ tmp = cputime_to_cputime64(cputime); - if (TASK_NICE(p) > 0) + if (rt_task(p)) + cpustat->user_rt = cputime64_add(cpustat->user_rt, tmp); + else if (TASK_NICE(p) > 0) cpustat->nice = cputime64_add(cpustat->nice, tmp); else cpustat->user = cputime64_add(cpustat->user, tmp); @@ -4236,10 +4584,12 @@ void account_system_time(struct task_str /* Add system time to cpustat. */ tmp = cputime_to_cputime64(cputime); - if (hardirq_count() - hardirq_offset) + if (hardirq_count() - hardirq_offset || (p->flags & PF_HARDIRQ)) cpustat->irq = cputime64_add(cpustat->irq, tmp); - else if (softirq_count()) + else if (softirq_count() || (p->flags & PF_SOFTIRQ)) cpustat->softirq = cputime64_add(cpustat->softirq, tmp); + else if (rt_task(p)) + cpustat->system_rt = cputime64_add(cpustat->system_rt, tmp); else cpustat->system = cputime64_add(cpustat->system, tmp); @@ -4392,10 +4742,14 @@ void scheduler_tick(void) sched_clock_tick(); + BUG_ON(!irqs_disabled()); + spin_lock(&rq->lock); update_rq_clock(rq); update_cpu_load(rq); - curr->sched_class->task_tick(rq, curr, 0); + if (curr != rq->idle && curr->se.on_rq) + curr->sched_class->task_tick(rq, curr, 0); + perf_counter_task_tick(curr, cpu); spin_unlock(&rq->lock); #ifdef CONFIG_SMP @@ -4404,10 +4758,7 @@ void scheduler_tick(void) #endif } -#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \ - defined(CONFIG_PREEMPT_TRACER)) - -static inline unsigned long get_parent_ip(unsigned long addr) +unsigned long notrace get_parent_ip(unsigned long addr) { if (in_lock_functions(addr)) { addr = CALLER_ADDR2; @@ -4417,6 +4768,9 @@ static inline unsigned long get_parent_i return addr; } +#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \ + defined(CONFIG_PREEMPT_TRACER)) + void __kprobes add_preempt_count(int val) { #ifdef CONFIG_DEBUG_PREEMPT @@ -4470,8 +4824,8 @@ static noinline void __schedule_bug(stru { struct pt_regs *regs = get_irq_regs(); - printk(KERN_ERR "BUG: scheduling while atomic: %s/%d/0x%08x\n", - prev->comm, prev->pid, preempt_count()); + printk(KERN_ERR "BUG: scheduling while atomic: %s/0x%08x/%d, CPU#%d\n", + prev->comm, preempt_count(), prev->pid, smp_processor_id()); debug_show_held_locks(prev); print_modules(); @@ -4489,12 +4843,14 @@ static noinline void __schedule_bug(stru */ static inline void schedule_debug(struct task_struct *prev) { +// WARN_ON(system_state == SYSTEM_BOOTING); + /* * Test if we are atomic. Since do_exit() needs to call into * schedule() atomically, we ignore that path for now. * Otherwise, whine if we are scheduling when we should not be. */ - if (unlikely(in_atomic_preempt_off() && !prev->exit_state)) + if (unlikely(in_atomic() && !prev->exit_state)) __schedule_bug(prev); profile_hit(SCHED_PROFILING, __builtin_return_address(0)); @@ -4543,15 +4899,13 @@ pick_next_task(struct rq *rq, struct tas /* * schedule() is the main scheduler function. */ -asmlinkage void __sched schedule(void) +asmlinkage void __sched __schedule(void) { struct task_struct *prev, *next; unsigned long *switch_count; struct rq *rq; int cpu; -need_resched: - preempt_disable(); cpu = smp_processor_id(); rq = cpu_rq(cpu); rcu_qsctr_inc(cpu); @@ -4559,10 +4913,11 @@ need_resched: switch_count = &prev->nivcsw; release_kernel_lock(prev); -need_resched_nonpreemptible: schedule_debug(prev); + preempt_disable(); + if (sched_feat(HRTICK)) hrtick_clear(rq); @@ -4570,14 +4925,20 @@ need_resched_nonpreemptible: update_rq_clock(rq); clear_tsk_need_resched(prev); - if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) { + if (!(prev->state & TASK_RUNNING_MUTEX) && prev->state && + !(preempt_count() & PREEMPT_ACTIVE)) { if (unlikely(signal_pending_state(prev->state, prev))) prev->state = TASK_RUNNING; - else + else { + touch_softlockup_watchdog(); deactivate_task(rq, prev, 1); + } switch_count = &prev->nvcsw; } + if (preempt_count() & PREEMPT_ACTIVE) + sub_preempt_count(PREEMPT_ACTIVE); + #ifdef CONFIG_SMP if (prev->sched_class->pre_schedule) prev->sched_class->pre_schedule(rq, prev); @@ -4591,6 +4952,7 @@ need_resched_nonpreemptible: if (likely(prev != next)) { sched_info_switch(prev, next); + perf_counter_task_sched_out(prev, cpu); rq->nr_switches++; rq->curr = next; @@ -4603,19 +4965,118 @@ need_resched_nonpreemptible: */ cpu = smp_processor_id(); rq = cpu_rq(cpu); - } else - spin_unlock_irq(&rq->lock); + __preempt_enable_no_resched(); + } else { + __preempt_enable_no_resched(); + spin_unlock(&rq->lock); + } + + reacquire_kernel_lock(current); +} - if (unlikely(reacquire_kernel_lock(current) < 0)) - goto need_resched_nonpreemptible; +asmlinkage void __sched schedule(void) +{ +need_resched: + local_irq_disable(); + __schedule(); + local_irq_enable(); - preempt_enable_no_resched(); if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) goto need_resched; } EXPORT_SYMBOL(schedule); +#if defined(CONFIG_SMP) && !defined(CONFIG_PREEMPT_RT) +/* + * Look out! "owner" is an entirely speculative pointer + * access and not reliable. + */ +int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner) +{ + unsigned int cpu; + struct rq *rq; + + if (!sched_feat(OWNER_SPIN)) + return 0; + +#ifdef CONFIG_DEBUG_PAGEALLOC + /* + * Need to access the cpu field knowing that + * DEBUG_PAGEALLOC could have unmapped it if + * the mutex owner just released it and exited. + */ + if (probe_kernel_address(&owner->cpu, cpu)) + goto out; +#else + cpu = owner->cpu; +#endif + + /* + * Even if the access succeeded (likely case), + * the cpu field may no longer be valid. + */ + if (cpu >= nr_cpumask_bits) + goto out; + + /* + * We need to validate that we can do a + * get_cpu() and that we have the percpu area. + */ + if (!cpu_online(cpu)) + goto out; + + rq = cpu_rq(cpu); + + for (;;) { + /* + * Owner changed, break to re-assess state. + */ + if (lock->owner != owner) + break; + + /* + * Is that owner really running on that cpu? + */ + if (task_thread_info(rq->curr) != owner || need_resched()) + return 0; + + cpu_relax(); + } +out: + return 1; +} +#endif + #ifdef CONFIG_PREEMPT + +/* + * Global flag to turn preemption off on a CONFIG_PREEMPT kernel: + */ +int kernel_preemption = 1; + +static int __init preempt_setup (char *str) +{ + if (!strncmp(str, "off", 3)) { + if (kernel_preemption) { + printk(KERN_INFO "turning off kernel preemption!\n"); + kernel_preemption = 0; + } + return 1; + } + if (!strncmp(str, "on", 2)) { + if (!kernel_preemption) { + printk(KERN_INFO "turning on kernel preemption!\n"); + kernel_preemption = 1; + } + return 1; + } + get_option(&str, &kernel_preemption); + + return 1; +} + +__setup("preempt=", preempt_setup); + /* * this is the entry point to schedule() from in-kernel preemption * off of preempt_enable. Kernel preemptions off return from interrupt @@ -4624,7 +5085,11 @@ EXPORT_SYMBOL(schedule); asmlinkage void __sched preempt_schedule(void) { struct thread_info *ti = current_thread_info(); + struct task_struct *task = current; + int saved_lock_depth; + if (!kernel_preemption) + return; /* * If there is a non-zero preempt_count or interrupts are disabled, * we do not want to preempt the current task. Just return.. @@ -4633,9 +5098,19 @@ asmlinkage void __sched preempt_schedule return; do { + local_irq_disable(); add_preempt_count(PREEMPT_ACTIVE); - schedule(); - sub_preempt_count(PREEMPT_ACTIVE); + + /* + * We keep the big kernel semaphore locked, but we + * clear ->lock_depth so that schedule() doesnt + * auto-release the semaphore: + */ + saved_lock_depth = task->lock_depth; + task->lock_depth = -1; + __schedule(); + task->lock_depth = saved_lock_depth; + local_irq_enable(); /* * Check again in case we missed a preemption opportunity @@ -4647,24 +5122,40 @@ asmlinkage void __sched preempt_schedule EXPORT_SYMBOL(preempt_schedule); /* - * this is the entry point to schedule() from kernel preemption - * off of irq context. - * Note, that this is called and return with irqs disabled. This will - * protect us against recursive calling from irq. + * this is is the entry point for the IRQ return path. Called with + * interrupts disabled. To avoid infinite irq-entry recursion problems + * with fast-paced IRQ sources we do all of this carefully to never + * enable interrupts again. */ asmlinkage void __sched preempt_schedule_irq(void) { struct thread_info *ti = current_thread_info(); + struct task_struct *task = current; + int saved_lock_depth; - /* Catch callers which need to be fixed */ - BUG_ON(ti->preempt_count || !irqs_disabled()); + if (!kernel_preemption) + return; + /* + * If there is a non-zero preempt_count then just return. + * (interrupts are disabled) + */ + if (unlikely(ti->preempt_count)) + return; do { + local_irq_disable(); add_preempt_count(PREEMPT_ACTIVE); - local_irq_enable(); - schedule(); + + /* + * We keep the big kernel semaphore locked, but we + * clear ->lock_depth so that schedule() doesnt + * auto-release the semaphore: + */ + saved_lock_depth = task->lock_depth; + task->lock_depth = -1; + __schedule(); local_irq_disable(); - sub_preempt_count(PREEMPT_ACTIVE); + task->lock_depth = saved_lock_depth; /* * Check again in case we missed a preemption opportunity @@ -4679,7 +5170,7 @@ asmlinkage void __sched preempt_schedule int default_wake_function(wait_queue_t *curr, unsigned mode, int sync, void *key) { - return try_to_wake_up(curr->private, mode, sync); + return try_to_wake_up(curr->private, mode, sync, 0); } EXPORT_SYMBOL(default_wake_function); @@ -4719,7 +5210,7 @@ void __wake_up(wait_queue_head_t *q, uns unsigned long flags; spin_lock_irqsave(&q->lock, flags); - __wake_up_common(q, mode, nr_exclusive, 0, key); + __wake_up_common(q, mode, nr_exclusive, 1, key); spin_unlock_irqrestore(&q->lock, flags); } EXPORT_SYMBOL(__wake_up); @@ -4778,7 +5269,7 @@ void complete(struct completion *x) spin_lock_irqsave(&x->wait.lock, flags); x->done++; - __wake_up_common(&x->wait, TASK_NORMAL, 1, 0, NULL); + __wake_up_common(&x->wait, TASK_NORMAL, 1, 1, NULL); spin_unlock_irqrestore(&x->wait.lock, flags); } EXPORT_SYMBOL(complete); @@ -4795,7 +5286,7 @@ void complete_all(struct completion *x) spin_lock_irqsave(&x->wait.lock, flags); x->done += UINT_MAX/2; - __wake_up_common(&x->wait, TASK_NORMAL, 0, 0, NULL); + __wake_up_common(&x->wait, TASK_NORMAL, 0, 1, NULL); spin_unlock_irqrestore(&x->wait.lock, flags); } EXPORT_SYMBOL(complete_all); @@ -5009,19 +5500,19 @@ long __sched sleep_on_timeout(wait_queue } EXPORT_SYMBOL(sleep_on_timeout); -#ifdef CONFIG_RT_MUTEXES - /* - * rt_mutex_setprio - set the current priority of a task + * task_setprio - set the current priority of a task * @p: task * @prio: prio value (kernel-internal form) * * This function changes the 'effective' priority of a task. It does * not touch ->normal_prio like __setscheduler(). * - * Used by the rt_mutex code to implement priority inheritance logic. + * Used by the rt_mutex code to implement priority inheritance logic + * and by rcupreempt-boost to boost priorities of tasks sleeping + * with rcu locks. */ -void rt_mutex_setprio(struct task_struct *p, int prio) +void task_setprio(struct task_struct *p, int prio) { unsigned long flags; int oldprio, on_rq, running; @@ -5031,6 +5522,25 @@ void rt_mutex_setprio(struct task_struct BUG_ON(prio < 0 || prio > MAX_PRIO); rq = task_rq_lock(p, &flags); + + /* + * Idle task boosting is a nono in general. There is one + * exception, when NOHZ is active: + * + * The idle task calls get_next_timer_interrupt() and holds + * the timer wheel base->lock on the CPU and another CPU wants + * to access the timer (probably to cancel it). We can safely + * ignore the boosting request, as the idle CPU runs this code + * with interrupts disabled and will complete the lock + * protected section without being interrupted. So there is no + * real need to boost. + */ + if (unlikely(p == rq->idle)) { + WARN_ON(p != rq->curr); + WARN_ON(p->pi_blocked_on); + goto out_unlock; + } + update_rq_clock(rq); oldprio = p->prio; @@ -5048,6 +5558,8 @@ void rt_mutex_setprio(struct task_struct p->prio = prio; +// trace_special_pid(p->pid, __PRIO(oldprio), PRIO(p)); + if (running) p->sched_class->set_curr_task(rq); if (on_rq) { @@ -5055,11 +5567,12 @@ void rt_mutex_setprio(struct task_struct check_class_changed(rq, p, prev_class, oldprio, running); } +// trace_special(prev_resched, _need_resched(), 0); + +out_unlock: task_rq_unlock(rq, &flags); } -#endif - void set_user_nice(struct task_struct *p, long nice) { int old_prio, delta, on_rq; @@ -5145,7 +5658,7 @@ SYSCALL_DEFINE1(nice, int, increment) if (increment > 40) increment = 40; - nice = PRIO_TO_NICE(current->static_prio) + increment; + nice = TASK_NICE(current) + increment; if (nice < -20) nice = -20; if (nice > 19) @@ -5694,19 +6207,53 @@ SYSCALL_DEFINE0(sched_yield) * Since we are going to call schedule() anyway, there's * no need to preempt or enable interrupts: */ - __release(rq->lock); - spin_release(&rq->lock.dep_map, 1, _THIS_IP_); - _raw_spin_unlock(&rq->lock); - preempt_enable_no_resched(); + spin_unlock_no_resched(&rq->lock); - schedule(); + __schedule(); + + local_irq_enable(); + preempt_check_resched(); return 0; } +#if defined(CONFIG_DEBUG_SPINLOCK_SLEEP) || defined(CONFIG_DEBUG_PREEMPT) +void __might_sleep(char *file, int line) +{ +#ifdef in_atomic + static unsigned long prev_jiffy; /* ratelimiting */ + + if ((!in_atomic() && !irqs_disabled()) || + system_state != SYSTEM_RUNNING || oops_in_progress) + return; + + if (debug_direct_keyboard && hardirq_count()) + return; + + if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy) + return; + prev_jiffy = jiffies; + + printk(KERN_ERR + "BUG: sleeping function called from invalid context at %s:%d\n", + file, line); + printk(KERN_ERR + "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n", + in_atomic(), irqs_disabled(), + current->pid, current->comm); + + debug_show_held_locks(current); + if (irqs_disabled()) + print_irqtrace_events(current); + dump_stack(); +#endif +} +EXPORT_SYMBOL(__might_sleep); +#endif + static void __cond_resched(void) { -#ifdef CONFIG_DEBUG_SPINLOCK_SLEEP +#if defined(CONFIG_DEBUG_SPINLOCK_SLEEP) || defined(CONFIG_DEBUG_PREEMPT) __might_sleep(__FILE__, __LINE__); #endif /* @@ -5715,10 +6262,11 @@ static void __cond_resched(void) * cond_resched() call. */ do { + local_irq_disable(); add_preempt_count(PREEMPT_ACTIVE); - schedule(); - sub_preempt_count(PREEMPT_ACTIVE); + __schedule(); } while (need_resched()); + local_irq_enable(); } int __sched _cond_resched(void) @@ -5740,13 +6288,13 @@ EXPORT_SYMBOL(_cond_resched); * operations here to prevent schedule() from being called twice (once via * spin_unlock(), once by hand). */ -int cond_resched_lock(spinlock_t *lock) +int __cond_resched_raw_spinlock(raw_spinlock_t *lock) { int resched = need_resched() && system_state == SYSTEM_RUNNING; int ret = 0; if (spin_needbreak(lock) || resched) { - spin_unlock(lock); + spin_unlock_no_resched(lock); if (resched && need_resched()) __cond_resched(); else @@ -5756,12 +6304,37 @@ int cond_resched_lock(spinlock_t *lock) } return ret; } -EXPORT_SYMBOL(cond_resched_lock); +EXPORT_SYMBOL(__cond_resched_raw_spinlock); -int __sched cond_resched_softirq(void) +#ifdef CONFIG_PREEMPT_RT + +int __cond_resched_spinlock(spinlock_t *lock) { - BUG_ON(!in_softirq()); +#if (defined(CONFIG_SMP) && defined(CONFIG_PREEMPT)) || defined(CONFIG_PREEMPT_RT) + if (lock->break_lock) { + lock->break_lock = 0; + spin_unlock_no_resched(lock); + __cond_resched(); + spin_lock(lock); + return 1; + } +#endif + return 0; +} +EXPORT_SYMBOL(__cond_resched_spinlock); +#endif + +/* + * Voluntarily preempt a process context that has softirqs disabled: + */ +int __sched cond_resched_softirq(void) +{ +#ifndef CONFIG_PREEMPT_SOFTIRQS + WARN_ON_ONCE(!in_softirq()); + if (!in_softirq()) + return 0; +#endif if (need_resched() && system_state == SYSTEM_RUNNING) { local_bh_enable(); __cond_resched(); @@ -5772,17 +6345,102 @@ int __sched cond_resched_softirq(void) } EXPORT_SYMBOL(cond_resched_softirq); +/* + * Voluntarily preempt a softirq context (possible with softirq threading): + */ +int __sched cond_resched_softirq_context(void) +{ + WARN_ON_ONCE(!in_softirq()); + + if (softirq_need_resched() && system_state == SYSTEM_RUNNING) { + raw_local_irq_disable(); + _local_bh_enable(); + raw_local_irq_enable(); + __cond_resched(); + local_bh_disable(); + return 1; + } + return 0; +} +EXPORT_SYMBOL(cond_resched_softirq_context); + +/* + * Preempt a hardirq context if necessary (possible with hardirq threading): + */ +int cond_resched_hardirq_context(void) +{ + WARN_ON_ONCE(!in_irq()); + WARN_ON_ONCE(!irqs_disabled()); + + if (hardirq_need_resched()) { +#ifndef CONFIG_PREEMPT_RT + irq_exit(); +#endif + local_irq_enable(); + __cond_resched(); +#ifndef CONFIG_PREEMPT_RT + local_irq_disable(); + __irq_enter(); +#endif + + return 1; + } + return 0; +} +EXPORT_SYMBOL(cond_resched_hardirq_context); + +#ifdef CONFIG_PREEMPT_VOLUNTARY + +int voluntary_preemption = 1; + +EXPORT_SYMBOL(voluntary_preemption); + +static int __init voluntary_preempt_setup (char *str) +{ + if (!strncmp(str, "off", 3)) + voluntary_preemption = 0; + else + get_option(&str, &voluntary_preemption); + if (!voluntary_preemption) + printk("turning off voluntary preemption!\n"); + + return 1; +} + +__setup("voluntary-preempt=", voluntary_preempt_setup); + +#endif + /** * yield - yield the current processor to other threads. * * This is a shortcut for kernel-space yielding - it marks the * thread runnable and calls sys_sched_yield(). */ -void __sched yield(void) +void __sched __yield(void) { set_current_state(TASK_RUNNING); sys_sched_yield(); } + +void __sched yield(void) +{ + static int once = 1; + + /* + * it's a bug to rely on yield() with RT priorities. We print + * the first occurance after bootup ... this will still give + * us an idea about the scope of the problem, without spamming + * the syslog: + */ + if (once && rt_task(current)) { + once = 0; + printk(KERN_ERR "BUG: %s:%d RT task yield()-ing!\n", + current->comm, current->pid); + dump_stack(); + } + __yield(); +} EXPORT_SYMBOL(yield); /* @@ -5930,26 +6588,26 @@ void sched_show_task(struct task_struct unsigned state; state = p->state ? __ffs(p->state) + 1 : 0; - printk(KERN_INFO "%-13.13s %c", p->comm, - state < sizeof(stat_nam) - 1 ? stat_nam[state] : '?'); + printk("%-13.13s %c (%03lx) [%p]", p->comm, + state < sizeof(stat_nam) - 1 ? stat_nam[state] : '?', + (unsigned long) p->state, p); #if BITS_PER_LONG == 32 - if (state == TASK_RUNNING) + if (0 && (state == TASK_RUNNING)) printk(KERN_CONT " running "); else printk(KERN_CONT " %08lx ", thread_saved_pc(p)); #else - if (state == TASK_RUNNING) + if (0 && (state == TASK_RUNNING)) printk(KERN_CONT " running task "); else printk(KERN_CONT " %016lx ", thread_saved_pc(p)); #endif + if (task_curr(p)) + printk("[curr] "); + else if (p->se.on_rq) + printk("[on rq #%d] ", task_cpu(p)); #ifdef CONFIG_DEBUG_STACK_USAGE - { - unsigned long *n = end_of_stack(p); - while (!*n) - n++; - free = (unsigned long)n - (unsigned long)end_of_stack(p); - } + free = stack_not_used(p); #endif printk(KERN_CONT "%5lu %5d %6d\n", free, task_pid_nr(p), task_pid_nr(p->real_parent)); @@ -5960,6 +6618,7 @@ void sched_show_task(struct task_struct void show_state_filter(unsigned long state_filter) { struct task_struct *g, *p; + int do_unlock = 1; #if BITS_PER_LONG == 32 printk(KERN_INFO @@ -5968,7 +6627,16 @@ void show_state_filter(unsigned long sta printk(KERN_INFO " task PC stack pid father\n"); #endif +#ifdef CONFIG_PREEMPT_RT + if (!read_trylock(&tasklist_lock)) { + printk("hm, tasklist_lock write-locked.\n"); + printk("ignoring ...\n"); + do_unlock = 0; + } +#else read_lock(&tasklist_lock); +#endif + do_each_thread(g, p) { /* * reset the NMI-timeout, listing all files on a slow @@ -5984,7 +6652,8 @@ void show_state_filter(unsigned long sta #ifdef CONFIG_SCHED_DEBUG sysrq_sched_debug_show(); #endif - read_unlock(&tasklist_lock); + if (do_unlock) + read_unlock(&tasklist_lock); /* * Only show locks if all tasks are dumped: */ @@ -6020,17 +6689,14 @@ void __cpuinit init_idle(struct task_str __set_task_cpu(idle, cpu); rq->curr = rq->idle = idle; -#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW) +#if defined(CONFIG_SMP) idle->oncpu = 1; #endif spin_unlock_irqrestore(&rq->lock, flags); /* Set the preempt count _outside_ the spinlocks! */ -#if defined(CONFIG_PREEMPT) - task_thread_info(idle)->preempt_count = (idle->lock_depth >= 0); -#else task_thread_info(idle)->preempt_count = 0; -#endif + /* * The idle tasks have their own, simple scheduling class: */ @@ -6159,11 +6825,18 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr); static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu) { struct rq *rq_dest, *rq_src; + unsigned long flags; int ret = 0, on_rq; if (unlikely(!cpu_active(dest_cpu))) return ret; + /* + * PREEMPT_RT: this relies on write_lock_irq(&tasklist_lock) + * disabling interrupts - which on PREEMPT_RT does not do: + */ + local_irq_save(flags); + rq_src = cpu_rq(src_cpu); rq_dest = cpu_rq(dest_cpu); @@ -6188,6 +6861,8 @@ done: ret = 1; fail: double_rq_unlock(rq_src, rq_dest); + local_irq_restore(flags); + return ret; } @@ -8218,11 +8893,15 @@ static void init_rt_rq(struct rt_rq *rt_ __set_bit(MAX_RT_PRIO, array->bitmap); #if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED - rt_rq->highest_prio = MAX_RT_PRIO; + rt_rq->highest_prio.curr = MAX_RT_PRIO; +#ifdef CONFIG_SMP + rt_rq->highest_prio.next = MAX_RT_PRIO; +#endif #endif #ifdef CONFIG_SMP rt_rq->rt_nr_migratory = 0; rt_rq->overloaded = 0; + plist_head_init(&rq->rt.pushable_tasks, &rq->lock); #endif rt_rq->rt_time = 0; @@ -8481,6 +9160,9 @@ void __init sched_init(void) atomic_inc(&init_mm.mm_count); enter_lazy_tlb(&init_mm, current); +#ifdef CONFIG_PREEMPT_RT + printk("Real-Time Preemption Support (C) 2004-2007 Ingo Molnar\n"); +#endif /* * Make us the idle thread. Technically, schedule() should not be * called from this thread, however somewhere below it might be, @@ -8505,36 +9187,6 @@ void __init sched_init(void) scheduler_running = 1; } -#ifdef CONFIG_DEBUG_SPINLOCK_SLEEP -void __might_sleep(char *file, int line) -{ -#ifdef in_atomic - static unsigned long prev_jiffy; /* ratelimiting */ - - if ((!in_atomic() && !irqs_disabled()) || - system_state != SYSTEM_RUNNING || oops_in_progress) - return; - if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy) - return; - prev_jiffy = jiffies; - - printk(KERN_ERR - "BUG: sleeping function called from invalid context at %s:%d\n", - file, line); - printk(KERN_ERR - "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n", - in_atomic(), irqs_disabled(), - current->pid, current->comm); - - debug_show_held_locks(current); - if (irqs_disabled()) - print_irqtrace_events(current); - dump_stack(); -#endif -} -EXPORT_SYMBOL(__might_sleep); -#endif - #ifdef CONFIG_MAGIC_SYSRQ static void normalize_task(struct rq *rq, struct task_struct *p) { Index: linux-2.6-tip/kernel/sched_debug.c =================================================================== --- linux-2.6-tip.orig/kernel/sched_debug.c +++ linux-2.6-tip/kernel/sched_debug.c @@ -281,6 +281,19 @@ static void print_cpu(struct seq_file *m P(cpu_load[2]); P(cpu_load[3]); P(cpu_load[4]); +#ifdef CONFIG_PREEMPT_RT + /* Print rt related rq stats */ + P(rt.rt_nr_running); + P(rt.rt_nr_uninterruptible); +# ifdef CONFIG_SCHEDSTATS + P(rto_schedule); + P(rto_schedule_tail); + P(rto_wakeup); + P(rto_pulled); + P(rto_pushed); +# endif +#endif + #undef P #undef PN @@ -397,6 +410,7 @@ void proc_sched_show_task(struct task_st PN(se.vruntime); PN(se.sum_exec_runtime); PN(se.avg_overlap); + PN(se.avg_wakeup); nr_switches = p->nvcsw + p->nivcsw; Index: linux-2.6-tip/kernel/sched_fair.c =================================================================== --- linux-2.6-tip.orig/kernel/sched_fair.c +++ linux-2.6-tip/kernel/sched_fair.c @@ -1314,16 +1314,63 @@ out: } #endif /* CONFIG_SMP */ -static unsigned long wakeup_gran(struct sched_entity *se) +/* + * Adaptive granularity + * + * se->avg_wakeup gives the average time a task runs until it does a wakeup, + * with the limit of wakeup_gran -- when it never does a wakeup. + * + * So the smaller avg_wakeup is the faster we want this task to preempt, + * but we don't want to treat the preemptee unfairly and therefore allow it + * to run for at least the amount of time we'd like to run. + * + * NOTE: we use 2*avg_wakeup to increase the probability of actually doing one + * + * NOTE: we use *nr_running to scale with load, this nicely matches the + * degrading latency on load. + */ +static unsigned long +adaptive_gran(struct sched_entity *curr, struct sched_entity *se) +{ + u64 this_run = curr->sum_exec_runtime - curr->prev_sum_exec_runtime; + u64 expected_wakeup = 2*se->avg_wakeup * cfs_rq_of(se)->nr_running; + u64 gran = 0; + + if (this_run < expected_wakeup) + gran = expected_wakeup - this_run; + + return min_t(s64, gran, sysctl_sched_wakeup_granularity); +} + +static unsigned long +wakeup_gran(struct sched_entity *curr, struct sched_entity *se) { unsigned long gran = sysctl_sched_wakeup_granularity; + if (cfs_rq_of(curr)->curr && sched_feat(ADAPTIVE_GRAN)) + gran = adaptive_gran(curr, se); + /* - * More easily preempt - nice tasks, while not making it harder for - * + nice tasks. + * Since its curr running now, convert the gran from real-time + * to virtual-time in his units. */ - if (!sched_feat(ASYM_GRAN) || se->load.weight > NICE_0_LOAD) - gran = calc_delta_fair(sysctl_sched_wakeup_granularity, se); + if (sched_feat(ASYM_GRAN)) { + /* + * By using 'se' instead of 'curr' we penalize light tasks, so + * they get preempted easier. That is, if 'se' < 'curr' then + * the resulting gran will be larger, therefore penalizing the + * lighter, if otoh 'se' > 'curr' then the resulting gran will + * be smaller, again penalizing the lighter task. + * + * This is especially important for buddies when the leftmost + * task is higher priority than the buddy. + */ + if (unlikely(se->load.weight != NICE_0_LOAD)) + gran = calc_delta_fair(gran, se); + } else { + if (unlikely(curr->load.weight != NICE_0_LOAD)) + gran = calc_delta_fair(gran, curr); + } return gran; } @@ -1350,7 +1397,7 @@ wakeup_preempt_entity(struct sched_entit if (vdiff <= 0) return -1; - gran = wakeup_gran(curr); + gran = wakeup_gran(curr, se); if (vdiff > gran) return 1; Index: linux-2.6-tip/kernel/sched_features.h =================================================================== --- linux-2.6-tip.orig/kernel/sched_features.h +++ linux-2.6-tip/kernel/sched_features.h @@ -1,5 +1,6 @@ SCHED_FEAT(NEW_FAIR_SLEEPERS, 1) -SCHED_FEAT(NORMALIZED_SLEEPER, 1) +SCHED_FEAT(NORMALIZED_SLEEPER, 0) +SCHED_FEAT(ADAPTIVE_GRAN, 1) SCHED_FEAT(WAKEUP_PREEMPT, 1) SCHED_FEAT(START_DEBIT, 1) SCHED_FEAT(AFFINE_WAKEUPS, 1) @@ -13,3 +14,4 @@ SCHED_FEAT(LB_WAKEUP_UPDATE, 1) SCHED_FEAT(ASYM_EFF_LOAD, 1) SCHED_FEAT(WAKEUP_OVERLAP, 0) SCHED_FEAT(LAST_BUDDY, 1) +SCHED_FEAT(OWNER_SPIN, 1) Index: linux-2.6-tip/kernel/sched_rt.c =================================================================== --- linux-2.6-tip.orig/kernel/sched_rt.c +++ linux-2.6-tip/kernel/sched_rt.c @@ -3,6 +3,40 @@ * policies) */ +static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se) +{ + return container_of(rt_se, struct task_struct, rt); +} + +#ifdef CONFIG_RT_GROUP_SCHED + +static inline struct rq *rq_of_rt_rq(struct rt_rq *rt_rq) +{ + return rt_rq->rq; +} + +static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se) +{ + return rt_se->rt_rq; +} + +#else /* CONFIG_RT_GROUP_SCHED */ + +static inline struct rq *rq_of_rt_rq(struct rt_rq *rt_rq) +{ + return container_of(rt_rq, struct rq, rt); +} + +static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se) +{ + struct task_struct *p = rt_task_of(rt_se); + struct rq *rq = task_rq(p); + + return &rq->rt; +} + +#endif /* CONFIG_RT_GROUP_SCHED */ + #ifdef CONFIG_SMP static inline int rt_overloaded(struct rq *rq) @@ -37,25 +71,69 @@ static inline void rt_clear_overload(str cpumask_clear_cpu(rq->cpu, rq->rd->rto_mask); } -static void update_rt_migration(struct rq *rq) +static void update_rt_migration(struct rt_rq *rt_rq) { - if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1)) { - if (!rq->rt.overloaded) { - rt_set_overload(rq); - rq->rt.overloaded = 1; + if (rt_rq->rt_nr_migratory && (rt_rq->rt_nr_running > 1)) { + if (!rt_rq->overloaded) { + rt_set_overload(rq_of_rt_rq(rt_rq)); + rt_rq->overloaded = 1; } - } else if (rq->rt.overloaded) { - rt_clear_overload(rq); - rq->rt.overloaded = 0; + } else if (rt_rq->overloaded) { + rt_clear_overload(rq_of_rt_rq(rt_rq)); + rt_rq->overloaded = 0; } } -#endif /* CONFIG_SMP */ -static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se) +static void inc_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) { - return container_of(rt_se, struct task_struct, rt); + if (rt_se->nr_cpus_allowed > 1) + rt_rq->rt_nr_migratory++; + + update_rt_migration(rt_rq); } +static void dec_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) +{ + if (rt_se->nr_cpus_allowed > 1) + rt_rq->rt_nr_migratory--; + + update_rt_migration(rt_rq); +} + +static void enqueue_pushable_task(struct rq *rq, struct task_struct *p) +{ + plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks); + plist_node_init(&p->pushable_tasks, p->prio); + plist_add(&p->pushable_tasks, &rq->rt.pushable_tasks); +} + +static void dequeue_pushable_task(struct rq *rq, struct task_struct *p) +{ + plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks); +} + +#else + +static inline void enqueue_pushable_task(struct rq *rq, struct task_struct *p) +{ +} + +static inline void dequeue_pushable_task(struct rq *rq, struct task_struct *p) +{ +} + +static inline +void inc_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) +{ +} + +static inline +void dec_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) +{ +} + +#endif /* CONFIG_SMP */ + static inline int on_rt_rq(struct sched_rt_entity *rt_se) { return !list_empty(&rt_se->run_list); @@ -79,16 +157,6 @@ static inline u64 sched_rt_period(struct #define for_each_leaf_rt_rq(rt_rq, rq) \ list_for_each_entry_rcu(rt_rq, &rq->leaf_rt_rq_list, leaf_rt_rq_list) -static inline struct rq *rq_of_rt_rq(struct rt_rq *rt_rq) -{ - return rt_rq->rq; -} - -static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se) -{ - return rt_se->rt_rq; -} - #define for_each_sched_rt_entity(rt_se) \ for (; rt_se; rt_se = rt_se->parent) @@ -108,7 +176,7 @@ static void sched_rt_rq_enqueue(struct r if (rt_rq->rt_nr_running) { if (rt_se && !on_rt_rq(rt_se)) enqueue_rt_entity(rt_se); - if (rt_rq->highest_prio < curr->prio) + if (rt_rq->highest_prio.curr < curr->prio) resched_task(curr); } } @@ -176,19 +244,6 @@ static inline u64 sched_rt_period(struct #define for_each_leaf_rt_rq(rt_rq, rq) \ for (rt_rq = &rq->rt; rt_rq; rt_rq = NULL) -static inline struct rq *rq_of_rt_rq(struct rt_rq *rt_rq) -{ - return container_of(rt_rq, struct rq, rt); -} - -static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se) -{ - struct task_struct *p = rt_task_of(rt_se); - struct rq *rq = task_rq(p); - - return &rq->rt; -} - #define for_each_sched_rt_entity(rt_se) \ for (; rt_se; rt_se = NULL) @@ -473,7 +528,7 @@ static inline int rt_se_prio(struct sche struct rt_rq *rt_rq = group_rt_rq(rt_se); if (rt_rq) - return rt_rq->highest_prio; + return rt_rq->highest_prio.curr; #endif return rt_task_of(rt_se)->prio; @@ -547,91 +602,174 @@ static void update_curr_rt(struct rq *rq } } -static inline -void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) +#if defined CONFIG_SMP + +static struct task_struct *pick_next_highest_task_rt(struct rq *rq, int cpu); + +static inline int next_prio(struct rq *rq) { - WARN_ON(!rt_prio(rt_se_prio(rt_se))); - rt_rq->rt_nr_running++; -#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED - if (rt_se_prio(rt_se) < rt_rq->highest_prio) { -#ifdef CONFIG_SMP - struct rq *rq = rq_of_rt_rq(rt_rq); -#endif + struct task_struct *next = pick_next_highest_task_rt(rq, rq->cpu); + + if (next && rt_prio(next->prio)) + return next->prio; + else + return MAX_RT_PRIO; +} + +static void +inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio) +{ + struct rq *rq = rq_of_rt_rq(rt_rq); + + if (prio < prev_prio) { + + /* + * If the new task is higher in priority than anything on the + * run-queue, we know that the previous high becomes our + * next-highest. + */ + rt_rq->highest_prio.next = prev_prio; - rt_rq->highest_prio = rt_se_prio(rt_se); -#ifdef CONFIG_SMP if (rq->online) - cpupri_set(&rq->rd->cpupri, rq->cpu, - rt_se_prio(rt_se)); -#endif - } -#endif -#ifdef CONFIG_SMP - if (rt_se->nr_cpus_allowed > 1) { - struct rq *rq = rq_of_rt_rq(rt_rq); + cpupri_set(&rq->rd->cpupri, rq->cpu, prio); - rq->rt.rt_nr_migratory++; - } + } else if (prio == rt_rq->highest_prio.curr) + /* + * If the next task is equal in priority to the highest on + * the run-queue, then we implicitly know that the next highest + * task cannot be any lower than current + */ + rt_rq->highest_prio.next = prio; + else if (prio < rt_rq->highest_prio.next) + /* + * Otherwise, we need to recompute next-highest + */ + rt_rq->highest_prio.next = next_prio(rq); +} - update_rt_migration(rq_of_rt_rq(rt_rq)); -#endif -#ifdef CONFIG_RT_GROUP_SCHED - if (rt_se_boosted(rt_se)) - rt_rq->rt_nr_boosted++; +static void +dec_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio) +{ + struct rq *rq = rq_of_rt_rq(rt_rq); - if (rt_rq->tg) - start_rt_bandwidth(&rt_rq->tg->rt_bandwidth); -#else - start_rt_bandwidth(&def_rt_bandwidth); -#endif + if (rt_rq->rt_nr_running && (prio <= rt_rq->highest_prio.next)) + rt_rq->highest_prio.next = next_prio(rq); + + if (rq->online && rt_rq->highest_prio.curr != prev_prio) + cpupri_set(&rq->rd->cpupri, rq->cpu, rt_rq->highest_prio.curr); } +#else /* CONFIG_SMP */ + static inline -void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) -{ -#ifdef CONFIG_SMP - int highest_prio = rt_rq->highest_prio; -#endif +void inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio) {} +static inline +void dec_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio) {} + +#endif /* CONFIG_SMP */ - WARN_ON(!rt_prio(rt_se_prio(rt_se))); - WARN_ON(!rt_rq->rt_nr_running); - rt_rq->rt_nr_running--; #if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED +static void +inc_rt_prio(struct rt_rq *rt_rq, int prio) +{ + int prev_prio = rt_rq->highest_prio.curr; + + if (prio < prev_prio) + rt_rq->highest_prio.curr = prio; + + inc_rt_prio_smp(rt_rq, prio, prev_prio); +} + +static void +dec_rt_prio(struct rt_rq *rt_rq, int prio) +{ + int prev_prio = rt_rq->highest_prio.curr; + if (rt_rq->rt_nr_running) { - struct rt_prio_array *array; - WARN_ON(rt_se_prio(rt_se) < rt_rq->highest_prio); - if (rt_se_prio(rt_se) == rt_rq->highest_prio) { - /* recalculate */ - array = &rt_rq->active; - rt_rq->highest_prio = + WARN_ON(prio < prev_prio); + + /* + * This may have been our highest task, and therefore + * we may have some recomputation to do + */ + if (prio == prev_prio) { + struct rt_prio_array *array = &rt_rq->active; + + rt_rq->highest_prio.curr = sched_find_first_bit(array->bitmap); - } /* otherwise leave rq->highest prio alone */ + } + } else - rt_rq->highest_prio = MAX_RT_PRIO; -#endif -#ifdef CONFIG_SMP - if (rt_se->nr_cpus_allowed > 1) { - struct rq *rq = rq_of_rt_rq(rt_rq); - rq->rt.rt_nr_migratory--; - } + rt_rq->highest_prio.curr = MAX_RT_PRIO; - if (rt_rq->highest_prio != highest_prio) { - struct rq *rq = rq_of_rt_rq(rt_rq); + dec_rt_prio_smp(rt_rq, prio, prev_prio); +} - if (rq->online) - cpupri_set(&rq->rd->cpupri, rq->cpu, - rt_rq->highest_prio); - } +#else + +static inline void inc_rt_prio(struct rt_rq *rt_rq, int prio) {} +static inline void dec_rt_prio(struct rt_rq *rt_rq, int prio) {} + +#endif /* CONFIG_SMP || CONFIG_RT_GROUP_SCHED */ - update_rt_migration(rq_of_rt_rq(rt_rq)); -#endif /* CONFIG_SMP */ #ifdef CONFIG_RT_GROUP_SCHED + +static void +inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) +{ + if (rt_se_boosted(rt_se)) + rt_rq->rt_nr_boosted++; + + if (rt_rq->tg) + start_rt_bandwidth(&rt_rq->tg->rt_bandwidth); +} + +static void +dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) +{ if (rt_se_boosted(rt_se)) rt_rq->rt_nr_boosted--; WARN_ON(!rt_rq->rt_nr_running && rt_rq->rt_nr_boosted); -#endif +} + +#else /* CONFIG_RT_GROUP_SCHED */ + +static void +inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) +{ + start_rt_bandwidth(&def_rt_bandwidth); +} + +static inline +void dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) {} + +#endif /* CONFIG_RT_GROUP_SCHED */ + +static inline +void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) +{ + int prio = rt_se_prio(rt_se); + + WARN_ON(!rt_prio(prio)); + rt_rq->rt_nr_running++; + + inc_rt_prio(rt_rq, prio); + inc_rt_migration(rt_se, rt_rq); + inc_rt_group(rt_se, rt_rq); +} + +static inline +void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) +{ + WARN_ON(!rt_prio(rt_se_prio(rt_se))); + WARN_ON(!rt_rq->rt_nr_running); + rt_rq->rt_nr_running--; + + dec_rt_prio(rt_rq, rt_se_prio(rt_se)); + dec_rt_migration(rt_se, rt_rq); + dec_rt_group(rt_se, rt_rq); } static void __enqueue_rt_entity(struct sched_rt_entity *rt_se) @@ -706,6 +844,55 @@ static void dequeue_rt_entity(struct sch } } +static inline void incr_rt_nr_uninterruptible(struct task_struct *p, + struct rq *rq) +{ + rq->rt.rt_nr_uninterruptible++; +} + +static inline void decr_rt_nr_uninterruptible(struct task_struct *p, + struct rq *rq) +{ + rq->rt.rt_nr_uninterruptible--; +} + +unsigned long rt_nr_running(void) +{ + unsigned long i, sum = 0; + + for_each_online_cpu(i) + sum += cpu_rq(i)->rt.rt_nr_running; + + return sum; +} + +unsigned long rt_nr_running_cpu(int cpu) +{ + return cpu_rq(cpu)->rt.rt_nr_running; +} + +unsigned long rt_nr_uninterruptible(void) +{ + unsigned long i, sum = 0; + + for_each_online_cpu(i) + sum += cpu_rq(i)->rt.rt_nr_uninterruptible; + + /* + * Since we read the counters lockless, it might be slightly + * inaccurate. Do not allow it to go below zero though: + */ + if (unlikely((long)sum < 0)) + sum = 0; + + return sum; +} + +unsigned long rt_nr_uninterruptible_cpu(int cpu) +{ + return cpu_rq(cpu)->rt.rt_nr_uninterruptible; +} + /* * Adding/removing a task to/from a priority array: */ @@ -718,6 +905,12 @@ static void enqueue_task_rt(struct rq *r enqueue_rt_entity(rt_se); + if (p->state == TASK_UNINTERRUPTIBLE) + decr_rt_nr_uninterruptible(p, rq); + + if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1) + enqueue_pushable_task(rq, p); + inc_cpu_load(rq, p->se.load.weight); } @@ -726,8 +919,14 @@ static void dequeue_task_rt(struct rq *r struct sched_rt_entity *rt_se = &p->rt; update_curr_rt(rq); + + if (p->state == TASK_UNINTERRUPTIBLE) + incr_rt_nr_uninterruptible(p, rq); + dequeue_rt_entity(rt_se); + dequeue_pushable_task(rq, p); + dec_cpu_load(rq, p->se.load.weight); } @@ -878,7 +1077,7 @@ static struct sched_rt_entity *pick_next return next; } -static struct task_struct *pick_next_task_rt(struct rq *rq) +static struct task_struct *_pick_next_task_rt(struct rq *rq) { struct sched_rt_entity *rt_se; struct task_struct *p; @@ -900,6 +1099,18 @@ static struct task_struct *pick_next_tas p = rt_task_of(rt_se); p->se.exec_start = rq->clock; + + return p; +} + +static struct task_struct *pick_next_task_rt(struct rq *rq) +{ + struct task_struct *p = _pick_next_task_rt(rq); + + /* The running task is never eligible for pushing */ + if (p) + dequeue_pushable_task(rq, p); + return p; } @@ -907,6 +1118,13 @@ static void put_prev_task_rt(struct rq * { update_curr_rt(rq); p->se.exec_start = 0; + + /* + * The previous task needs to be made eligible for pushing + * if it is still active + */ + if (p->se.on_rq && p->rt.nr_cpus_allowed > 1) + enqueue_pushable_task(rq, p); } #ifdef CONFIG_SMP @@ -960,12 +1178,13 @@ static struct task_struct *pick_next_hig static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask); -static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask) +static inline int pick_optimal_cpu(int this_cpu, + const struct cpumask *mask) { int first; /* "this_cpu" is cheaper to preempt than a remote processor */ - if ((this_cpu != -1) && cpu_isset(this_cpu, *mask)) + if ((this_cpu != -1) && cpumask_test_cpu(this_cpu, mask)) return this_cpu; first = cpumask_first(mask); @@ -981,6 +1200,7 @@ static int find_lowest_rq(struct task_st struct cpumask *lowest_mask = __get_cpu_var(local_cpu_mask); int this_cpu = smp_processor_id(); int cpu = task_cpu(task); + cpumask_var_t domain_mask; if (task->rt.nr_cpus_allowed == 1) return -1; /* No other targets possible */ @@ -1013,19 +1233,25 @@ static int find_lowest_rq(struct task_st if (this_cpu == cpu) this_cpu = -1; /* Skip this_cpu opt if the same */ - for_each_domain(cpu, sd) { - if (sd->flags & SD_WAKE_AFFINE) { - cpumask_t domain_mask; - int best_cpu; - - cpumask_and(&domain_mask, sched_domain_span(sd), - lowest_mask); - - best_cpu = pick_optimal_cpu(this_cpu, - &domain_mask); - if (best_cpu != -1) - return best_cpu; + if (alloc_cpumask_var(&domain_mask, GFP_ATOMIC)) { + for_each_domain(cpu, sd) { + if (sd->flags & SD_WAKE_AFFINE) { + int best_cpu; + + cpumask_and(domain_mask, + sched_domain_span(sd), + lowest_mask); + + best_cpu = pick_optimal_cpu(this_cpu, + domain_mask); + + if (best_cpu != -1) { + free_cpumask_var(domain_mask); + return best_cpu; + } + } } + free_cpumask_var(domain_mask); } /* @@ -1072,7 +1298,7 @@ static struct rq *find_lock_lowest_rq(st } /* If this rq is still suitable use it. */ - if (lowest_rq->rt.highest_prio > task->prio) + if (lowest_rq->rt.highest_prio.curr > task->prio) break; /* try again */ @@ -1083,6 +1309,31 @@ static struct rq *find_lock_lowest_rq(st return lowest_rq; } +static inline int has_pushable_tasks(struct rq *rq) +{ + return !plist_head_empty(&rq->rt.pushable_tasks); +} + +static struct task_struct *pick_next_pushable_task(struct rq *rq) +{ + struct task_struct *p; + + if (!has_pushable_tasks(rq)) + return NULL; + + p = plist_first_entry(&rq->rt.pushable_tasks, + struct task_struct, pushable_tasks); + + BUG_ON(rq->cpu != task_cpu(p)); + BUG_ON(task_current(rq, p)); + BUG_ON(p->rt.nr_cpus_allowed <= 1); + + BUG_ON(!p->se.on_rq); + BUG_ON(!rt_task(p)); + + return p; +} + /* * If the current CPU has more than one RT task, see if the non * running task can migrate over to a CPU that is running a task @@ -1092,13 +1343,11 @@ static int push_rt_task(struct rq *rq) { struct task_struct *next_task; struct rq *lowest_rq; - int ret = 0; - int paranoid = RT_MAX_TRIES; if (!rq->rt.overloaded) return 0; - next_task = pick_next_highest_task_rt(rq, -1); + next_task = pick_next_pushable_task(rq); if (!next_task) return 0; @@ -1127,16 +1376,34 @@ static int push_rt_task(struct rq *rq) struct task_struct *task; /* * find lock_lowest_rq releases rq->lock - * so it is possible that next_task has changed. - * If it has, then try again. + * so it is possible that next_task has migrated. + * + * We need to make sure that the task is still on the same + * run-queue and is also still the next task eligible for + * pushing. */ - task = pick_next_highest_task_rt(rq, -1); - if (unlikely(task != next_task) && task && paranoid--) { - put_task_struct(next_task); - next_task = task; - goto retry; + task = pick_next_pushable_task(rq); + if (task_cpu(next_task) == rq->cpu && task == next_task) { + /* + * If we get here, the task hasnt moved at all, but + * it has failed to push. We will not try again, + * since the other cpus will pull from us when they + * are ready. + */ + dequeue_pushable_task(rq, next_task); + goto out; } - goto out; + + if (!task) + /* No more tasks, just exit */ + goto out; + + /* + * Something has shifted, try again. + */ + put_task_struct(next_task); + next_task = task; + goto retry; } deactivate_task(rq, next_task, 0); @@ -1147,23 +1414,12 @@ static int push_rt_task(struct rq *rq) double_unlock_balance(rq, lowest_rq); - ret = 1; out: put_task_struct(next_task); - return ret; + return 1; } -/* - * TODO: Currently we just use the second highest prio task on - * the queue, and stop when it can't migrate (or there's - * no more RT tasks). There may be a case where a lower - * priority RT task has a different affinity than the - * higher RT task. In this case the lower RT task could - * possibly be able to migrate where as the higher priority - * RT task could not. We currently ignore this issue. - * Enhancements are welcome! - */ static void push_rt_tasks(struct rq *rq) { /* push_rt_task will return true if it moved an RT */ @@ -1174,33 +1430,35 @@ static void push_rt_tasks(struct rq *rq) static int pull_rt_task(struct rq *this_rq) { int this_cpu = this_rq->cpu, ret = 0, cpu; - struct task_struct *p, *next; + struct task_struct *p; struct rq *src_rq; if (likely(!rt_overloaded(this_rq))) return 0; - next = pick_next_task_rt(this_rq); - for_each_cpu(cpu, this_rq->rd->rto_mask) { if (this_cpu == cpu) continue; src_rq = cpu_rq(cpu); + + /* + * Don't bother taking the src_rq->lock if the next highest + * task is known to be lower-priority than our current task. + * This may look racy, but if this value is about to go + * logically higher, the src_rq will push this task away. + * And if its going logically lower, we do not care + */ + if (src_rq->rt.highest_prio.next >= + this_rq->rt.highest_prio.curr) + continue; + /* * We can potentially drop this_rq's lock in * double_lock_balance, and another CPU could - * steal our next task - hence we must cause - * the caller to recalculate the next task - * in that case: + * alter this_rq */ - if (double_lock_balance(this_rq, src_rq)) { - struct task_struct *old_next = next; - - next = pick_next_task_rt(this_rq); - if (next != old_next) - ret = 1; - } + double_lock_balance(this_rq, src_rq); /* * Are there still pullable RT tasks? @@ -1214,7 +1472,7 @@ static int pull_rt_task(struct rq *this_ * Do we have an RT task that preempts * the to-be-scheduled task? */ - if (p && (!next || (p->prio < next->prio))) { + if (p && (p->prio < this_rq->rt.highest_prio.curr)) { WARN_ON(p == src_rq->curr); WARN_ON(!p->se.on_rq); @@ -1224,12 +1482,9 @@ static int pull_rt_task(struct rq *this_ * This is just that p is wakeing up and hasn't * had a chance to schedule. We only pull * p if it is lower in priority than the - * current task on the run queue or - * this_rq next task is lower in prio than - * the current task on that rq. + * current task on the run queue */ - if (p->prio < src_rq->curr->prio || - (next && next->prio < src_rq->curr->prio)) + if (p->prio < src_rq->curr->prio) goto skip; ret = 1; @@ -1242,13 +1497,7 @@ static int pull_rt_task(struct rq *this_ * case there's an even higher prio task * in another runqueue. (low likelyhood * but possible) - * - * Update next so that we won't pick a task - * on another cpu with a priority lower (or equal) - * than the one we just picked. */ - next = p; - } skip: double_unlock_balance(this_rq, src_rq); @@ -1260,24 +1509,29 @@ static int pull_rt_task(struct rq *this_ static void pre_schedule_rt(struct rq *rq, struct task_struct *prev) { /* Try to pull RT tasks here if we lower this rq's prio */ - if (unlikely(rt_task(prev)) && rq->rt.highest_prio > prev->prio) + if (unlikely(rt_task(prev)) && rq->rt.highest_prio.curr > prev->prio) { pull_rt_task(rq); + schedstat_inc(rq, rto_schedule); + } +} + +/* + * assumes rq->lock is held + */ +static int needs_post_schedule_rt(struct rq *rq) +{ + return has_pushable_tasks(rq); } static void post_schedule_rt(struct rq *rq) { /* - * If we have more than one rt_task queued, then - * see if we can push the other rt_tasks off to other CPUS. - * Note we may release the rq lock, and since - * the lock was owned by prev, we need to release it - * first via finish_lock_switch and then reaquire it here. + * This is only called if needs_post_schedule_rt() indicates that + * we need to push tasks away */ - if (unlikely(rq->rt.overloaded)) { - spin_lock_irq(&rq->lock); - push_rt_tasks(rq); - spin_unlock_irq(&rq->lock); - } + spin_lock_irq(&rq->lock); + push_rt_tasks(rq); + spin_unlock_irq(&rq->lock); } /* @@ -1288,7 +1542,8 @@ static void task_wake_up_rt(struct rq *r { if (!task_running(rq, p) && !test_tsk_need_resched(rq->curr) && - rq->rt.overloaded) + has_pushable_tasks(rq) && + p->rt.nr_cpus_allowed > 1) push_rt_tasks(rq); } @@ -1324,6 +1579,23 @@ static void set_cpus_allowed_rt(struct t if (p->se.on_rq && (weight != p->rt.nr_cpus_allowed)) { struct rq *rq = task_rq(p); + if (!task_current(rq, p)) { + /* + * Make sure we dequeue this task from the pushable list + * before going further. It will either remain off of + * the list because we are no longer pushable, or it + * will be requeued. + */ + if (p->rt.nr_cpus_allowed > 1) + dequeue_pushable_task(rq, p); + + /* + * Requeue if our weight is changing and still > 1 + */ + if (weight > 1) + enqueue_pushable_task(rq, p); + } + if ((p->rt.nr_cpus_allowed <= 1) && (weight > 1)) { rq->rt.rt_nr_migratory++; } else if ((p->rt.nr_cpus_allowed > 1) && (weight <= 1)) { @@ -1331,7 +1603,7 @@ static void set_cpus_allowed_rt(struct t rq->rt.rt_nr_migratory--; } - update_rt_migration(rq); + update_rt_migration(&rq->rt); } cpumask_copy(&p->cpus_allowed, new_mask); @@ -1346,7 +1618,7 @@ static void rq_online_rt(struct rq *rq) __enable_runtime(rq); - cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio); + cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio.curr); } /* Assumes rq->lock is held */ @@ -1438,7 +1710,7 @@ static void prio_changed_rt(struct rq *r * can release the rq lock and p could migrate. * Only reschedule if p is still on the same runqueue. */ - if (p->prio > rq->rt.highest_prio && rq->curr == p) + if (p->prio > rq->rt.highest_prio.curr && rq->curr == p) resched_task(p); #else /* For UP simply resched on drop of prio */ @@ -1509,6 +1781,9 @@ static void set_curr_task_rt(struct rq * struct task_struct *p = rq->curr; p->se.exec_start = rq->clock; + + /* The running task is never eligible for pushing */ + dequeue_pushable_task(rq, p); } static const struct sched_class rt_sched_class = { @@ -1531,6 +1806,7 @@ static const struct sched_class rt_sched .rq_online = rq_online_rt, .rq_offline = rq_offline_rt, .pre_schedule = pre_schedule_rt, + .needs_post_schedule = needs_post_schedule_rt, .post_schedule = post_schedule_rt, .task_wake_up = task_wake_up_rt, .switched_from = switched_from_rt, Index: linux-2.6-tip/kernel/softirq.c =================================================================== --- linux-2.6-tip.orig/kernel/softirq.c +++ linux-2.6-tip/kernel/softirq.c @@ -8,19 +8,28 @@ * Rewritten. Old one was good in 2.2, but in 2.3 it was immoral. --ANK (990903) * * Remote softirq infrastructure is by Jens Axboe. + * + * Softirq-split implemetation by + * Copyright (C) 2005 Thomas Gleixner, Ingo Molnar */ #include +#include +#include +#include #include #include #include +#include #include #include #include +#include #include #include #include #include +#include #include #include @@ -50,7 +59,41 @@ EXPORT_SYMBOL(irq_stat); static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp; -static DEFINE_PER_CPU(struct task_struct *, ksoftirqd); +struct softirqdata { + int nr; + unsigned long cpu; + struct task_struct *tsk; +#ifdef CONFIG_PREEMPT_SOFTIRQS + wait_queue_head_t wait; + int running; +#endif +}; + +static DEFINE_PER_CPU(struct softirqdata [MAX_SOFTIRQ], ksoftirqd); + +#ifdef CONFIG_PREEMPT_SOFTIRQS +/* + * Preempting the softirq causes cases that would not be a + * problem when the softirq is not preempted. That is a + * process may have code to spin while waiting for a softirq + * to finish on another CPU. But if it happens that the + * process has preempted the softirq, this could cause a + * deadlock. + */ +void wait_for_softirq(int softirq) +{ + struct softirqdata *data = &__get_cpu_var(ksoftirqd)[softirq]; + if (data->running) { + DECLARE_WAITQUEUE(wait, current); + set_current_state(TASK_UNINTERRUPTIBLE); + add_wait_queue(&data->wait, &wait); + if (data->running) + schedule(); + remove_wait_queue(&data->wait, &wait); + __set_current_state(TASK_RUNNING); + } +} +#endif /* * we cannot loop indefinitely here to avoid userspace starvation, @@ -58,16 +101,34 @@ static DEFINE_PER_CPU(struct task_struct * to the pending events, so lets the scheduler to balance * the softirq load for us. */ -static inline void wakeup_softirqd(void) +static void wakeup_softirqd(int softirq) { /* Interrupts are disabled: no need to stop preemption */ - struct task_struct *tsk = __get_cpu_var(ksoftirqd); + struct task_struct *tsk = __get_cpu_var(ksoftirqd)[softirq].tsk; if (tsk && tsk->state != TASK_RUNNING) wake_up_process(tsk); } /* + * Wake up the softirq threads which have work + */ +static void trigger_softirqs(void) +{ + u32 pending = local_softirq_pending(); + int curr = 0; + + while (pending) { + if (pending & 1) + wakeup_softirqd(curr); + pending >>= 1; + curr++; + } +} + +#ifndef CONFIG_PREEMPT_HARDIRQS + +/* * This one is for softirq.c-internal use, * where hardirqs are disabled legitimately: */ @@ -79,13 +140,23 @@ static void __local_bh_disable(unsigned WARN_ON_ONCE(in_irq()); raw_local_irq_save(flags); - add_preempt_count(SOFTIRQ_OFFSET); + /* + * The preempt tracer hooks into add_preempt_count and will break + * lockdep because it calls back into lockdep after SOFTIRQ_OFFSET + * is set and before current->softirq_enabled is cleared. + * We must manually increment preempt_count here and manually + * call the trace_preempt_off later. + */ + preempt_count() += SOFTIRQ_OFFSET; /* * Were softirqs turned off above: */ if (softirq_count() == SOFTIRQ_OFFSET) trace_softirqs_off(ip); raw_local_irq_restore(flags); + + if (preempt_count() == SOFTIRQ_OFFSET) + trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1)); } #else /* !CONFIG_TRACE_IRQFLAGS */ static inline void __local_bh_disable(unsigned long ip) @@ -109,7 +180,6 @@ EXPORT_SYMBOL(local_bh_disable); */ void _local_bh_enable(void) { - WARN_ON_ONCE(in_irq()); WARN_ON_ONCE(!irqs_disabled()); if (softirq_count() == SOFTIRQ_OFFSET) @@ -119,17 +189,22 @@ void _local_bh_enable(void) EXPORT_SYMBOL(_local_bh_enable); -static inline void _local_bh_enable_ip(unsigned long ip) +void local_bh_enable(void) { - WARN_ON_ONCE(in_irq() || irqs_disabled()); #ifdef CONFIG_TRACE_IRQFLAGS - local_irq_disable(); + unsigned long flags; + + WARN_ON_ONCE(in_irq()); +#endif + +#ifdef CONFIG_TRACE_IRQFLAGS + local_irq_save(flags); #endif /* * Are softirqs going to be turned on now: */ if (softirq_count() == SOFTIRQ_OFFSET) - trace_softirqs_on(ip); + trace_softirqs_on((unsigned long)__builtin_return_address(0)); /* * Keep preemption disabled until we are done with * softirq processing: @@ -141,23 +216,45 @@ static inline void _local_bh_enable_ip(u dec_preempt_count(); #ifdef CONFIG_TRACE_IRQFLAGS - local_irq_enable(); + local_irq_restore(flags); #endif preempt_check_resched(); } - -void local_bh_enable(void) -{ - _local_bh_enable_ip((unsigned long)__builtin_return_address(0)); -} EXPORT_SYMBOL(local_bh_enable); void local_bh_enable_ip(unsigned long ip) { - _local_bh_enable_ip(ip); +#ifdef CONFIG_TRACE_IRQFLAGS + unsigned long flags; + + WARN_ON_ONCE(in_irq()); + + local_irq_save(flags); +#endif + /* + * Are softirqs going to be turned on now: + */ + if (softirq_count() == SOFTIRQ_OFFSET) + trace_softirqs_on(ip); + /* + * Keep preemption disabled until we are done with + * softirq processing: + */ + sub_preempt_count(SOFTIRQ_OFFSET - 1); + + if (unlikely(!in_interrupt() && local_softirq_pending())) + do_softirq(); + + dec_preempt_count(); +#ifdef CONFIG_TRACE_IRQFLAGS + local_irq_restore(flags); +#endif + preempt_check_resched(); } EXPORT_SYMBOL(local_bh_enable_ip); +#endif + /* * We restart softirq processing MAX_SOFTIRQ_RESTART times, * and we fall back to softirqd after that. @@ -167,63 +264,143 @@ EXPORT_SYMBOL(local_bh_enable_ip); * we want to handle softirqs as soon as possible, but they * should not be able to lock up the box. */ -#define MAX_SOFTIRQ_RESTART 10 +#define MAX_SOFTIRQ_RESTART 20 -asmlinkage void __do_softirq(void) +static DEFINE_PER_CPU(u32, softirq_running); + +/* + * Debug check for leaking preempt counts in h->action handlers: + */ + +static inline void debug_check_preempt_count_start(__u32 *preempt_count) { - struct softirq_action *h; - __u32 pending; +#ifdef CONFIG_DEBUG_PREEMPT + *preempt_count = preempt_count(); +#endif +} + +static inline void + debug_check_preempt_count_stop(__u32 *preempt_count, struct softirq_action *h) +{ +#ifdef CONFIG_DEBUG_PREEMPT + if (*preempt_count == preempt_count()) + return; + + print_symbol("BUG: %Ps exited with wrong preemption count!\n", + (unsigned long)h->action); + printk("=> enter: %08x, exit: %08x.\n", *preempt_count, preempt_count()); + preempt_count() = *preempt_count; +#endif +} + +/* + * Execute softirq handlers: + */ +static void ___do_softirq(const int same_prio_only) +{ + __u32 pending, available_mask, same_prio_skipped, preempt_count; int max_restart = MAX_SOFTIRQ_RESTART; - int cpu; + struct softirq_action *h; + int cpu, softirq; pending = local_softirq_pending(); account_system_vtime(current); - __local_bh_disable((unsigned long)__builtin_return_address(0)); - trace_softirq_enter(); - cpu = smp_processor_id(); restart: + available_mask = -1; + softirq = 0; + same_prio_skipped = 0; /* Reset the pending bitmask before enabling irqs */ set_softirq_pending(0); - local_irq_enable(); - h = softirq_vec; do { - if (pending & 1) { - int prev_count = preempt_count(); + u32 softirq_mask = 1 << softirq; - h->action(h); + if (!(pending & 1)) + goto next; - if (unlikely(prev_count != preempt_count())) { - printk(KERN_ERR "huh, entered softirq %td %p" - "with preempt_count %08x," - " exited with %08x?\n", h - softirq_vec, - h->action, prev_count, preempt_count()); - preempt_count() = prev_count; - } + debug_check_preempt_count_start(&preempt_count); - rcu_bh_qsctr_inc(cpu); +#if defined(CONFIG_PREEMPT_SOFTIRQS) && defined(CONFIG_PREEMPT_HARDIRQS) + /* + * If executed by a same-prio hardirq thread + * then skip pending softirqs that belong + * to softirq threads with different priority: + */ + if (same_prio_only) { + struct task_struct *tsk; + + tsk = __get_cpu_var(ksoftirqd)[softirq].tsk; + if (tsk && tsk->normal_prio != current->normal_prio) { + same_prio_skipped |= softirq_mask; + available_mask &= ~softirq_mask; + goto next; + } + } +#endif + /* + * Is this softirq already being processed? + */ + if (per_cpu(softirq_running, cpu) & softirq_mask) { + available_mask &= ~softirq_mask; + goto next; } + per_cpu(softirq_running, cpu) |= softirq_mask; + local_irq_enable(); + + h->action(h); + + debug_check_preempt_count_stop(&preempt_count, h); + + rcu_bh_qsctr_inc(cpu); + cond_resched_softirq_context(); + local_irq_disable(); + per_cpu(softirq_running, cpu) &= ~softirq_mask; +next: h++; + softirq++; pending >>= 1; } while (pending); - local_irq_disable(); - + or_softirq_pending(same_prio_skipped); pending = local_softirq_pending(); - if (pending && --max_restart) - goto restart; + if (pending & available_mask) { + if (--max_restart) + goto restart; + } if (pending) - wakeup_softirqd(); + trigger_softirqs(); +} + +asmlinkage void __do_softirq(void) +{ +#ifdef CONFIG_PREEMPT_SOFTIRQS + /* + * 'preempt harder'. Push all softirq processing off to ksoftirqd. + */ + if (softirq_preemption) { + if (local_softirq_pending()) + trigger_softirqs(); + return; + } +#endif + /* + * 'immediate' softirq execution: + */ + __local_bh_disable((unsigned long)__builtin_return_address(0)); + trace_softirq_enter(); + + ___do_softirq(0); trace_softirq_exit(); account_system_vtime(current); _local_bh_enable(); + } #ifndef __ARCH_HAS_DO_SOFTIRQ @@ -286,7 +463,7 @@ void irq_exit(void) if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched()) tick_nohz_stop_sched_tick(0); #endif - preempt_enable_no_resched(); + __preempt_enable_no_resched(); } /* @@ -294,19 +471,11 @@ void irq_exit(void) */ inline void raise_softirq_irqoff(unsigned int nr) { - __raise_softirq_irqoff(nr); + __do_raise_softirq_irqoff(nr); - /* - * If we're in an interrupt or softirq, we're done - * (this also catches softirq-disabled code). We will - * actually run the softirq once we return from - * the irq or softirq. - * - * Otherwise we wake up ksoftirqd to make sure we - * schedule the softirq soon. - */ - if (!in_interrupt()) - wakeup_softirqd(); +#ifdef CONFIG_PREEMPT_SOFTIRQS + wakeup_softirqd(nr); +#endif } void raise_softirq(unsigned int nr) @@ -333,15 +502,45 @@ struct tasklet_head static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec); static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec); +static void inline +__tasklet_common_schedule(struct tasklet_struct *t, struct tasklet_head *head, unsigned int nr) +{ + if (tasklet_trylock(t)) { +again: + /* We may have been preempted before tasklet_trylock + * and __tasklet_action may have already run. + * So double check the sched bit while the takslet + * is locked before adding it to the list. + */ + if (test_bit(TASKLET_STATE_SCHED, &t->state)) { + t->next = NULL; + *head->tail = t; + head->tail = &(t->next); + raise_softirq_irqoff(nr); + tasklet_unlock(t); + } else { + /* This is subtle. If we hit the corner case above + * It is possible that we get preempted right here, + * and another task has successfully called + * tasklet_schedule(), then this function, and + * failed on the trylock. Thus we must be sure + * before releasing the tasklet lock, that the + * SCHED_BIT is clear. Otherwise the tasklet + * may get its SCHED_BIT set, but not added to the + * list + */ + if (!tasklet_tryunlock(t)) + goto again; + } + } +} + void __tasklet_schedule(struct tasklet_struct *t) { unsigned long flags; local_irq_save(flags); - t->next = NULL; - *__get_cpu_var(tasklet_vec).tail = t; - __get_cpu_var(tasklet_vec).tail = &(t->next); - raise_softirq_irqoff(TASKLET_SOFTIRQ); + __tasklet_common_schedule(t, &__get_cpu_var(tasklet_vec), TASKLET_SOFTIRQ); local_irq_restore(flags); } @@ -352,50 +551,127 @@ void __tasklet_hi_schedule(struct taskle unsigned long flags; local_irq_save(flags); - t->next = NULL; - *__get_cpu_var(tasklet_hi_vec).tail = t; - __get_cpu_var(tasklet_hi_vec).tail = &(t->next); - raise_softirq_irqoff(HI_SOFTIRQ); + __tasklet_common_schedule(t, &__get_cpu_var(tasklet_hi_vec), HI_SOFTIRQ); local_irq_restore(flags); } EXPORT_SYMBOL(__tasklet_hi_schedule); -static void tasklet_action(struct softirq_action *a) +void __tasklet_hi_schedule_first(struct tasklet_struct *t) { - struct tasklet_struct *list; + __tasklet_hi_schedule(t); +} - local_irq_disable(); - list = __get_cpu_var(tasklet_vec).head; - __get_cpu_var(tasklet_vec).head = NULL; - __get_cpu_var(tasklet_vec).tail = &__get_cpu_var(tasklet_vec).head; - local_irq_enable(); +EXPORT_SYMBOL(__tasklet_hi_schedule_first); + +void tasklet_enable(struct tasklet_struct *t) +{ + if (!atomic_dec_and_test(&t->count)) + return; + if (test_and_clear_bit(TASKLET_STATE_PENDING, &t->state)) + tasklet_schedule(t); +} + +EXPORT_SYMBOL(tasklet_enable); + +void tasklet_hi_enable(struct tasklet_struct *t) +{ + if (!atomic_dec_and_test(&t->count)) + return; + if (test_and_clear_bit(TASKLET_STATE_PENDING, &t->state)) + tasklet_hi_schedule(t); +} + +EXPORT_SYMBOL(tasklet_hi_enable); + +static void +__tasklet_action(struct softirq_action *a, struct tasklet_struct *list) +{ + int loops = 1000000; while (list) { struct tasklet_struct *t = list; list = list->next; - if (tasklet_trylock(t)) { - if (!atomic_read(&t->count)) { - if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state)) - BUG(); - t->func(t->data); - tasklet_unlock(t); - continue; - } - tasklet_unlock(t); + /* + * Should always succeed - after a tasklist got on the + * list (after getting the SCHED bit set from 0 to 1), + * nothing but the tasklet softirq it got queued to can + * lock it: + */ + if (!tasklet_trylock(t)) { + WARN_ON(1); + continue; } - local_irq_disable(); t->next = NULL; - *__get_cpu_var(tasklet_vec).tail = t; - __get_cpu_var(tasklet_vec).tail = &(t->next); - __raise_softirq_irqoff(TASKLET_SOFTIRQ); - local_irq_enable(); + + /* + * If we cannot handle the tasklet because it's disabled, + * mark it as pending. tasklet_enable() will later + * re-schedule the tasklet. + */ + if (unlikely(atomic_read(&t->count))) { +out_disabled: + /* implicit unlock: */ + wmb(); + t->state = TASKLET_STATEF_PENDING; + continue; + } + + /* + * After this point on the tasklet might be rescheduled + * on another CPU, but it can only be added to another + * CPU's tasklet list if we unlock the tasklet (which we + * dont do yet). + */ + if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state)) + WARN_ON(1); + +again: + t->func(t->data); + + /* + * Try to unlock the tasklet. We must use cmpxchg, because + * another CPU might have scheduled or disabled the tasklet. + * We only allow the STATE_RUN -> 0 transition here. + */ + while (!tasklet_tryunlock(t)) { + /* + * If it got disabled meanwhile, bail out: + */ + if (atomic_read(&t->count)) + goto out_disabled; + /* + * If it got scheduled meanwhile, re-execute + * the tasklet function: + */ + if (test_and_clear_bit(TASKLET_STATE_SCHED, &t->state)) + goto again; + if (!--loops) { + printk("hm, tasklet state: %08lx\n", t->state); + WARN_ON(1); + tasklet_unlock(t); + break; + } + } } } +static void tasklet_action(struct softirq_action *a) +{ + struct tasklet_struct *list; + + local_irq_disable(); + list = __get_cpu_var(tasklet_vec).head; + __get_cpu_var(tasklet_vec).head = NULL; + __get_cpu_var(tasklet_vec).tail = &__get_cpu_var(tasklet_vec).head; + local_irq_enable(); + + __tasklet_action(a, list); +} + static void tasklet_hi_action(struct softirq_action *a) { struct tasklet_struct *list; @@ -406,29 +682,7 @@ static void tasklet_hi_action(struct sof __get_cpu_var(tasklet_hi_vec).tail = &__get_cpu_var(tasklet_hi_vec).head; local_irq_enable(); - while (list) { - struct tasklet_struct *t = list; - - list = list->next; - - if (tasklet_trylock(t)) { - if (!atomic_read(&t->count)) { - if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state)) - BUG(); - t->func(t->data); - tasklet_unlock(t); - continue; - } - tasklet_unlock(t); - } - - local_irq_disable(); - t->next = NULL; - *__get_cpu_var(tasklet_hi_vec).tail = t; - __get_cpu_var(tasklet_hi_vec).tail = &(t->next); - __raise_softirq_irqoff(HI_SOFTIRQ); - local_irq_enable(); - } + __tasklet_action(a, list); } @@ -451,7 +705,7 @@ void tasklet_kill(struct tasklet_struct while (test_and_set_bit(TASKLET_STATE_SCHED, &t->state)) { do - yield(); + msleep(1); while (test_bit(TASKLET_STATE_SCHED, &t->state)); } tasklet_unlock_wait(t); @@ -496,7 +750,7 @@ static int __try_remote_softirq(struct c cp->flags = 0; cp->priv = softirq; - __smp_call_function_single(cpu, cp); + __smp_call_function_single(cpu, cp, 0); return 0; } return 1; @@ -602,33 +856,98 @@ void __init softirq_init(void) open_softirq(HI_SOFTIRQ, tasklet_hi_action); } -static int ksoftirqd(void * __bind_cpu) +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT) + +void tasklet_unlock_wait(struct tasklet_struct *t) { + while (test_bit(TASKLET_STATE_RUN, &(t)->state)) { + /* + * Hack for now to avoid this busy-loop: + */ +#ifdef CONFIG_PREEMPT_RT + msleep(1); +#else + barrier(); +#endif + } +} +EXPORT_SYMBOL(tasklet_unlock_wait); + +#endif + +static int ksoftirqd(void * __data) +{ + struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 }; + struct softirqdata *data = __data; + u32 softirq_mask = (1 << data->nr); + struct softirq_action *h; + int cpu = data->cpu; + +#ifdef CONFIG_PREEMPT_SOFTIRQS + init_waitqueue_head(&data->wait); +#endif + + sys_sched_setscheduler(current->pid, SCHED_FIFO, ¶m); + current->flags |= PF_SOFTIRQ; set_current_state(TASK_INTERRUPTIBLE); while (!kthread_should_stop()) { preempt_disable(); - if (!local_softirq_pending()) { - preempt_enable_no_resched(); + if (!(local_softirq_pending() & softirq_mask)) { +sleep_more: + __preempt_enable_no_resched(); schedule(); preempt_disable(); } __set_current_state(TASK_RUNNING); - while (local_softirq_pending()) { +#ifdef CONFIG_PREEMPT_SOFTIRQS + data->running = 1; +#endif + + while (local_softirq_pending() & softirq_mask) { /* Preempt disable stops cpu going offline. If already offline, we'll be on wrong CPU: don't process */ - if (cpu_is_offline((long)__bind_cpu)) + if (cpu_is_offline(cpu)) goto wait_to_die; - do_softirq(); - preempt_enable_no_resched(); + + local_irq_disable(); + /* + * Is the softirq already being executed by + * a hardirq context? + */ + if (per_cpu(softirq_running, cpu) & softirq_mask) { + local_irq_enable(); + set_current_state(TASK_INTERRUPTIBLE); + goto sleep_more; + } + per_cpu(softirq_running, cpu) |= softirq_mask; + __preempt_enable_no_resched(); + set_softirq_pending(local_softirq_pending() & ~softirq_mask); + local_bh_disable(); + local_irq_enable(); + + h = &softirq_vec[data->nr]; + if (h) + h->action(h); + rcu_bh_qsctr_inc(data->cpu); + + local_irq_disable(); + per_cpu(softirq_running, cpu) &= ~softirq_mask; + _local_bh_enable(); + local_irq_enable(); + cond_resched(); preempt_disable(); } preempt_enable(); set_current_state(TASK_INTERRUPTIBLE); +#ifdef CONFIG_PREEMPT_SOFTIRQS + data->running = 0; + wake_up(&data->wait); +#endif } __set_current_state(TASK_RUNNING); return 0; @@ -678,7 +997,7 @@ void tasklet_kill_immediate(struct taskl BUG(); } -static void takeover_tasklets(unsigned int cpu) +void takeover_tasklets(unsigned int cpu) { /* CPU is dead, so no lock needed. */ local_irq_disable(); @@ -704,49 +1023,83 @@ static void takeover_tasklets(unsigned i } #endif /* CONFIG_HOTPLUG_CPU */ +static const char *softirq_names [] = +{ + [HI_SOFTIRQ] = "high", + [SCHED_SOFTIRQ] = "sched", + [TIMER_SOFTIRQ] = "timer", + [NET_TX_SOFTIRQ] = "net-tx", + [NET_RX_SOFTIRQ] = "net-rx", + [BLOCK_SOFTIRQ] = "block", + [TASKLET_SOFTIRQ] = "tasklet", +#ifdef CONFIG_HIGH_RES_TIMERS + [HRTIMER_SOFTIRQ] = "hrtimer", +#endif + [RCU_SOFTIRQ] = "rcu", +}; + static int __cpuinit cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) { - int hotcpu = (unsigned long)hcpu; + int hotcpu = (unsigned long)hcpu, i; struct task_struct *p; switch (action) { case CPU_UP_PREPARE: case CPU_UP_PREPARE_FROZEN: - p = kthread_create(ksoftirqd, hcpu, "ksoftirqd/%d", hotcpu); - if (IS_ERR(p)) { - printk("ksoftirqd for %i failed\n", hotcpu); - return NOTIFY_BAD; - } - kthread_bind(p, hotcpu); - per_cpu(ksoftirqd, hotcpu) = p; - break; + for (i = 0; i < MAX_SOFTIRQ; i++) { + per_cpu(ksoftirqd, hotcpu)[i].nr = i; + per_cpu(ksoftirqd, hotcpu)[i].cpu = hotcpu; + per_cpu(ksoftirqd, hotcpu)[i].tsk = NULL; + } + for (i = 0; i < MAX_SOFTIRQ; i++) { + p = kthread_create(ksoftirqd, + &per_cpu(ksoftirqd, hotcpu)[i], + "sirq-%s/%d", softirq_names[i], + hotcpu); + if (IS_ERR(p)) { + printk("ksoftirqd %d for %i failed\n", i, + hotcpu); + return NOTIFY_BAD; + } + kthread_bind(p, hotcpu); + per_cpu(ksoftirqd, hotcpu)[i].tsk = p; + } + break; + break; case CPU_ONLINE: case CPU_ONLINE_FROZEN: - wake_up_process(per_cpu(ksoftirqd, hotcpu)); + for (i = 0; i < MAX_SOFTIRQ; i++) + wake_up_process(per_cpu(ksoftirqd, hotcpu)[i].tsk); break; #ifdef CONFIG_HOTPLUG_CPU case CPU_UP_CANCELED: case CPU_UP_CANCELED_FROZEN: - if (!per_cpu(ksoftirqd, hotcpu)) - break; - /* Unbind so it can run. Fall thru. */ - kthread_bind(per_cpu(ksoftirqd, hotcpu), - cpumask_any(cpu_online_mask)); +#if 0 + for (i = 0; i < MAX_SOFTIRQ; i++) { + if (!per_cpu(ksoftirqd, hotcpu)[i].tsk) + continue; + kthread_bind(per_cpu(ksoftirqd, hotcpu)[i].tsk, + any_online_cpu(cpu_online_map)); + } +#endif case CPU_DEAD: case CPU_DEAD_FROZEN: { - struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 }; + struct sched_param param; - p = per_cpu(ksoftirqd, hotcpu); - per_cpu(ksoftirqd, hotcpu) = NULL; - sched_setscheduler_nocheck(p, SCHED_FIFO, ¶m); - kthread_stop(p); + for (i = 0; i < MAX_SOFTIRQ; i++) { + param.sched_priority = MAX_RT_PRIO-1; + p = per_cpu(ksoftirqd, hotcpu)[i].tsk; + sched_setscheduler(p, SCHED_FIFO, ¶m); + per_cpu(ksoftirqd, hotcpu)[i].tsk = NULL; + kthread_stop(p); + } takeover_tasklets(hotcpu); break; - } -#endif /* CONFIG_HOTPLUG_CPU */ } +#endif /* CONFIG_HOTPLUG_CPU */ + } return NOTIFY_OK; } @@ -766,6 +1119,34 @@ static __init int spawn_ksoftirqd(void) } early_initcall(spawn_ksoftirqd); + +#ifdef CONFIG_PREEMPT_SOFTIRQS + +int softirq_preemption = 1; + +EXPORT_SYMBOL(softirq_preemption); + +/* + * Real-Time Preemption depends on softirq threading: + */ +#ifndef CONFIG_PREEMPT_RT + +static int __init softirq_preempt_setup (char *str) +{ + if (!strncmp(str, "off", 3)) + softirq_preemption = 0; + else + get_option(&str, &softirq_preemption); + if (!softirq_preemption) + printk("turning off softirq preemption!\n"); + + return 1; +} + +__setup("softirq-preempt=", softirq_preempt_setup); +#endif +#endif + #ifdef CONFIG_SMP /* * Call a function on all processors @@ -795,6 +1176,11 @@ int __init __weak early_irq_init(void) return 0; } +int __init __weak arch_probe_nr_irqs(void) +{ + return 0; +} + int __init __weak arch_early_irq_init(void) { return 0; Index: linux-2.6-tip/kernel/softlockup.c =================================================================== --- linux-2.6-tip.orig/kernel/softlockup.c +++ linux-2.6-tip/kernel/softlockup.c @@ -20,7 +20,7 @@ #include -static DEFINE_SPINLOCK(print_lock); +static DEFINE_RAW_SPINLOCK(print_lock); static DEFINE_PER_CPU(unsigned long, touch_timestamp); static DEFINE_PER_CPU(unsigned long, print_timestamp); @@ -166,97 +166,11 @@ void softlockup_tick(void) } /* - * Have a reasonable limit on the number of tasks checked: - */ -unsigned long __read_mostly sysctl_hung_task_check_count = 1024; - -/* - * Zero means infinite timeout - no checking done: - */ -unsigned long __read_mostly sysctl_hung_task_timeout_secs = 480; - -unsigned long __read_mostly sysctl_hung_task_warnings = 10; - -/* - * Only do the hung-tasks check on one CPU: - */ -static int check_cpu __read_mostly = -1; - -static void check_hung_task(struct task_struct *t, unsigned long now) -{ - unsigned long switch_count = t->nvcsw + t->nivcsw; - - if (t->flags & PF_FROZEN) - return; - - if (switch_count != t->last_switch_count || !t->last_switch_timestamp) { - t->last_switch_count = switch_count; - t->last_switch_timestamp = now; - return; - } - if ((long)(now - t->last_switch_timestamp) < - sysctl_hung_task_timeout_secs) - return; - if (!sysctl_hung_task_warnings) - return; - sysctl_hung_task_warnings--; - - /* - * Ok, the task did not get scheduled for more than 2 minutes, - * complain: - */ - printk(KERN_ERR "INFO: task %s:%d blocked for more than " - "%ld seconds.\n", t->comm, t->pid, - sysctl_hung_task_timeout_secs); - printk(KERN_ERR "\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\"" - " disables this message.\n"); - sched_show_task(t); - __debug_show_held_locks(t); - - t->last_switch_timestamp = now; - touch_nmi_watchdog(); - - if (softlockup_panic) - panic("softlockup: blocked tasks"); -} - -/* - * Check whether a TASK_UNINTERRUPTIBLE does not get woken up for - * a really long time (120 seconds). If that happens, print out - * a warning. - */ -static void check_hung_uninterruptible_tasks(int this_cpu) -{ - int max_count = sysctl_hung_task_check_count; - unsigned long now = get_timestamp(this_cpu); - struct task_struct *g, *t; - - /* - * If the system crashed already then all bets are off, - * do not report extra hung tasks: - */ - if (test_taint(TAINT_DIE) || did_panic) - return; - - read_lock(&tasklist_lock); - do_each_thread(g, t) { - if (!--max_count) - goto unlock; - /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */ - if (t->state == TASK_UNINTERRUPTIBLE) - check_hung_task(t, now); - } while_each_thread(g, t); - unlock: - read_unlock(&tasklist_lock); -} - -/* * The watchdog thread - runs every second and touches the timestamp. */ static int watchdog(void *__bind_cpu) { struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 }; - int this_cpu = (long)__bind_cpu; sched_setscheduler(current, SCHED_FIFO, ¶m); @@ -276,11 +190,6 @@ static int watchdog(void *__bind_cpu) if (kthread_should_stop()) break; - if (this_cpu == check_cpu) { - if (sysctl_hung_task_timeout_secs) - check_hung_uninterruptible_tasks(this_cpu); - } - set_current_state(TASK_INTERRUPTIBLE); } __set_current_state(TASK_RUNNING); @@ -312,18 +221,9 @@ cpu_callback(struct notifier_block *nfb, break; case CPU_ONLINE: case CPU_ONLINE_FROZEN: - check_cpu = cpumask_any(cpu_online_mask); wake_up_process(per_cpu(watchdog_task, hotcpu)); break; #ifdef CONFIG_HOTPLUG_CPU - case CPU_DOWN_PREPARE: - case CPU_DOWN_PREPARE_FROZEN: - if (hotcpu == check_cpu) { - /* Pick any other online cpu. */ - check_cpu = cpumask_any_but(cpu_online_mask, hotcpu); - } - break; - case CPU_UP_CANCELED: case CPU_UP_CANCELED_FROZEN: if (!per_cpu(watchdog_task, hotcpu)) Index: linux-2.6-tip/kernel/sys.c =================================================================== --- linux-2.6-tip.orig/kernel/sys.c +++ linux-2.6-tip/kernel/sys.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -32,11 +33,13 @@ #include #include #include +#include #include #include #include #include +#include #include #include @@ -278,6 +281,15 @@ out_unlock: */ void emergency_restart(void) { + /* + * Call the notifier chain if we are not in an + * atomic context: + */ +#ifdef CONFIG_PREEMPT + if (!in_atomic() && !irqs_disabled()) + blocking_notifier_call_chain(&reboot_notifier_list, + SYS_RESTART, NULL); +#endif machine_emergency_restart(); } EXPORT_SYMBOL_GPL(emergency_restart); @@ -1791,6 +1803,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsi case PR_SET_TSC: error = SET_TSC_CTL(arg2); break; + case PR_TASK_PERF_COUNTERS_DISABLE: + error = perf_counter_task_disable(); + break; + case PR_TASK_PERF_COUNTERS_ENABLE: + error = perf_counter_task_enable(); + break; case PR_GET_TIMERSLACK: error = current->timer_slack_ns; break; Index: linux-2.6-tip/kernel/sys_ni.c =================================================================== --- linux-2.6-tip.orig/kernel/sys_ni.c +++ linux-2.6-tip/kernel/sys_ni.c @@ -175,3 +175,6 @@ cond_syscall(compat_sys_timerfd_settime) cond_syscall(compat_sys_timerfd_gettime); cond_syscall(sys_eventfd); cond_syscall(sys_eventfd2); + +/* performance counters: */ +cond_syscall(sys_perf_counter_open); Index: linux-2.6-tip/kernel/sysctl.c =================================================================== --- linux-2.6-tip.orig/kernel/sysctl.c +++ linux-2.6-tip/kernel/sysctl.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -90,6 +91,9 @@ extern int rcutorture_runnable; #endif /* #ifdef CONFIG_RCU_TORTURE_TEST */ /* Constants used for minimum and maximum */ +#if defined(CONFIG_DETECT_HUNG_TASK) || defined(CONFIG_DETECT_SOFTLOCKUP) || defined(CONFIG_HIGHMEM) +static int one = 1; +#endif #ifdef CONFIG_DETECT_SOFTLOCKUP static int sixty = 60; static int neg_one = -1; @@ -100,7 +104,6 @@ static int two = 2; #endif static int zero; -static int one = 1; static unsigned long one_ul = 1; static int one_hundred = 100; @@ -815,6 +818,19 @@ static struct ctl_table kern_table[] = { .extra1 = &neg_one, .extra2 = &sixty, }, +#endif +#ifdef CONFIG_DETECT_HUNG_TASK + { + .ctl_name = CTL_UNNUMBERED, + .procname = "hung_task_panic", + .data = &sysctl_hung_task_panic, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec_minmax, + .strategy = &sysctl_intvec, + .extra1 = &zero, + .extra2 = &one, + }, { .ctl_name = CTL_UNNUMBERED, .procname = "hung_task_check_count", @@ -830,7 +846,7 @@ static struct ctl_table kern_table[] = { .data = &sysctl_hung_task_timeout_secs, .maxlen = sizeof(unsigned long), .mode = 0644, - .proc_handler = &proc_doulongvec_minmax, + .proc_handler = &proc_dohung_task_timeout_secs, .strategy = &sysctl_intvec, }, { @@ -890,6 +906,16 @@ static struct ctl_table kern_table[] = { .proc_handler = &proc_dointvec, }, #endif +#ifdef CONFIG_KMEMCHECK + { + .ctl_name = CTL_UNNUMBERED, + .procname = "kmemcheck", + .data = &kmemcheck_enabled, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, +#endif #ifdef CONFIG_UNEVICTABLE_LRU { .ctl_name = CTL_UNNUMBERED, Index: linux-2.6-tip/kernel/time/clockevents.c =================================================================== --- linux-2.6-tip.orig/kernel/time/clockevents.c +++ linux-2.6-tip/kernel/time/clockevents.c @@ -27,7 +27,7 @@ static LIST_HEAD(clockevents_released); static RAW_NOTIFIER_HEAD(clockevents_chain); /* Protection for the above */ -static DEFINE_SPINLOCK(clockevents_lock); +static DEFINE_RAW_SPINLOCK(clockevents_lock); /** * clockevents_delta2ns - Convert a latch value (device ticks) to nanoseconds @@ -68,6 +68,17 @@ void clockevents_set_mode(struct clock_e if (dev->mode != mode) { dev->set_mode(mode, dev); dev->mode = mode; + + /* + * A nsec2cyc multiplicator of 0 is invalid and we'd crash + * on it, so fix it up and emit a warning: + */ + if (mode == CLOCK_EVT_MODE_ONESHOT) { + if (unlikely(!dev->mult)) { + dev->mult = 1; + WARN_ON(1); + } + } } } @@ -168,15 +179,6 @@ void clockevents_register_device(struct BUG_ON(dev->mode != CLOCK_EVT_MODE_UNUSED); BUG_ON(!dev->cpumask); - /* - * A nsec2cyc multiplicator of 0 is invalid and we'd crash - * on it, so fix it up and emit a warning: - */ - if (unlikely(!dev->mult)) { - dev->mult = 1; - WARN_ON(1); - } - spin_lock(&clockevents_lock); list_add(&dev->list, &clockevent_devices); Index: linux-2.6-tip/kernel/time/ntp.c =================================================================== --- linux-2.6-tip.orig/kernel/time/ntp.c +++ linux-2.6-tip/kernel/time/ntp.c @@ -51,6 +51,7 @@ static long ntp_tick_adj; static void ntp_update_frequency(void) { + u64 old_tick_length_base = tick_length_base; u64 second_length = (u64)(tick_usec * NSEC_PER_USEC * USER_HZ) << NTP_SCALE_SHIFT; second_length += (s64)ntp_tick_adj << NTP_SCALE_SHIFT; @@ -60,6 +61,12 @@ static void ntp_update_frequency(void) tick_nsec = div_u64(second_length, HZ) >> NTP_SCALE_SHIFT; tick_length_base = div_u64(tick_length_base, NTP_INTERVAL_FREQ); + + /* + * Don't wait for the next second_overflow, apply + * the change to the tick length immediately + */ + tick_length += tick_length_base - old_tick_length_base; } static void ntp_update_offset(long offset) Index: linux-2.6-tip/kernel/timer.c =================================================================== --- linux-2.6-tip.orig/kernel/timer.c +++ linux-2.6-tip/kernel/timer.c @@ -34,9 +34,11 @@ #include #include #include +#include #include #include #include +#include #include #include @@ -69,6 +71,7 @@ struct tvec_root { struct tvec_base { spinlock_t lock; struct timer_list *running_timer; + wait_queue_head_t wait_for_running_timer; unsigned long timer_jiffies; struct tvec_root tv1; struct tvec tv2; @@ -316,9 +319,7 @@ EXPORT_SYMBOL_GPL(round_jiffies_up_relat static inline void set_running_timer(struct tvec_base *base, struct timer_list *timer) { -#ifdef CONFIG_SMP base->running_timer = timer; -#endif } static void internal_add_timer(struct tvec_base *base, struct timer_list *timer) @@ -491,14 +492,18 @@ static inline void debug_timer_free(stru debug_object_free(timer, &timer_debug_descr); } -static void __init_timer(struct timer_list *timer); - -void init_timer_on_stack(struct timer_list *timer) +static void __init_timer(struct timer_list *timer, + const char *name, + struct lock_class_key *key); + +void init_timer_on_stack_key(struct timer_list *timer, + const char *name, + struct lock_class_key *key) { debug_object_init_on_stack(timer, &timer_debug_descr); - __init_timer(timer); + __init_timer(timer, name, key); } -EXPORT_SYMBOL_GPL(init_timer_on_stack); +EXPORT_SYMBOL_GPL(init_timer_on_stack_key); void destroy_timer_on_stack(struct timer_list *timer) { @@ -512,7 +517,9 @@ static inline void debug_timer_activate( static inline void debug_timer_deactivate(struct timer_list *timer) { } #endif -static void __init_timer(struct timer_list *timer) +static void __init_timer(struct timer_list *timer, + const char *name, + struct lock_class_key *key) { timer->entry.next = NULL; timer->base = __raw_get_cpu_var(tvec_bases); @@ -521,6 +528,7 @@ static void __init_timer(struct timer_li timer->start_pid = -1; memset(timer->start_comm, 0, TASK_COMM_LEN); #endif + lockdep_init_map(&timer->lockdep_map, name, key, 0); } /** @@ -530,19 +538,23 @@ static void __init_timer(struct timer_li * init_timer() must be done to a timer prior calling *any* of the * other timer functions. */ -void init_timer(struct timer_list *timer) +void init_timer_key(struct timer_list *timer, + const char *name, + struct lock_class_key *key) { debug_timer_init(timer); - __init_timer(timer); + __init_timer(timer, name, key); } -EXPORT_SYMBOL(init_timer); +EXPORT_SYMBOL(init_timer_key); -void init_timer_deferrable(struct timer_list *timer) +void init_timer_deferrable_key(struct timer_list *timer, + const char *name, + struct lock_class_key *key) { - init_timer(timer); + init_timer_key(timer, name, key); timer_set_deferrable(timer); } -EXPORT_SYMBOL(init_timer_deferrable); +EXPORT_SYMBOL(init_timer_deferrable_key); static inline void detach_timer(struct timer_list *timer, int clear_pending) @@ -589,11 +601,12 @@ static struct tvec_base *lock_timer_base } } -int __mod_timer(struct timer_list *timer, unsigned long expires) +static inline int +__mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only) { struct tvec_base *base, *new_base; unsigned long flags; - int ret = 0; + int cpu, ret = 0; timer_stats_timer_set_start_info(timer); BUG_ON(!timer->function); @@ -603,11 +616,15 @@ int __mod_timer(struct timer_list *timer if (timer_pending(timer)) { detach_timer(timer, 0); ret = 1; + } else { + if (pending_only) + goto out_unlock; } debug_timer_activate(timer); - new_base = __get_cpu_var(tvec_bases); + cpu = raw_smp_processor_id(); + new_base = per_cpu(tvec_bases, cpu); if (base != new_base) { /* @@ -629,42 +646,28 @@ int __mod_timer(struct timer_list *timer timer->expires = expires; internal_add_timer(base, timer); + +out_unlock: spin_unlock_irqrestore(&base->lock, flags); return ret; } -EXPORT_SYMBOL(__mod_timer); - /** - * add_timer_on - start a timer on a particular CPU - * @timer: the timer to be added - * @cpu: the CPU to start it on + * mod_timer_pending - modify a pending timer's timeout + * @timer: the pending timer to be modified + * @expires: new timeout in jiffies * - * This is not very scalable on SMP. Double adds are not possible. + * mod_timer_pending() is the same for pending timers as mod_timer(), + * but will not re-activate and modify already deleted timers. + * + * It is useful for unserialized use of timers. */ -void add_timer_on(struct timer_list *timer, int cpu) +int mod_timer_pending(struct timer_list *timer, unsigned long expires) { - struct tvec_base *base = per_cpu(tvec_bases, cpu); - unsigned long flags; - - timer_stats_timer_set_start_info(timer); - BUG_ON(timer_pending(timer) || !timer->function); - spin_lock_irqsave(&base->lock, flags); - timer_set_base(timer, base); - debug_timer_activate(timer); - internal_add_timer(base, timer); - /* - * Check whether the other CPU is idle and needs to be - * triggered to reevaluate the timer wheel when nohz is - * active. We are protected against the other CPU fiddling - * with the timer by holding the timer base lock. This also - * makes sure that a CPU on the way to idle can not evaluate - * the timer wheel. - */ - wake_up_idle_cpu(cpu); - spin_unlock_irqrestore(&base->lock, flags); + return __mod_timer(timer, expires, true); } +EXPORT_SYMBOL(mod_timer_pending); /** * mod_timer - modify a timer's timeout @@ -688,9 +691,6 @@ void add_timer_on(struct timer_list *tim */ int mod_timer(struct timer_list *timer, unsigned long expires) { - BUG_ON(!timer->function); - - timer_stats_timer_set_start_info(timer); /* * This is a common optimization triggered by the * networking code - if the timer is re-modified @@ -699,12 +699,74 @@ int mod_timer(struct timer_list *timer, if (timer->expires == expires && timer_pending(timer)) return 1; - return __mod_timer(timer, expires); + return __mod_timer(timer, expires, false); } - EXPORT_SYMBOL(mod_timer); /** + * add_timer - start a timer + * @timer: the timer to be added + * + * The kernel will do a ->function(->data) callback from the + * timer interrupt at the ->expires point in the future. The + * current time is 'jiffies'. + * + * The timer's ->expires, ->function (and if the handler uses it, ->data) + * fields must be set prior calling this function. + * + * Timers with an ->expires field in the past will be executed in the next + * timer tick. + */ +void add_timer(struct timer_list *timer) +{ + BUG_ON(timer_pending(timer)); + mod_timer(timer, timer->expires); +} +EXPORT_SYMBOL(add_timer); + +/** + * add_timer_on - start a timer on a particular CPU + * @timer: the timer to be added + * @cpu: the CPU to start it on + * + * This is not very scalable on SMP. Double adds are not possible. + */ +void add_timer_on(struct timer_list *timer, int cpu) +{ + struct tvec_base *base = per_cpu(tvec_bases, cpu); + unsigned long flags; + + timer_stats_timer_set_start_info(timer); + BUG_ON(timer_pending(timer) || !timer->function); + spin_lock_irqsave(&base->lock, flags); + timer_set_base(timer, base); + debug_timer_activate(timer); + internal_add_timer(base, timer); + /* + * Check whether the other CPU is idle and needs to be + * triggered to reevaluate the timer wheel when nohz is + * active. We are protected against the other CPU fiddling + * with the timer by holding the timer base lock. This also + * makes sure that a CPU on the way to idle can not evaluate + * the timer wheel. + */ + wake_up_idle_cpu(cpu); + spin_unlock_irqrestore(&base->lock, flags); +} + +/* + * Wait for a running timer + */ +void wait_for_running_timer(struct timer_list *timer) +{ + struct tvec_base *base = timer->base; + + if (base->running_timer == timer) + wait_event(base->wait_for_running_timer, + base->running_timer != timer); +} + +/** * del_timer - deactive a timer. * @timer: the timer to be deactivated * @@ -733,10 +795,36 @@ int del_timer(struct timer_list *timer) return ret; } - EXPORT_SYMBOL(del_timer); -#ifdef CONFIG_SMP +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_SOFTIRQS) +/* + * This function checks whether a timer is active and not running on any + * CPU. Upon successful (ret >= 0) exit the timer is not queued and the + * handler is not running on any CPU. + * + * It must not be called from interrupt contexts. + */ +int timer_pending_sync(struct timer_list *timer) +{ + struct tvec_base *base; + unsigned long flags; + int ret = -1; + + base = lock_timer_base(timer, &flags); + + if (base->running_timer == timer) + goto out; + + ret = 0; + if (timer_pending(timer)) + ret = 1; +out: + spin_unlock_irqrestore(&base->lock, flags); + + return ret; +} + /** * try_to_del_timer_sync - Try to deactivate a timer * @timer: timer do del @@ -767,7 +855,6 @@ out: return ret; } - EXPORT_SYMBOL(try_to_del_timer_sync); /** @@ -789,14 +876,22 @@ EXPORT_SYMBOL(try_to_del_timer_sync); */ int del_timer_sync(struct timer_list *timer) { +#ifdef CONFIG_LOCKDEP + unsigned long flags; + + local_irq_save(flags); + lock_map_acquire(&timer->lockdep_map); + lock_map_release(&timer->lockdep_map); + local_irq_restore(flags); +#endif + for (;;) { int ret = try_to_del_timer_sync(timer); if (ret >= 0) return ret; - cpu_relax(); + wait_for_running_timer(timer); } } - EXPORT_SYMBOL(del_timer_sync); #endif @@ -839,6 +934,20 @@ static inline void __run_timers(struct t struct list_head *head = &work_list; int index = base->timer_jiffies & TVR_MASK; + if (softirq_need_resched()) { + spin_unlock_irq(&base->lock); + wake_up(&base->wait_for_running_timer); + cond_resched_softirq_context(); + cpu_relax(); + spin_lock_irq(&base->lock); + /* + * We can simply continue after preemption, nobody + * else can touch timer_jiffies so 'index' is still + * valid. Any new jiffy will be taken care of in + * subsequent loops: + */ + } + /* * Cascade timers: */ @@ -861,23 +970,48 @@ static inline void __run_timers(struct t set_running_timer(base, timer); detach_timer(timer, 1); + spin_unlock_irq(&base->lock); { int preempt_count = preempt_count(); + +#ifdef CONFIG_LOCKDEP + /* + * It is permissible to free the timer from + * inside the function that is called from + * it, this we need to take into account for + * lockdep too. To avoid bogus "held lock + * freed" warnings as well as problems when + * looking into timer->lockdep_map, make a + * copy and use that here. + */ + struct lockdep_map lockdep_map = + timer->lockdep_map; +#endif + /* + * Couple the lock chain with the lock chain at + * del_timer_sync() by acquiring the lock_map + * around the fn() call here and in + * del_timer_sync(). + */ + lock_map_acquire(&lockdep_map); + fn(data); + + lock_map_release(&lockdep_map); + if (preempt_count != preempt_count()) { - printk(KERN_ERR "huh, entered %p " - "with preempt_count %08x, exited" - " with %08x?\n", - fn, preempt_count, - preempt_count()); - BUG(); + print_symbol("BUG: unbalanced timer-handler preempt count in %s!\n", (unsigned long) fn); + printk("entered with %08x, exited with %08x.\n", preempt_count, preempt_count()); + preempt_count() = preempt_count; } } + set_running_timer(base, NULL); + cond_resched_softirq_context(); spin_lock_irq(&base->lock); } } - set_running_timer(base, NULL); + wake_up(&base->wait_for_running_timer); spin_unlock_irq(&base->lock); } @@ -1007,9 +1141,22 @@ unsigned long get_next_timer_interrupt(u struct tvec_base *base = __get_cpu_var(tvec_bases); unsigned long expires; +#ifdef CONFIG_PREEMPT_RT + /* + * On PREEMPT_RT we cannot sleep here. If the trylock does not + * succeed then we return the worst-case 'expires in 1 tick' + * value: + */ + if (spin_trylock(&base->lock)) { + expires = __next_timer_interrupt(base); + spin_unlock(&base->lock); + } else + expires = now + 1; +#else spin_lock(&base->lock); expires = __next_timer_interrupt(base); spin_unlock(&base->lock); +#endif if (time_before_eq(expires, now)) return now; @@ -1029,11 +1176,10 @@ void update_process_times(int user_tick) /* Note: this timer irq context must be accounted for as well. */ account_process_tick(p, user_tick); + scheduler_tick(); run_local_timers(); if (rcu_pending(cpu)) rcu_check_callbacks(cpu, user_tick); - printk_tick(); - scheduler_tick(); run_posix_cpu_timers(p); } @@ -1042,9 +1188,30 @@ void update_process_times(int user_tick) */ static unsigned long count_active_tasks(void) { + /* + * On PREEMPT_RT, we are running in the loadavg thread, + * so consider 1 less running tasks: + */ +#ifdef CONFIG_PREEMPT_RT + return (nr_active() - 1) * FIXED_1; +#else return nr_active() * FIXED_1; +#endif } +#ifdef CONFIG_PREEMPT_RT +/* + * Nr of active tasks - counted in fixed-point numbers + */ +static unsigned long count_active_rt_tasks(void) +{ + extern unsigned long rt_nr_running(void); + extern unsigned long rt_nr_uninterruptible(void); + + return (rt_nr_running() + rt_nr_uninterruptible()) * FIXED_1; +} +#endif + /* * Hmm.. Changed this, as the GNU make sources (load.c) seems to * imply that avenrun[] is the standard name for this kind of thing. @@ -1057,6 +1224,8 @@ unsigned long avenrun[3]; EXPORT_SYMBOL(avenrun); +unsigned long avenrun_rt[3]; + /* * calc_load - given tick count, update the avenrun load estimates. * This is called while holding a write_lock on xtime_lock. @@ -1064,33 +1233,76 @@ EXPORT_SYMBOL(avenrun); static inline void calc_load(unsigned long ticks) { unsigned long active_tasks; /* fixed-point */ +#ifdef CONFIG_PREEMPT_RT + unsigned long active_rt_tasks; /* fixed-point */ +#endif static int count = LOAD_FREQ; count -= ticks; if (unlikely(count < 0)) { active_tasks = count_active_tasks(); +#ifdef CONFIG_PREEMPT_RT + active_rt_tasks = count_active_rt_tasks(); +#endif do { CALC_LOAD(avenrun[0], EXP_1, active_tasks); CALC_LOAD(avenrun[1], EXP_5, active_tasks); CALC_LOAD(avenrun[2], EXP_15, active_tasks); +#ifdef CONFIG_PREEMPT_RT + CALC_LOAD(avenrun_rt[0], EXP_1, active_tasks); + CALC_LOAD(avenrun_rt[1], EXP_5, active_tasks); + CALC_LOAD(avenrun_rt[2], EXP_15, active_tasks); +#endif count += LOAD_FREQ; + } while (count < 0); } } -/* - * This function runs timers and the timer-tq in bottom half context. - */ -static void run_timer_softirq(struct softirq_action *h) +#ifdef CONFIG_PREEMPT_RT +static int loadavg_calculator(void *data) { - struct tvec_base *base = __get_cpu_var(tvec_bases); + unsigned long now, last; - hrtimer_run_pending(); + last = jiffies; + while (!kthread_should_stop()) { + struct timespec delay = { + .tv_sec = LOAD_FREQ / HZ, + .tv_nsec = 0 + }; + + hrtimer_nanosleep(&delay, NULL, HRTIMER_MODE_REL, + CLOCK_MONOTONIC); + now = jiffies; + write_seqlock_irq(&xtime_lock); + calc_load(now - last); + write_sequnlock_irq(&xtime_lock); + last = now; + } - if (time_after_eq(jiffies, base->timer_jiffies)) - __run_timers(base); + return 0; } +static int __init start_loadavg_calculator(void) +{ + struct task_struct *p; + struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 }; + + p = kthread_create(loadavg_calculator, NULL, "loadavg"); + if (IS_ERR(p)) { + printk(KERN_ERR "Could not create the loadavg thread.\n"); + return 1; + } + + sched_setscheduler(p, SCHED_FIFO, ¶m); + wake_up_process(p); + + return 0; +} + +late_initcall(start_loadavg_calculator); +#endif + /* * Called by the local, per-CPU timer interrupt on SMP. */ @@ -1102,13 +1314,45 @@ void run_local_timers(void) } /* - * Called by the timer interrupt. xtime_lock must already be taken - * by the timer IRQ! + * Time of day handling: */ -static inline void update_times(unsigned long ticks) +static inline void update_times(void) { - update_wall_time(); - calc_load(ticks); + static unsigned long last_tick = INITIAL_JIFFIES; + unsigned long ticks, flags; + + /* + * Dont take the xtime_lock from every CPU in + * every tick - only when needed: + */ + if (jiffies == last_tick) + return; + + write_seqlock_irqsave(&xtime_lock, flags); + ticks = jiffies - last_tick; + if (ticks) { + last_tick += ticks; +#ifndef CONFIG_PREEMPT_RT + calc_load(ticks); +#endif + } + write_sequnlock_irqrestore(&xtime_lock, flags); +} + + +/* + * This function runs timers and the timer-tq in bottom half context. + */ +static void run_timer_softirq(struct softirq_action *h) +{ + struct tvec_base *base = per_cpu(tvec_bases, raw_smp_processor_id()); + + printk_tick(); + update_times(); + hrtimer_run_pending(); + + if (time_after_eq(jiffies, base->timer_jiffies)) + __run_timers(base); } /* @@ -1120,7 +1364,7 @@ static inline void update_times(unsigned void do_timer(unsigned long ticks) { jiffies_64 += ticks; - update_times(ticks); + update_wall_time(); } #ifdef __ARCH_WANT_SYS_ALARM @@ -1268,7 +1512,7 @@ signed long __sched schedule_timeout(sig expire = timeout + jiffies; setup_timer_on_stack(&timer, process_timeout, (unsigned long)current); - __mod_timer(&timer, expire); + __mod_timer(&timer, expire, false); schedule(); del_singleshot_timer_sync(&timer); @@ -1454,6 +1698,7 @@ static int __cpuinit init_timers_cpu(int } spin_lock_init(&base->lock); + init_waitqueue_head(&base->wait_for_running_timer); for (j = 0; j < TVN_SIZE; j++) { INIT_LIST_HEAD(base->tv5.vec + j); @@ -1485,6 +1730,7 @@ static void __cpuinit migrate_timers(int { struct tvec_base *old_base; struct tvec_base *new_base; + unsigned long flags; int i; BUG_ON(cpu_online(cpu)); @@ -1494,8 +1740,11 @@ static void __cpuinit migrate_timers(int * The caller is globally serialized and nobody else * takes two locks at once, deadlock is not possible. */ - spin_lock_irq(&new_base->lock); - spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING); + local_irq_save(flags); + while (!spin_trylock(&new_base->lock)) + cpu_relax(); + while (!spin_trylock(&old_base->lock)) + cpu_relax(); BUG_ON(old_base->running_timer); @@ -1509,7 +1758,9 @@ static void __cpuinit migrate_timers(int } spin_unlock(&old_base->lock); - spin_unlock_irq(&new_base->lock); + spin_unlock(&new_base->lock); + local_irq_restore(flags); + put_cpu_var(tvec_bases); } #endif /* CONFIG_HOTPLUG_CPU */ Index: linux-2.6-tip/kernel/trace/Kconfig =================================================================== --- linux-2.6-tip.orig/kernel/trace/Kconfig +++ linux-2.6-tip/kernel/trace/Kconfig @@ -9,6 +9,9 @@ config USER_STACKTRACE_SUPPORT config NOP_TRACER bool +config HAVE_FTRACE_NMI_ENTER + bool + config HAVE_FUNCTION_TRACER bool @@ -37,6 +40,11 @@ config TRACER_MAX_TRACE config RING_BUFFER bool +config FTRACE_NMI_ENTER + bool + depends on HAVE_FTRACE_NMI_ENTER + default y + config TRACING bool select DEBUG_FS @@ -127,6 +135,7 @@ config SYSPROF_TRACER bool "Sysprof Tracer" depends on X86 select TRACING + select CONTEXT_SWITCH_TRACER help This tracer provides the trace needed by the 'Sysprof' userspace tool. @@ -165,9 +174,8 @@ config BOOT_TRACER representation of the delays during initcalls - but the raw /debug/tracing/trace text output is readable too. - ( Note that tracing self tests can't be enabled if this tracer is - selected, because the self-tests are an initcall as well and that - would invalidate the boot trace. ) + You must pass in ftrace=initcall to the kernel command line + to enable this on bootup. config TRACE_BRANCH_PROFILING bool "Trace likely/unlikely profiler" @@ -266,6 +274,62 @@ config HW_BRANCH_TRACER This tracer records all branches on the system in a circular buffer giving access to the last N branches for each cpu. +config KMEMTRACE + bool "Trace SLAB allocations" + select TRACING + help + kmemtrace provides tracing for slab allocator functions, such as + kmalloc, kfree, kmem_cache_alloc, kmem_cache_free etc.. Collected + data is then fed to the userspace application in order to analyse + allocation hotspots, internal fragmentation and so on, making it + possible to see how well an allocator performs, as well as debug + and profile kernel code. + + This requires an userspace application to use. See + Documentation/vm/kmemtrace.txt for more information. + + Saying Y will make the kernel somewhat larger and slower. However, + if you disable kmemtrace at run-time or boot-time, the performance + impact is minimal (depending on the arch the kernel is built for). + + If unsure, say N. + +config WORKQUEUE_TRACER + bool "Trace workqueues" if !PREEMPT_RT + select TRACING + help + The workqueue tracer provides some statistical informations + about each cpu workqueue thread such as the number of the + works inserted and executed since their creation. It can help + to evaluate the amount of work each of them have to perform. + For example it can help a developer to decide whether he should + choose a per cpu workqueue instead of a singlethreaded one. + +config BLK_DEV_IO_TRACE + bool "Support for tracing block io actions" + depends on SYSFS + depends on BLOCK + select RELAY + select DEBUG_FS + select TRACEPOINTS + select TRACING + select STACKTRACE if STACKTRACE_SUPPORT + help + Say Y here if you want to be able to trace the block layer actions + on a given queue. Tracing allows you to see any traffic happening + on a block device queue. For more information (and the userspace + support tools needed), fetch the blktrace tools from: + + git://git.kernel.dk/blktrace.git + + Tracing also is possible using the ftrace interface, e.g.: + + echo 1 > /sys/block/sda/sda1/trace/enable + echo blk > /sys/kernel/debug/tracing/current_tracer + cat /sys/kernel/debug/tracing/trace_pipe + + If unsure, say N. + config DYNAMIC_FTRACE bool "enable/disable ftrace tracepoints dynamically" depends on FUNCTION_TRACER @@ -296,7 +360,7 @@ config FTRACE_SELFTEST config FTRACE_STARTUP_TEST bool "Perform a startup test on ftrace" - depends on TRACING && DEBUG_KERNEL && !BOOT_TRACER + depends on TRACING && DEBUG_KERNEL select FTRACE_SELFTEST help This option performs a series of startup tests on ftrace. On bootup Index: linux-2.6-tip/kernel/trace/Makefile =================================================================== --- linux-2.6-tip.orig/kernel/trace/Makefile +++ linux-2.6-tip/kernel/trace/Makefile @@ -19,6 +19,8 @@ obj-$(CONFIG_FUNCTION_TRACER) += libftra obj-$(CONFIG_RING_BUFFER) += ring_buffer.o obj-$(CONFIG_TRACING) += trace.o +obj-$(CONFIG_TRACING) += trace_output.o +obj-$(CONFIG_TRACING) += trace_stat.o obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o obj-$(CONFIG_SYSPROF_TRACER) += trace_sysprof.o obj-$(CONFIG_FUNCTION_TRACER) += trace_functions.o @@ -33,5 +35,8 @@ obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += t obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o obj-$(CONFIG_HW_BRANCH_TRACER) += trace_hw_branches.o obj-$(CONFIG_POWER_TRACER) += trace_power.o +obj-$(CONFIG_KMEMTRACE) += kmemtrace.o +obj-$(CONFIG_WORKQUEUE_TRACER) += trace_workqueue.o +obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o libftrace-y := ftrace.o Index: linux-2.6-tip/kernel/trace/blktrace.c =================================================================== --- /dev/null +++ linux-2.6-tip/kernel/trace/blktrace.c @@ -0,0 +1,1538 @@ +/* + * Copyright (C) 2006 Jens Axboe + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + * + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "trace_output.h" + +static unsigned int blktrace_seq __read_mostly = 1; + +static struct trace_array *blk_tr; +static int __read_mostly blk_tracer_enabled; + +/* Select an alternative, minimalistic output than the original one */ +#define TRACE_BLK_OPT_CLASSIC 0x1 + +static struct tracer_opt blk_tracer_opts[] = { + /* Default disable the minimalistic output */ + { TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC) }, + { } +}; + +static struct tracer_flags blk_tracer_flags = { + .val = 0, + .opts = blk_tracer_opts, +}; + +/* Global reference count of probes */ +static DEFINE_MUTEX(blk_probe_mutex); +static atomic_t blk_probes_ref = ATOMIC_INIT(0); + +static int blk_register_tracepoints(void); +static void blk_unregister_tracepoints(void); + +/* + * Send out a notify message. + */ +static void trace_note(struct blk_trace *bt, pid_t pid, int action, + const void *data, size_t len) +{ + struct blk_io_trace *t; + + if (!bt->rchan) + return; + + t = relay_reserve(bt->rchan, sizeof(*t) + len); + if (t) { + const int cpu = smp_processor_id(); + + t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION; + t->time = ktime_to_ns(ktime_get()); + t->device = bt->dev; + t->action = action; + t->pid = pid; + t->cpu = cpu; + t->pdu_len = len; + memcpy((void *) t + sizeof(*t), data, len); + } +} + +/* + * Send out a notify for this process, if we haven't done so since a trace + * started + */ +static void trace_note_tsk(struct blk_trace *bt, struct task_struct *tsk) +{ + tsk->btrace_seq = blktrace_seq; + trace_note(bt, tsk->pid, BLK_TN_PROCESS, tsk->comm, sizeof(tsk->comm)); +} + +static void trace_note_time(struct blk_trace *bt) +{ + struct timespec now; + unsigned long flags; + u32 words[2]; + + getnstimeofday(&now); + words[0] = now.tv_sec; + words[1] = now.tv_nsec; + + local_irq_save(flags); + trace_note(bt, 0, BLK_TN_TIMESTAMP, words, sizeof(words)); + local_irq_restore(flags); +} + +void __trace_note_message(struct blk_trace *bt, const char *fmt, ...) +{ + int n; + va_list args; + unsigned long flags; + char *buf; + + if (blk_tr) { + va_start(args, fmt); + ftrace_vprintk(fmt, args); + va_end(args); + return; + } + + if (!bt->msg_data) + return; + + local_irq_save(flags); + buf = per_cpu_ptr(bt->msg_data, smp_processor_id()); + va_start(args, fmt); + n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args); + va_end(args); + + trace_note(bt, 0, BLK_TN_MESSAGE, buf, n); + local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(__trace_note_message); + +static int act_log_check(struct blk_trace *bt, u32 what, sector_t sector, + pid_t pid) +{ + if (((bt->act_mask << BLK_TC_SHIFT) & what) == 0) + return 1; + if (sector < bt->start_lba || sector > bt->end_lba) + return 1; + if (bt->pid && pid != bt->pid) + return 1; + + return 0; +} + +/* + * Data direction bit lookup + */ +static u32 ddir_act[2] __read_mostly = { BLK_TC_ACT(BLK_TC_READ), + BLK_TC_ACT(BLK_TC_WRITE) }; + +/* The ilog2() calls fall out because they're constant */ +#define MASK_TC_BIT(rw, __name) ((rw & (1 << BIO_RW_ ## __name)) << \ + (ilog2(BLK_TC_ ## __name) + BLK_TC_SHIFT - BIO_RW_ ## __name)) + +/* + * The worker for the various blk_add_trace*() types. Fills out a + * blk_io_trace structure and places it in a per-cpu subbuffer. + */ +static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes, + int rw, u32 what, int error, int pdu_len, void *pdu_data) +{ + struct task_struct *tsk = current; + struct ring_buffer_event *event = NULL; + struct blk_io_trace *t; + unsigned long flags = 0; + unsigned long *sequence; + pid_t pid; + int cpu, pc = 0; + + if (unlikely(bt->trace_state != Blktrace_running || + !blk_tracer_enabled)) + return; + + what |= ddir_act[rw & WRITE]; + what |= MASK_TC_BIT(rw, BARRIER); + what |= MASK_TC_BIT(rw, SYNCIO); + what |= MASK_TC_BIT(rw, AHEAD); + what |= MASK_TC_BIT(rw, META); + what |= MASK_TC_BIT(rw, DISCARD); + + pid = tsk->pid; + if (unlikely(act_log_check(bt, what, sector, pid))) + return; + cpu = raw_smp_processor_id(); + + if (blk_tr) { + tracing_record_cmdline(current); + + pc = preempt_count(); + event = trace_buffer_lock_reserve(blk_tr, TRACE_BLK, + sizeof(*t) + pdu_len, + 0, pc); + if (!event) + return; + t = ring_buffer_event_data(event); + goto record_it; + } + + /* + * A word about the locking here - we disable interrupts to reserve + * some space in the relay per-cpu buffer, to prevent an irq + * from coming in and stepping on our toes. + */ + local_irq_save(flags); + + if (unlikely(tsk->btrace_seq != blktrace_seq)) + trace_note_tsk(bt, tsk); + + t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len); + if (t) { + sequence = per_cpu_ptr(bt->sequence, cpu); + + t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION; + t->sequence = ++(*sequence); + t->time = ktime_to_ns(ktime_get()); +record_it: + /* + * These two are not needed in ftrace as they are in the + * generic trace_entry, filled by tracing_generic_entry_update, + * but for the trace_event->bin() synthesizer benefit we do it + * here too. + */ + t->cpu = cpu; + t->pid = pid; + + t->sector = sector; + t->bytes = bytes; + t->action = what; + t->device = bt->dev; + t->error = error; + t->pdu_len = pdu_len; + + if (pdu_len) + memcpy((void *) t + sizeof(*t), pdu_data, pdu_len); + + if (blk_tr) { + trace_buffer_unlock_commit(blk_tr, event, 0, pc); + return; + } + } + + local_irq_restore(flags); +} + +static struct dentry *blk_tree_root; +static DEFINE_MUTEX(blk_tree_mutex); + +static void blk_trace_cleanup(struct blk_trace *bt) +{ + debugfs_remove(bt->msg_file); + debugfs_remove(bt->dropped_file); + relay_close(bt->rchan); + free_percpu(bt->sequence); + free_percpu(bt->msg_data); + kfree(bt); + mutex_lock(&blk_probe_mutex); + if (atomic_dec_and_test(&blk_probes_ref)) + blk_unregister_tracepoints(); + mutex_unlock(&blk_probe_mutex); +} + +int blk_trace_remove(struct request_queue *q) +{ + struct blk_trace *bt; + + bt = xchg(&q->blk_trace, NULL); + if (!bt) + return -EINVAL; + + if (bt->trace_state == Blktrace_setup || + bt->trace_state == Blktrace_stopped) + blk_trace_cleanup(bt); + + return 0; +} +EXPORT_SYMBOL_GPL(blk_trace_remove); + +static int blk_dropped_open(struct inode *inode, struct file *filp) +{ + filp->private_data = inode->i_private; + + return 0; +} + +static ssize_t blk_dropped_read(struct file *filp, char __user *buffer, + size_t count, loff_t *ppos) +{ + struct blk_trace *bt = filp->private_data; + char buf[16]; + + snprintf(buf, sizeof(buf), "%u\n", atomic_read(&bt->dropped)); + + return simple_read_from_buffer(buffer, count, ppos, buf, strlen(buf)); +} + +static const struct file_operations blk_dropped_fops = { + .owner = THIS_MODULE, + .open = blk_dropped_open, + .read = blk_dropped_read, +}; + +static int blk_msg_open(struct inode *inode, struct file *filp) +{ + filp->private_data = inode->i_private; + + return 0; +} + +static ssize_t blk_msg_write(struct file *filp, const char __user *buffer, + size_t count, loff_t *ppos) +{ + char *msg; + struct blk_trace *bt; + + if (count > BLK_TN_MAX_MSG) + return -EINVAL; + + msg = kmalloc(count, GFP_KERNEL); + if (msg == NULL) + return -ENOMEM; + + if (copy_from_user(msg, buffer, count)) { + kfree(msg); + return -EFAULT; + } + + bt = filp->private_data; + __trace_note_message(bt, "%s", msg); + kfree(msg); + + return count; +} + +static const struct file_operations blk_msg_fops = { + .owner = THIS_MODULE, + .open = blk_msg_open, + .write = blk_msg_write, +}; + +/* + * Keep track of how many times we encountered a full subbuffer, to aid + * the user space app in telling how many lost events there were. + */ +static int blk_subbuf_start_callback(struct rchan_buf *buf, void *subbuf, + void *prev_subbuf, size_t prev_padding) +{ + struct blk_trace *bt; + + if (!relay_buf_full(buf)) + return 1; + + bt = buf->chan->private_data; + atomic_inc(&bt->dropped); + return 0; +} + +static int blk_remove_buf_file_callback(struct dentry *dentry) +{ + struct dentry *parent = dentry->d_parent; + debugfs_remove(dentry); + + /* + * this will fail for all but the last file, but that is ok. what we + * care about is the top level buts->name directory going away, when + * the last trace file is gone. Then we don't have to rmdir() that + * manually on trace stop, so it nicely solves the issue with + * force killing of running traces. + */ + + debugfs_remove(parent); + return 0; +} + +static struct dentry *blk_create_buf_file_callback(const char *filename, + struct dentry *parent, + int mode, + struct rchan_buf *buf, + int *is_global) +{ + return debugfs_create_file(filename, mode, parent, buf, + &relay_file_operations); +} + +static struct rchan_callbacks blk_relay_callbacks = { + .subbuf_start = blk_subbuf_start_callback, + .create_buf_file = blk_create_buf_file_callback, + .remove_buf_file = blk_remove_buf_file_callback, +}; + +/* + * Setup everything required to start tracing + */ +int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev, + struct blk_user_trace_setup *buts) +{ + struct blk_trace *old_bt, *bt = NULL; + struct dentry *dir = NULL; + int ret, i; + + if (!buts->buf_size || !buts->buf_nr) + return -EINVAL; + + strncpy(buts->name, name, BLKTRACE_BDEV_SIZE); + buts->name[BLKTRACE_BDEV_SIZE - 1] = '\0'; + + /* + * some device names have larger paths - convert the slashes + * to underscores for this to work as expected + */ + for (i = 0; i < strlen(buts->name); i++) + if (buts->name[i] == '/') + buts->name[i] = '_'; + + ret = -ENOMEM; + bt = kzalloc(sizeof(*bt), GFP_KERNEL); + if (!bt) + goto err; + + bt->sequence = alloc_percpu(unsigned long); + if (!bt->sequence) + goto err; + + bt->msg_data = __alloc_percpu(BLK_TN_MAX_MSG); + if (!bt->msg_data) + goto err; + + ret = -ENOENT; + + if (!blk_tree_root) { + blk_tree_root = debugfs_create_dir("block", NULL); + if (!blk_tree_root) + return -ENOMEM; + } + + dir = debugfs_create_dir(buts->name, blk_tree_root); + + if (!dir) + goto err; + + bt->dir = dir; + bt->dev = dev; + atomic_set(&bt->dropped, 0); + + ret = -EIO; + bt->dropped_file = debugfs_create_file("dropped", 0444, dir, bt, + &blk_dropped_fops); + if (!bt->dropped_file) + goto err; + + bt->msg_file = debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops); + if (!bt->msg_file) + goto err; + + bt->rchan = relay_open("trace", dir, buts->buf_size, + buts->buf_nr, &blk_relay_callbacks, bt); + if (!bt->rchan) + goto err; + + bt->act_mask = buts->act_mask; + if (!bt->act_mask) + bt->act_mask = (u16) -1; + + bt->start_lba = buts->start_lba; + bt->end_lba = buts->end_lba; + if (!bt->end_lba) + bt->end_lba = -1ULL; + + bt->pid = buts->pid; + bt->trace_state = Blktrace_setup; + + mutex_lock(&blk_probe_mutex); + if (atomic_add_return(1, &blk_probes_ref) == 1) { + ret = blk_register_tracepoints(); + if (ret) + goto probe_err; + } + mutex_unlock(&blk_probe_mutex); + + ret = -EBUSY; + old_bt = xchg(&q->blk_trace, bt); + if (old_bt) { + (void) xchg(&q->blk_trace, old_bt); + goto err; + } + + return 0; +probe_err: + atomic_dec(&blk_probes_ref); + mutex_unlock(&blk_probe_mutex); +err: + if (bt) { + if (bt->msg_file) + debugfs_remove(bt->msg_file); + if (bt->dropped_file) + debugfs_remove(bt->dropped_file); + free_percpu(bt->sequence); + free_percpu(bt->msg_data); + if (bt->rchan) + relay_close(bt->rchan); + kfree(bt); + } + return ret; +} + +int blk_trace_setup(struct request_queue *q, char *name, dev_t dev, + char __user *arg) +{ + struct blk_user_trace_setup buts; + int ret; + + ret = copy_from_user(&buts, arg, sizeof(buts)); + if (ret) + return -EFAULT; + + ret = do_blk_trace_setup(q, name, dev, &buts); + if (ret) + return ret; + + if (copy_to_user(arg, &buts, sizeof(buts))) + return -EFAULT; + + return 0; +} +EXPORT_SYMBOL_GPL(blk_trace_setup); + +int blk_trace_startstop(struct request_queue *q, int start) +{ + int ret; + struct blk_trace *bt = q->blk_trace; + + if (bt == NULL) + return -EINVAL; + + /* + * For starting a trace, we can transition from a setup or stopped + * trace. For stopping a trace, the state must be running + */ + ret = -EINVAL; + if (start) { + if (bt->trace_state == Blktrace_setup || + bt->trace_state == Blktrace_stopped) { + blktrace_seq++; + smp_mb(); + bt->trace_state = Blktrace_running; + + trace_note_time(bt); + ret = 0; + } + } else { + if (bt->trace_state == Blktrace_running) { + bt->trace_state = Blktrace_stopped; + relay_flush(bt->rchan); + ret = 0; + } + } + + return ret; +} +EXPORT_SYMBOL_GPL(blk_trace_startstop); + +/** + * blk_trace_ioctl: - handle the ioctls associated with tracing + * @bdev: the block device + * @cmd: the ioctl cmd + * @arg: the argument data, if any + * + **/ +int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg) +{ + struct request_queue *q; + int ret, start = 0; + char b[BDEVNAME_SIZE]; + + q = bdev_get_queue(bdev); + if (!q) + return -ENXIO; + + mutex_lock(&bdev->bd_mutex); + + switch (cmd) { + case BLKTRACESETUP: + bdevname(bdev, b); + ret = blk_trace_setup(q, b, bdev->bd_dev, arg); + break; + case BLKTRACESTART: + start = 1; + case BLKTRACESTOP: + ret = blk_trace_startstop(q, start); + break; + case BLKTRACETEARDOWN: + ret = blk_trace_remove(q); + break; + default: + ret = -ENOTTY; + break; + } + + mutex_unlock(&bdev->bd_mutex); + return ret; +} + +/** + * blk_trace_shutdown: - stop and cleanup trace structures + * @q: the request queue associated with the device + * + **/ +void blk_trace_shutdown(struct request_queue *q) +{ + if (q->blk_trace) { + blk_trace_startstop(q, 0); + blk_trace_remove(q); + } +} + +/* + * blktrace probes + */ + +/** + * blk_add_trace_rq - Add a trace for a request oriented action + * @q: queue the io is for + * @rq: the source request + * @what: the action + * + * Description: + * Records an action against a request. Will log the bio offset + size. + * + **/ +static void blk_add_trace_rq(struct request_queue *q, struct request *rq, + u32 what) +{ + struct blk_trace *bt = q->blk_trace; + int rw = rq->cmd_flags & 0x03; + + if (likely(!bt)) + return; + + if (blk_discard_rq(rq)) + rw |= (1 << BIO_RW_DISCARD); + + if (blk_pc_request(rq)) { + what |= BLK_TC_ACT(BLK_TC_PC); + __blk_add_trace(bt, 0, rq->data_len, rw, what, rq->errors, + sizeof(rq->cmd), rq->cmd); + } else { + what |= BLK_TC_ACT(BLK_TC_FS); + __blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, + rw, what, rq->errors, 0, NULL); + } +} + +static void blk_add_trace_rq_abort(struct request_queue *q, struct request *rq) +{ + blk_add_trace_rq(q, rq, BLK_TA_ABORT); +} + +static void blk_add_trace_rq_insert(struct request_queue *q, struct request *rq) +{ + blk_add_trace_rq(q, rq, BLK_TA_INSERT); +} + +static void blk_add_trace_rq_issue(struct request_queue *q, struct request *rq) +{ + blk_add_trace_rq(q, rq, BLK_TA_ISSUE); +} + +static void blk_add_trace_rq_requeue(struct request_queue *q, + struct request *rq) +{ + blk_add_trace_rq(q, rq, BLK_TA_REQUEUE); +} + +static void blk_add_trace_rq_complete(struct request_queue *q, + struct request *rq) +{ + blk_add_trace_rq(q, rq, BLK_TA_COMPLETE); +} + +/** + * blk_add_trace_bio - Add a trace for a bio oriented action + * @q: queue the io is for + * @bio: the source bio + * @what: the action + * + * Description: + * Records an action against a bio. Will log the bio offset + size. + * + **/ +static void blk_add_trace_bio(struct request_queue *q, struct bio *bio, + u32 what) +{ + struct blk_trace *bt = q->blk_trace; + + if (likely(!bt)) + return; + + __blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, + !bio_flagged(bio, BIO_UPTODATE), 0, NULL); +} + +static void blk_add_trace_bio_bounce(struct request_queue *q, struct bio *bio) +{ + blk_add_trace_bio(q, bio, BLK_TA_BOUNCE); +} + +static void blk_add_trace_bio_complete(struct request_queue *q, struct bio *bio) +{ + blk_add_trace_bio(q, bio, BLK_TA_COMPLETE); +} + +static void blk_add_trace_bio_backmerge(struct request_queue *q, + struct bio *bio) +{ + blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE); +} + +static void blk_add_trace_bio_frontmerge(struct request_queue *q, + struct bio *bio) +{ + blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE); +} + +static void blk_add_trace_bio_queue(struct request_queue *q, struct bio *bio) +{ + blk_add_trace_bio(q, bio, BLK_TA_QUEUE); +} + +static void blk_add_trace_getrq(struct request_queue *q, + struct bio *bio, int rw) +{ + if (bio) + blk_add_trace_bio(q, bio, BLK_TA_GETRQ); + else { + struct blk_trace *bt = q->blk_trace; + + if (bt) + __blk_add_trace(bt, 0, 0, rw, BLK_TA_GETRQ, 0, 0, NULL); + } +} + + +static void blk_add_trace_sleeprq(struct request_queue *q, + struct bio *bio, int rw) +{ + if (bio) + blk_add_trace_bio(q, bio, BLK_TA_SLEEPRQ); + else { + struct blk_trace *bt = q->blk_trace; + + if (bt) + __blk_add_trace(bt, 0, 0, rw, BLK_TA_SLEEPRQ, + 0, 0, NULL); + } +} + +static void blk_add_trace_plug(struct request_queue *q) +{ + struct blk_trace *bt = q->blk_trace; + + if (bt) + __blk_add_trace(bt, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL); +} + +static void blk_add_trace_unplug_io(struct request_queue *q) +{ + struct blk_trace *bt = q->blk_trace; + + if (bt) { + unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE]; + __be64 rpdu = cpu_to_be64(pdu); + + __blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_IO, 0, + sizeof(rpdu), &rpdu); + } +} + +static void blk_add_trace_unplug_timer(struct request_queue *q) +{ + struct blk_trace *bt = q->blk_trace; + + if (bt) { + unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE]; + __be64 rpdu = cpu_to_be64(pdu); + + __blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_TIMER, 0, + sizeof(rpdu), &rpdu); + } +} + +static void blk_add_trace_split(struct request_queue *q, struct bio *bio, + unsigned int pdu) +{ + struct blk_trace *bt = q->blk_trace; + + if (bt) { + __be64 rpdu = cpu_to_be64(pdu); + + __blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, + BLK_TA_SPLIT, !bio_flagged(bio, BIO_UPTODATE), + sizeof(rpdu), &rpdu); + } +} + +/** + * blk_add_trace_remap - Add a trace for a remap operation + * @q: queue the io is for + * @bio: the source bio + * @dev: target device + * @from: source sector + * @to: target sector + * + * Description: + * Device mapper or raid target sometimes need to split a bio because + * it spans a stripe (or similar). Add a trace for that action. + * + **/ +static void blk_add_trace_remap(struct request_queue *q, struct bio *bio, + dev_t dev, sector_t from, sector_t to) +{ + struct blk_trace *bt = q->blk_trace; + struct blk_io_trace_remap r; + + if (likely(!bt)) + return; + + r.device = cpu_to_be32(dev); + r.device_from = cpu_to_be32(bio->bi_bdev->bd_dev); + r.sector = cpu_to_be64(to); + + __blk_add_trace(bt, from, bio->bi_size, bio->bi_rw, BLK_TA_REMAP, + !bio_flagged(bio, BIO_UPTODATE), sizeof(r), &r); +} + +/** + * blk_add_driver_data - Add binary message with driver-specific data + * @q: queue the io is for + * @rq: io request + * @data: driver-specific data + * @len: length of driver-specific data + * + * Description: + * Some drivers might want to write driver-specific data per request. + * + **/ +void blk_add_driver_data(struct request_queue *q, + struct request *rq, + void *data, size_t len) +{ + struct blk_trace *bt = q->blk_trace; + + if (likely(!bt)) + return; + + if (blk_pc_request(rq)) + __blk_add_trace(bt, 0, rq->data_len, 0, BLK_TA_DRV_DATA, + rq->errors, len, data); + else + __blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, + 0, BLK_TA_DRV_DATA, rq->errors, len, data); +} +EXPORT_SYMBOL_GPL(blk_add_driver_data); + +static int blk_register_tracepoints(void) +{ + int ret; + + ret = register_trace_block_rq_abort(blk_add_trace_rq_abort); + WARN_ON(ret); + ret = register_trace_block_rq_insert(blk_add_trace_rq_insert); + WARN_ON(ret); + ret = register_trace_block_rq_issue(blk_add_trace_rq_issue); + WARN_ON(ret); + ret = register_trace_block_rq_requeue(blk_add_trace_rq_requeue); + WARN_ON(ret); + ret = register_trace_block_rq_complete(blk_add_trace_rq_complete); + WARN_ON(ret); + ret = register_trace_block_bio_bounce(blk_add_trace_bio_bounce); + WARN_ON(ret); + ret = register_trace_block_bio_complete(blk_add_trace_bio_complete); + WARN_ON(ret); + ret = register_trace_block_bio_backmerge(blk_add_trace_bio_backmerge); + WARN_ON(ret); + ret = register_trace_block_bio_frontmerge(blk_add_trace_bio_frontmerge); + WARN_ON(ret); + ret = register_trace_block_bio_queue(blk_add_trace_bio_queue); + WARN_ON(ret); + ret = register_trace_block_getrq(blk_add_trace_getrq); + WARN_ON(ret); + ret = register_trace_block_sleeprq(blk_add_trace_sleeprq); + WARN_ON(ret); + ret = register_trace_block_plug(blk_add_trace_plug); + WARN_ON(ret); + ret = register_trace_block_unplug_timer(blk_add_trace_unplug_timer); + WARN_ON(ret); + ret = register_trace_block_unplug_io(blk_add_trace_unplug_io); + WARN_ON(ret); + ret = register_trace_block_split(blk_add_trace_split); + WARN_ON(ret); + ret = register_trace_block_remap(blk_add_trace_remap); + WARN_ON(ret); + return 0; +} + +static void blk_unregister_tracepoints(void) +{ + unregister_trace_block_remap(blk_add_trace_remap); + unregister_trace_block_split(blk_add_trace_split); + unregister_trace_block_unplug_io(blk_add_trace_unplug_io); + unregister_trace_block_unplug_timer(blk_add_trace_unplug_timer); + unregister_trace_block_plug(blk_add_trace_plug); + unregister_trace_block_sleeprq(blk_add_trace_sleeprq); + unregister_trace_block_getrq(blk_add_trace_getrq); + unregister_trace_block_bio_queue(blk_add_trace_bio_queue); + unregister_trace_block_bio_frontmerge(blk_add_trace_bio_frontmerge); + unregister_trace_block_bio_backmerge(blk_add_trace_bio_backmerge); + unregister_trace_block_bio_complete(blk_add_trace_bio_complete); + unregister_trace_block_bio_bounce(blk_add_trace_bio_bounce); + unregister_trace_block_rq_complete(blk_add_trace_rq_complete); + unregister_trace_block_rq_requeue(blk_add_trace_rq_requeue); + unregister_trace_block_rq_issue(blk_add_trace_rq_issue); + unregister_trace_block_rq_insert(blk_add_trace_rq_insert); + unregister_trace_block_rq_abort(blk_add_trace_rq_abort); + + tracepoint_synchronize_unregister(); +} + +/* + * struct blk_io_tracer formatting routines + */ + +static void fill_rwbs(char *rwbs, const struct blk_io_trace *t) +{ + int i = 0; + + if (t->action & BLK_TC_DISCARD) + rwbs[i++] = 'D'; + else if (t->action & BLK_TC_WRITE) + rwbs[i++] = 'W'; + else if (t->bytes) + rwbs[i++] = 'R'; + else + rwbs[i++] = 'N'; + + if (t->action & BLK_TC_AHEAD) + rwbs[i++] = 'A'; + if (t->action & BLK_TC_BARRIER) + rwbs[i++] = 'B'; + if (t->action & BLK_TC_SYNC) + rwbs[i++] = 'S'; + if (t->action & BLK_TC_META) + rwbs[i++] = 'M'; + + rwbs[i] = '\0'; +} + +static inline +const struct blk_io_trace *te_blk_io_trace(const struct trace_entry *ent) +{ + return (const struct blk_io_trace *)ent; +} + +static inline const void *pdu_start(const struct trace_entry *ent) +{ + return te_blk_io_trace(ent) + 1; +} + +static inline u32 t_sec(const struct trace_entry *ent) +{ + return te_blk_io_trace(ent)->bytes >> 9; +} + +static inline unsigned long long t_sector(const struct trace_entry *ent) +{ + return te_blk_io_trace(ent)->sector; +} + +static inline __u16 t_error(const struct trace_entry *ent) +{ + return te_blk_io_trace(ent)->sector; +} + +static __u64 get_pdu_int(const struct trace_entry *ent) +{ + const __u64 *val = pdu_start(ent); + return be64_to_cpu(*val); +} + +static void get_pdu_remap(const struct trace_entry *ent, + struct blk_io_trace_remap *r) +{ + const struct blk_io_trace_remap *__r = pdu_start(ent); + __u64 sector = __r->sector; + + r->device = be32_to_cpu(__r->device); + r->device_from = be32_to_cpu(__r->device_from); + r->sector = be64_to_cpu(sector); +} + +static int blk_log_action_iter(struct trace_iterator *iter, const char *act) +{ + char rwbs[6]; + unsigned long long ts = ns2usecs(iter->ts); + unsigned long usec_rem = do_div(ts, USEC_PER_SEC); + unsigned secs = (unsigned long)ts; + const struct trace_entry *ent = iter->ent; + const struct blk_io_trace *t = (const struct blk_io_trace *)ent; + + fill_rwbs(rwbs, t); + + return trace_seq_printf(&iter->seq, + "%3d,%-3d %2d %5d.%06lu %5u %2s %3s ", + MAJOR(t->device), MINOR(t->device), iter->cpu, + secs, usec_rem, ent->pid, act, rwbs); +} + +static int blk_log_action_seq(struct trace_seq *s, const struct blk_io_trace *t, + const char *act) +{ + char rwbs[6]; + fill_rwbs(rwbs, t); + return trace_seq_printf(s, "%3d,%-3d %2s %3s ", + MAJOR(t->device), MINOR(t->device), act, rwbs); +} + +static int blk_log_generic(struct trace_seq *s, const struct trace_entry *ent) +{ + const char *cmd = trace_find_cmdline(ent->pid); + + if (t_sec(ent)) + return trace_seq_printf(s, "%llu + %u [%s]\n", + t_sector(ent), t_sec(ent), cmd); + return trace_seq_printf(s, "[%s]\n", cmd); +} + +static int blk_log_with_error(struct trace_seq *s, + const struct trace_entry *ent) +{ + if (t_sec(ent)) + return trace_seq_printf(s, "%llu + %u [%d]\n", t_sector(ent), + t_sec(ent), t_error(ent)); + return trace_seq_printf(s, "%llu [%d]\n", t_sector(ent), t_error(ent)); +} + +static int blk_log_remap(struct trace_seq *s, const struct trace_entry *ent) +{ + struct blk_io_trace_remap r = { .device = 0, }; + + get_pdu_remap(ent, &r); + return trace_seq_printf(s, "%llu + %u <- (%d,%d) %llu\n", + t_sector(ent), + t_sec(ent), MAJOR(r.device), MINOR(r.device), + (unsigned long long)r.sector); +} + +static int blk_log_plug(struct trace_seq *s, const struct trace_entry *ent) +{ + return trace_seq_printf(s, "[%s]\n", trace_find_cmdline(ent->pid)); +} + +static int blk_log_unplug(struct trace_seq *s, const struct trace_entry *ent) +{ + return trace_seq_printf(s, "[%s] %llu\n", trace_find_cmdline(ent->pid), + get_pdu_int(ent)); +} + +static int blk_log_split(struct trace_seq *s, const struct trace_entry *ent) +{ + return trace_seq_printf(s, "%llu / %llu [%s]\n", t_sector(ent), + get_pdu_int(ent), trace_find_cmdline(ent->pid)); +} + +/* + * struct tracer operations + */ + +static void blk_tracer_print_header(struct seq_file *m) +{ + if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC)) + return; + seq_puts(m, "# DEV CPU TIMESTAMP PID ACT FLG\n" + "# | | | | | |\n"); +} + +static void blk_tracer_start(struct trace_array *tr) +{ + mutex_lock(&blk_probe_mutex); + if (atomic_add_return(1, &blk_probes_ref) == 1) + if (blk_register_tracepoints()) + atomic_dec(&blk_probes_ref); + mutex_unlock(&blk_probe_mutex); + trace_flags &= ~TRACE_ITER_CONTEXT_INFO; +} + +static int blk_tracer_init(struct trace_array *tr) +{ + blk_tr = tr; + blk_tracer_start(tr); + mutex_lock(&blk_probe_mutex); + blk_tracer_enabled++; + mutex_unlock(&blk_probe_mutex); + return 0; +} + +static void blk_tracer_stop(struct trace_array *tr) +{ + trace_flags |= TRACE_ITER_CONTEXT_INFO; + mutex_lock(&blk_probe_mutex); + if (atomic_dec_and_test(&blk_probes_ref)) + blk_unregister_tracepoints(); + mutex_unlock(&blk_probe_mutex); +} + +static void blk_tracer_reset(struct trace_array *tr) +{ + if (!atomic_read(&blk_probes_ref)) + return; + + mutex_lock(&blk_probe_mutex); + blk_tracer_enabled--; + WARN_ON(blk_tracer_enabled < 0); + mutex_unlock(&blk_probe_mutex); + + blk_tracer_stop(tr); +} + +static struct { + const char *act[2]; + int (*print)(struct trace_seq *s, const struct trace_entry *ent); +} what2act[] __read_mostly = { + [__BLK_TA_QUEUE] = {{ "Q", "queue" }, blk_log_generic }, + [__BLK_TA_BACKMERGE] = {{ "M", "backmerge" }, blk_log_generic }, + [__BLK_TA_FRONTMERGE] = {{ "F", "frontmerge" }, blk_log_generic }, + [__BLK_TA_GETRQ] = {{ "G", "getrq" }, blk_log_generic }, + [__BLK_TA_SLEEPRQ] = {{ "S", "sleeprq" }, blk_log_generic }, + [__BLK_TA_REQUEUE] = {{ "R", "requeue" }, blk_log_with_error }, + [__BLK_TA_ISSUE] = {{ "D", "issue" }, blk_log_generic }, + [__BLK_TA_COMPLETE] = {{ "C", "complete" }, blk_log_with_error }, + [__BLK_TA_PLUG] = {{ "P", "plug" }, blk_log_plug }, + [__BLK_TA_UNPLUG_IO] = {{ "U", "unplug_io" }, blk_log_unplug }, + [__BLK_TA_UNPLUG_TIMER] = {{ "UT", "unplug_timer" }, blk_log_unplug }, + [__BLK_TA_INSERT] = {{ "I", "insert" }, blk_log_generic }, + [__BLK_TA_SPLIT] = {{ "X", "split" }, blk_log_split }, + [__BLK_TA_BOUNCE] = {{ "B", "bounce" }, blk_log_generic }, + [__BLK_TA_REMAP] = {{ "A", "remap" }, blk_log_remap }, +}; + +static enum print_line_t blk_trace_event_print(struct trace_iterator *iter, + int flags) +{ + struct trace_seq *s = &iter->seq; + const struct blk_io_trace *t = (struct blk_io_trace *)iter->ent; + const u16 what = t->action & ((1 << BLK_TC_SHIFT) - 1); + int ret; + + if (!trace_print_context(iter)) + return TRACE_TYPE_PARTIAL_LINE; + + if (unlikely(what == 0 || what > ARRAY_SIZE(what2act))) + ret = trace_seq_printf(s, "Bad pc action %x\n", what); + else { + const bool long_act = !!(trace_flags & TRACE_ITER_VERBOSE); + ret = blk_log_action_seq(s, t, what2act[what].act[long_act]); + if (ret) + ret = what2act[what].print(s, iter->ent); + } + + return ret ? TRACE_TYPE_HANDLED : TRACE_TYPE_PARTIAL_LINE; +} + +static int blk_trace_synthesize_old_trace(struct trace_iterator *iter) +{ + struct trace_seq *s = &iter->seq; + struct blk_io_trace *t = (struct blk_io_trace *)iter->ent; + const int offset = offsetof(struct blk_io_trace, sector); + struct blk_io_trace old = { + .magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION, + .time = ns2usecs(iter->ts), + }; + + if (!trace_seq_putmem(s, &old, offset)) + return 0; + return trace_seq_putmem(s, &t->sector, + sizeof(old) - offset + t->pdu_len); +} + +static enum print_line_t +blk_trace_event_print_binary(struct trace_iterator *iter, int flags) +{ + return blk_trace_synthesize_old_trace(iter) ? + TRACE_TYPE_HANDLED : TRACE_TYPE_PARTIAL_LINE; +} + +static enum print_line_t blk_tracer_print_line(struct trace_iterator *iter) +{ + const struct blk_io_trace *t; + u16 what; + int ret; + + if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC)) + return TRACE_TYPE_UNHANDLED; + + t = (const struct blk_io_trace *)iter->ent; + what = t->action & ((1 << BLK_TC_SHIFT) - 1); + + if (unlikely(what == 0 || what > ARRAY_SIZE(what2act))) + ret = trace_seq_printf(&iter->seq, "Bad pc action %x\n", what); + else { + const bool long_act = !!(trace_flags & TRACE_ITER_VERBOSE); + ret = blk_log_action_iter(iter, what2act[what].act[long_act]); + if (ret) + ret = what2act[what].print(&iter->seq, iter->ent); + } + + return ret ? TRACE_TYPE_HANDLED : TRACE_TYPE_PARTIAL_LINE; +} + +static struct tracer blk_tracer __read_mostly = { + .name = "blk", + .init = blk_tracer_init, + .reset = blk_tracer_reset, + .start = blk_tracer_start, + .stop = blk_tracer_stop, + .print_header = blk_tracer_print_header, + .print_line = blk_tracer_print_line, + .flags = &blk_tracer_flags, +}; + +static struct trace_event trace_blk_event = { + .type = TRACE_BLK, + .trace = blk_trace_event_print, + .latency_trace = blk_trace_event_print, + .binary = blk_trace_event_print_binary, +}; + +static int __init init_blk_tracer(void) +{ + if (!register_ftrace_event(&trace_blk_event)) { + pr_warning("Warning: could not register block events\n"); + return 1; + } + + if (register_tracer(&blk_tracer) != 0) { + pr_warning("Warning: could not register the block tracer\n"); + unregister_ftrace_event(&trace_blk_event); + return 1; + } + + return 0; +} + +device_initcall(init_blk_tracer); + +static int blk_trace_remove_queue(struct request_queue *q) +{ + struct blk_trace *bt; + + bt = xchg(&q->blk_trace, NULL); + if (bt == NULL) + return -EINVAL; + + kfree(bt); + return 0; +} + +/* + * Setup everything required to start tracing + */ +static int blk_trace_setup_queue(struct request_queue *q, dev_t dev) +{ + struct blk_trace *old_bt, *bt = NULL; + int ret; + + ret = -ENOMEM; + bt = kzalloc(sizeof(*bt), GFP_KERNEL); + if (!bt) + goto err; + + bt->dev = dev; + bt->act_mask = (u16)-1; + bt->end_lba = -1ULL; + bt->trace_state = Blktrace_running; + + old_bt = xchg(&q->blk_trace, bt); + if (old_bt != NULL) { + (void)xchg(&q->blk_trace, old_bt); + kfree(bt); + ret = -EBUSY; + } + return 0; +err: + return ret; +} + +/* + * sysfs interface to enable and configure tracing + */ + +static ssize_t sysfs_blk_trace_enable_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct hd_struct *p = dev_to_part(dev); + struct block_device *bdev; + ssize_t ret = -ENXIO; + + lock_kernel(); + bdev = bdget(part_devt(p)); + if (bdev != NULL) { + struct request_queue *q = bdev_get_queue(bdev); + + if (q != NULL) { + mutex_lock(&bdev->bd_mutex); + ret = sprintf(buf, "%u\n", !!q->blk_trace); + mutex_unlock(&bdev->bd_mutex); + } + + bdput(bdev); + } + + unlock_kernel(); + return ret; +} + +static ssize_t sysfs_blk_trace_enable_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct block_device *bdev; + struct request_queue *q; + struct hd_struct *p; + int value; + ssize_t ret = -ENXIO; + + if (count == 0 || sscanf(buf, "%d", &value) != 1) + goto out; + + lock_kernel(); + p = dev_to_part(dev); + bdev = bdget(part_devt(p)); + if (bdev == NULL) + goto out_unlock_kernel; + + q = bdev_get_queue(bdev); + if (q == NULL) + goto out_bdput; + + mutex_lock(&bdev->bd_mutex); + if (value) + ret = blk_trace_setup_queue(q, bdev->bd_dev); + else + ret = blk_trace_remove_queue(q); + mutex_unlock(&bdev->bd_mutex); + + if (ret == 0) + ret = count; +out_bdput: + bdput(bdev); +out_unlock_kernel: + unlock_kernel(); +out: + return ret; +} + +static ssize_t sysfs_blk_trace_attr_show(struct device *dev, + struct device_attribute *attr, + char *buf); +static ssize_t sysfs_blk_trace_attr_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count); +#define BLK_TRACE_DEVICE_ATTR(_name) \ + DEVICE_ATTR(_name, S_IRUGO | S_IWUSR, \ + sysfs_blk_trace_attr_show, \ + sysfs_blk_trace_attr_store) + +static DEVICE_ATTR(enable, S_IRUGO | S_IWUSR, + sysfs_blk_trace_enable_show, sysfs_blk_trace_enable_store); +static BLK_TRACE_DEVICE_ATTR(act_mask); +static BLK_TRACE_DEVICE_ATTR(pid); +static BLK_TRACE_DEVICE_ATTR(start_lba); +static BLK_TRACE_DEVICE_ATTR(end_lba); + +static struct attribute *blk_trace_attrs[] = { + &dev_attr_enable.attr, + &dev_attr_act_mask.attr, + &dev_attr_pid.attr, + &dev_attr_start_lba.attr, + &dev_attr_end_lba.attr, + NULL +}; + +struct attribute_group blk_trace_attr_group = { + .name = "trace", + .attrs = blk_trace_attrs, +}; + +static int blk_str2act_mask(const char *str) +{ + int mask = 0; + char *copy = kstrdup(str, GFP_KERNEL), *s; + + if (copy == NULL) + return -ENOMEM; + + s = strstrip(copy); + + while (1) { + char *sep = strchr(s, ','); + + if (sep != NULL) + *sep = '\0'; + + if (strcasecmp(s, "barrier") == 0) + mask |= BLK_TC_BARRIER; + else if (strcasecmp(s, "complete") == 0) + mask |= BLK_TC_COMPLETE; + else if (strcasecmp(s, "fs") == 0) + mask |= BLK_TC_FS; + else if (strcasecmp(s, "issue") == 0) + mask |= BLK_TC_ISSUE; + else if (strcasecmp(s, "pc") == 0) + mask |= BLK_TC_PC; + else if (strcasecmp(s, "queue") == 0) + mask |= BLK_TC_QUEUE; + else if (strcasecmp(s, "read") == 0) + mask |= BLK_TC_READ; + else if (strcasecmp(s, "requeue") == 0) + mask |= BLK_TC_REQUEUE; + else if (strcasecmp(s, "sync") == 0) + mask |= BLK_TC_SYNC; + else if (strcasecmp(s, "write") == 0) + mask |= BLK_TC_WRITE; + + if (sep == NULL) + break; + + s = sep + 1; + } + kfree(copy); + + return mask; +} + +static ssize_t sysfs_blk_trace_attr_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct hd_struct *p = dev_to_part(dev); + struct request_queue *q; + struct block_device *bdev; + ssize_t ret = -ENXIO; + + lock_kernel(); + bdev = bdget(part_devt(p)); + if (bdev == NULL) + goto out_unlock_kernel; + + q = bdev_get_queue(bdev); + if (q == NULL) + goto out_bdput; + mutex_lock(&bdev->bd_mutex); + if (q->blk_trace == NULL) + ret = sprintf(buf, "disabled\n"); + else if (attr == &dev_attr_act_mask) + ret = sprintf(buf, "%#x\n", q->blk_trace->act_mask); + else if (attr == &dev_attr_pid) + ret = sprintf(buf, "%u\n", q->blk_trace->pid); + else if (attr == &dev_attr_start_lba) + ret = sprintf(buf, "%llu\n", q->blk_trace->start_lba); + else if (attr == &dev_attr_end_lba) + ret = sprintf(buf, "%llu\n", q->blk_trace->end_lba); + mutex_unlock(&bdev->bd_mutex); +out_bdput: + bdput(bdev); +out_unlock_kernel: + unlock_kernel(); + return ret; +} + +static ssize_t sysfs_blk_trace_attr_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct block_device *bdev; + struct request_queue *q; + struct hd_struct *p; + u64 value; + ssize_t ret = -ENXIO; + + if (count == 0) + goto out; + + if (attr == &dev_attr_act_mask) { + if (sscanf(buf, "%llx", &value) != 1) { + /* Assume it is a list of trace category names */ + value = blk_str2act_mask(buf); + if (value < 0) + goto out; + } + } else if (sscanf(buf, "%llu", &value) != 1) + goto out; + + lock_kernel(); + p = dev_to_part(dev); + bdev = bdget(part_devt(p)); + if (bdev == NULL) + goto out_unlock_kernel; + + q = bdev_get_queue(bdev); + if (q == NULL) + goto out_bdput; + + mutex_lock(&bdev->bd_mutex); + ret = 0; + if (q->blk_trace == NULL) + ret = blk_trace_setup_queue(q, bdev->bd_dev); + + if (ret == 0) { + if (attr == &dev_attr_act_mask) + q->blk_trace->act_mask = value; + else if (attr == &dev_attr_pid) + q->blk_trace->pid = value; + else if (attr == &dev_attr_start_lba) + q->blk_trace->start_lba = value; + else if (attr == &dev_attr_end_lba) + q->blk_trace->end_lba = value; + ret = count; + } + mutex_unlock(&bdev->bd_mutex); +out_bdput: + bdput(bdev); +out_unlock_kernel: + unlock_kernel(); +out: + return ret; +} Index: linux-2.6-tip/kernel/trace/ftrace.c =================================================================== --- linux-2.6-tip.orig/kernel/trace/ftrace.c +++ linux-2.6-tip/kernel/trace/ftrace.c @@ -27,6 +27,7 @@ #include #include #include +#include #include @@ -44,14 +45,14 @@ ftrace_kill(); \ } while (0) +/* hash bits for specific function selection */ +#define FTRACE_HASH_BITS 7 +#define FTRACE_FUNC_HASHSIZE (1 << FTRACE_HASH_BITS) + /* ftrace_enabled is a method to turn ftrace on or off */ int ftrace_enabled __read_mostly; static int last_ftrace_enabled; -/* set when tracing only a pid */ -struct pid *ftrace_pid_trace; -static struct pid * const ftrace_swapper_pid = &init_struct_pid; - /* Quick disabling of function tracer. */ int function_trace_stop; @@ -61,9 +62,7 @@ int function_trace_stop; */ static int ftrace_disabled __read_mostly; -static DEFINE_SPINLOCK(ftrace_lock); -static DEFINE_MUTEX(ftrace_sysctl_lock); -static DEFINE_MUTEX(ftrace_start_lock); +static DEFINE_MUTEX(ftrace_lock); static struct ftrace_ops ftrace_list_end __read_mostly = { @@ -134,9 +133,6 @@ static void ftrace_test_stop_func(unsign static int __register_ftrace_function(struct ftrace_ops *ops) { - /* should not be called from interrupt context */ - spin_lock(&ftrace_lock); - ops->next = ftrace_list; /* * We are entering ops into the ftrace_list but another @@ -172,18 +168,12 @@ static int __register_ftrace_function(st #endif } - spin_unlock(&ftrace_lock); - return 0; } static int __unregister_ftrace_function(struct ftrace_ops *ops) { struct ftrace_ops **p; - int ret = 0; - - /* should not be called from interrupt context */ - spin_lock(&ftrace_lock); /* * If we are removing the last function, then simply point @@ -192,17 +182,15 @@ static int __unregister_ftrace_function( if (ftrace_list == ops && ops->next == &ftrace_list_end) { ftrace_trace_function = ftrace_stub; ftrace_list = &ftrace_list_end; - goto out; + return 0; } for (p = &ftrace_list; *p != &ftrace_list_end; p = &(*p)->next) if (*p == ops) break; - if (*p != ops) { - ret = -1; - goto out; - } + if (*p != ops) + return -1; *p = (*p)->next; @@ -223,18 +211,14 @@ static int __unregister_ftrace_function( } } - out: - spin_unlock(&ftrace_lock); - - return ret; + return 0; } static void ftrace_update_pid_func(void) { ftrace_func_t func; - /* should not be called from interrupt context */ - spin_lock(&ftrace_lock); + mutex_lock(&ftrace_lock); if (ftrace_trace_function == ftrace_stub) goto out; @@ -256,21 +240,30 @@ static void ftrace_update_pid_func(void) #endif out: - spin_unlock(&ftrace_lock); + mutex_unlock(&ftrace_lock); } +/* set when tracing only a pid */ +struct pid *ftrace_pid_trace; +static struct pid * const ftrace_swapper_pid = &init_struct_pid; + #ifdef CONFIG_DYNAMIC_FTRACE + #ifndef CONFIG_FTRACE_MCOUNT_RECORD # error Dynamic ftrace depends on MCOUNT_RECORD #endif -/* - * Since MCOUNT_ADDR may point to mcount itself, we do not want - * to get it confused by reading a reference in the code as we - * are parsing on objcopy output of text. Use a variable for - * it instead. - */ -static unsigned long mcount_addr = MCOUNT_ADDR; +static struct hlist_head ftrace_func_hash[FTRACE_FUNC_HASHSIZE] __read_mostly; + +struct ftrace_func_probe { + struct hlist_node node; + struct ftrace_probe_ops *ops; + unsigned long flags; + unsigned long ip; + void *data; + struct rcu_head rcu; +}; + enum { FTRACE_ENABLE_CALLS = (1 << 0), @@ -290,7 +283,7 @@ static DEFINE_MUTEX(ftrace_regex_lock); struct ftrace_page { struct ftrace_page *next; - unsigned long index; + int index; struct dyn_ftrace records[]; }; @@ -305,6 +298,19 @@ static struct ftrace_page *ftrace_pages; static struct dyn_ftrace *ftrace_free_records; +/* + * This is a double for. Do not use 'break' to break out of the loop, + * you must use a goto. + */ +#define do_for_each_ftrace_rec(pg, rec) \ + for (pg = ftrace_pages_start; pg; pg = pg->next) { \ + int _____i; \ + for (_____i = 0; _____i < pg->index; _____i++) { \ + rec = &pg->records[_____i]; + +#define while_for_each_ftrace_rec() \ + } \ + } #ifdef CONFIG_KPROBES @@ -349,23 +355,16 @@ void ftrace_release(void *start, unsigne struct ftrace_page *pg; unsigned long s = (unsigned long)start; unsigned long e = s + size; - int i; if (ftrace_disabled || !start) return; - /* should not be called from interrupt context */ - spin_lock(&ftrace_lock); - - for (pg = ftrace_pages_start; pg; pg = pg->next) { - for (i = 0; i < pg->index; i++) { - rec = &pg->records[i]; - - if ((rec->ip >= s) && (rec->ip < e)) - ftrace_free_rec(rec); - } - } - spin_unlock(&ftrace_lock); + mutex_lock(&ftrace_lock); + do_for_each_ftrace_rec(pg, rec) { + if ((rec->ip >= s) && (rec->ip < e)) + ftrace_free_rec(rec); + } while_for_each_ftrace_rec(); + mutex_unlock(&ftrace_lock); } static struct dyn_ftrace *ftrace_alloc_dyn_node(unsigned long ip) @@ -461,10 +460,10 @@ static void ftrace_bug(int failed, unsig static int __ftrace_replace_code(struct dyn_ftrace *rec, int enable) { - unsigned long ip, fl; unsigned long ftrace_addr; + unsigned long ip, fl; - ftrace_addr = (unsigned long)ftrace_caller; + ftrace_addr = (unsigned long)FTRACE_ADDR; ip = rec->ip; @@ -473,7 +472,7 @@ __ftrace_replace_code(struct dyn_ftrace * it is not enabled then do nothing. * * If this record is not to be traced and - * it is enabled then disabled it. + * it is enabled then disable it. * */ if (rec->flags & FTRACE_FL_NOTRACE) { @@ -493,7 +492,7 @@ __ftrace_replace_code(struct dyn_ftrace if (fl == (FTRACE_FL_FILTER | FTRACE_FL_ENABLED)) return 0; - /* Record is not filtered and is not enabled do nothing */ + /* Record is not filtered or enabled, do nothing */ if (!fl) return 0; @@ -515,7 +514,7 @@ __ftrace_replace_code(struct dyn_ftrace } else { - /* if record is not enabled do nothing */ + /* if record is not enabled, do nothing */ if (!(rec->flags & FTRACE_FL_ENABLED)) return 0; @@ -531,41 +530,40 @@ __ftrace_replace_code(struct dyn_ftrace static void ftrace_replace_code(int enable) { - int i, failed; struct dyn_ftrace *rec; struct ftrace_page *pg; + int failed; - for (pg = ftrace_pages_start; pg; pg = pg->next) { - for (i = 0; i < pg->index; i++) { - rec = &pg->records[i]; - - /* - * Skip over free records and records that have - * failed. - */ - if (rec->flags & FTRACE_FL_FREE || - rec->flags & FTRACE_FL_FAILED) - continue; + do_for_each_ftrace_rec(pg, rec) { + /* + * Skip over free records and records that have + * failed. + */ + if (rec->flags & FTRACE_FL_FREE || + rec->flags & FTRACE_FL_FAILED) + continue; - /* ignore updates to this record's mcount site */ - if (get_kprobe((void *)rec->ip)) { - freeze_record(rec); - continue; - } else { - unfreeze_record(rec); - } + /* ignore updates to this record's mcount site */ + if (get_kprobe((void *)rec->ip)) { + freeze_record(rec); + continue; + } else { + unfreeze_record(rec); + } - failed = __ftrace_replace_code(rec, enable); - if (failed && (rec->flags & FTRACE_FL_CONVERTED)) { - rec->flags |= FTRACE_FL_FAILED; - if ((system_state == SYSTEM_BOOTING) || - !core_kernel_text(rec->ip)) { - ftrace_free_rec(rec); - } else - ftrace_bug(failed, rec->ip); - } + failed = __ftrace_replace_code(rec, enable); + if (failed && (rec->flags & FTRACE_FL_CONVERTED)) { + rec->flags |= FTRACE_FL_FAILED; + if ((system_state == SYSTEM_BOOTING) || + !core_kernel_text(rec->ip)) { + ftrace_free_rec(rec); + } else { + ftrace_bug(failed, rec->ip); + /* Stop processing */ + return; + } } - } + } while_for_each_ftrace_rec(); } static int @@ -576,7 +574,7 @@ ftrace_code_disable(struct module *mod, ip = rec->ip; - ret = ftrace_make_nop(mod, rec, mcount_addr); + ret = ftrace_make_nop(mod, rec, MCOUNT_ADDR); if (ret) { ftrace_bug(ret, ip); rec->flags |= FTRACE_FL_FAILED; @@ -585,6 +583,24 @@ ftrace_code_disable(struct module *mod, return 1; } +/* + * archs can override this function if they must do something + * before the modifying code is performed. + */ +int __weak ftrace_arch_code_modify_prepare(void) +{ + return 0; +} + +/* + * archs can override this function if they must do something + * after the modifying code is performed. + */ +int __weak ftrace_arch_code_modify_post_process(void) +{ + return 0; +} + static int __ftrace_modify_code(void *data) { int *command = data; @@ -607,7 +623,17 @@ static int __ftrace_modify_code(void *da static void ftrace_run_update_code(int command) { + int ret; + + ret = ftrace_arch_code_modify_prepare(); + FTRACE_WARN_ON(ret); + if (ret) + return; + stop_machine(__ftrace_modify_code, &command, NULL); + + ret = ftrace_arch_code_modify_post_process(); + FTRACE_WARN_ON(ret); } static ftrace_func_t saved_ftrace_func; @@ -631,13 +657,10 @@ static void ftrace_startup(int command) if (unlikely(ftrace_disabled)) return; - mutex_lock(&ftrace_start_lock); ftrace_start_up++; command |= FTRACE_ENABLE_CALLS; ftrace_startup_enable(command); - - mutex_unlock(&ftrace_start_lock); } static void ftrace_shutdown(int command) @@ -645,7 +668,6 @@ static void ftrace_shutdown(int command) if (unlikely(ftrace_disabled)) return; - mutex_lock(&ftrace_start_lock); ftrace_start_up--; if (!ftrace_start_up) command |= FTRACE_DISABLE_CALLS; @@ -656,11 +678,9 @@ static void ftrace_shutdown(int command) } if (!command || !ftrace_enabled) - goto out; + return; ftrace_run_update_code(command); - out: - mutex_unlock(&ftrace_start_lock); } static void ftrace_startup_sysctl(void) @@ -670,7 +690,6 @@ static void ftrace_startup_sysctl(void) if (unlikely(ftrace_disabled)) return; - mutex_lock(&ftrace_start_lock); /* Force update next time */ saved_ftrace_func = NULL; /* ftrace_start_up is true if we want ftrace running */ @@ -678,7 +697,6 @@ static void ftrace_startup_sysctl(void) command |= FTRACE_ENABLE_CALLS; ftrace_run_update_code(command); - mutex_unlock(&ftrace_start_lock); } static void ftrace_shutdown_sysctl(void) @@ -688,13 +706,11 @@ static void ftrace_shutdown_sysctl(void) if (unlikely(ftrace_disabled)) return; - mutex_lock(&ftrace_start_lock); /* ftrace_start_up is true if ftrace is running */ if (ftrace_start_up) command |= FTRACE_DISABLE_CALLS; ftrace_run_update_code(command); - mutex_unlock(&ftrace_start_lock); } static cycle_t ftrace_update_time; @@ -781,13 +797,16 @@ enum { FTRACE_ITER_CONT = (1 << 1), FTRACE_ITER_NOTRACE = (1 << 2), FTRACE_ITER_FAILURES = (1 << 3), + FTRACE_ITER_PRINTALL = (1 << 4), + FTRACE_ITER_HASH = (1 << 5), }; #define FTRACE_BUFF_MAX (KSYM_SYMBOL_LEN+4) /* room for wildcards */ struct ftrace_iterator { struct ftrace_page *pg; - unsigned idx; + int hidx; + int idx; unsigned flags; unsigned char buffer[FTRACE_BUFF_MAX+1]; unsigned buffer_idx; @@ -795,15 +814,89 @@ struct ftrace_iterator { }; static void * +t_hash_next(struct seq_file *m, void *v, loff_t *pos) +{ + struct ftrace_iterator *iter = m->private; + struct hlist_node *hnd = v; + struct hlist_head *hhd; + + WARN_ON(!(iter->flags & FTRACE_ITER_HASH)); + + (*pos)++; + + retry: + if (iter->hidx >= FTRACE_FUNC_HASHSIZE) + return NULL; + + hhd = &ftrace_func_hash[iter->hidx]; + + if (hlist_empty(hhd)) { + iter->hidx++; + hnd = NULL; + goto retry; + } + + if (!hnd) + hnd = hhd->first; + else { + hnd = hnd->next; + if (!hnd) { + iter->hidx++; + goto retry; + } + } + + return hnd; +} + +static void *t_hash_start(struct seq_file *m, loff_t *pos) +{ + struct ftrace_iterator *iter = m->private; + void *p = NULL; + + iter->flags |= FTRACE_ITER_HASH; + + return t_hash_next(m, p, pos); +} + +static int t_hash_show(struct seq_file *m, void *v) +{ + struct ftrace_func_probe *rec; + struct hlist_node *hnd = v; + char str[KSYM_SYMBOL_LEN]; + + rec = hlist_entry(hnd, struct ftrace_func_probe, node); + + if (rec->ops->print) + return rec->ops->print(m, rec->ip, rec->ops, rec->data); + + kallsyms_lookup(rec->ip, NULL, NULL, NULL, str); + seq_printf(m, "%s:", str); + + kallsyms_lookup((unsigned long)rec->ops->func, NULL, NULL, NULL, str); + seq_printf(m, "%s", str); + + if (rec->data) + seq_printf(m, ":%p", rec->data); + seq_putc(m, '\n'); + + return 0; +} + +static void * t_next(struct seq_file *m, void *v, loff_t *pos) { struct ftrace_iterator *iter = m->private; struct dyn_ftrace *rec = NULL; + if (iter->flags & FTRACE_ITER_HASH) + return t_hash_next(m, v, pos); + (*pos)++; - /* should not be called from interrupt context */ - spin_lock(&ftrace_lock); + if (iter->flags & FTRACE_ITER_PRINTALL) + return NULL; + retry: if (iter->idx >= iter->pg->index) { if (iter->pg->next) { @@ -832,7 +925,6 @@ t_next(struct seq_file *m, void *v, loff goto retry; } } - spin_unlock(&ftrace_lock); return rec; } @@ -842,6 +934,23 @@ static void *t_start(struct seq_file *m, struct ftrace_iterator *iter = m->private; void *p = NULL; + mutex_lock(&ftrace_lock); + /* + * For set_ftrace_filter reading, if we have the filter + * off, we can short cut and just print out that all + * functions are enabled. + */ + if (iter->flags & FTRACE_ITER_FILTER && !ftrace_filtered) { + if (*pos > 0) + return t_hash_start(m, pos); + iter->flags |= FTRACE_ITER_PRINTALL; + (*pos)++; + return iter; + } + + if (iter->flags & FTRACE_ITER_HASH) + return t_hash_start(m, pos); + if (*pos > 0) { if (iter->idx < 0) return p; @@ -851,18 +960,31 @@ static void *t_start(struct seq_file *m, p = t_next(m, p, pos); + if (!p) + return t_hash_start(m, pos); + return p; } static void t_stop(struct seq_file *m, void *p) { + mutex_unlock(&ftrace_lock); } static int t_show(struct seq_file *m, void *v) { + struct ftrace_iterator *iter = m->private; struct dyn_ftrace *rec = v; char str[KSYM_SYMBOL_LEN]; + if (iter->flags & FTRACE_ITER_HASH) + return t_hash_show(m, v); + + if (iter->flags & FTRACE_ITER_PRINTALL) { + seq_printf(m, "#### all functions enabled ####\n"); + return 0; + } + if (!rec) return 0; @@ -941,23 +1063,16 @@ static void ftrace_filter_reset(int enab struct ftrace_page *pg; struct dyn_ftrace *rec; unsigned long type = enable ? FTRACE_FL_FILTER : FTRACE_FL_NOTRACE; - unsigned i; - /* should not be called from interrupt context */ - spin_lock(&ftrace_lock); + mutex_lock(&ftrace_lock); if (enable) ftrace_filtered = 0; - pg = ftrace_pages_start; - while (pg) { - for (i = 0; i < pg->index; i++) { - rec = &pg->records[i]; - if (rec->flags & FTRACE_FL_FAILED) - continue; - rec->flags &= ~type; - } - pg = pg->next; - } - spin_unlock(&ftrace_lock); + do_for_each_ftrace_rec(pg, rec) { + if (rec->flags & FTRACE_FL_FAILED) + continue; + rec->flags &= ~type; + } while_for_each_ftrace_rec(); + mutex_unlock(&ftrace_lock); } static int @@ -1038,86 +1153,536 @@ enum { MATCH_END_ONLY, }; -static void -ftrace_match(unsigned char *buff, int len, int enable) +/* + * (static function - no need for kernel doc) + * + * Pass in a buffer containing a glob and this function will + * set search to point to the search part of the buffer and + * return the type of search it is (see enum above). + * This does modify buff. + * + * Returns enum type. + * search returns the pointer to use for comparison. + * not returns 1 if buff started with a '!' + * 0 otherwise. + */ +static int +ftrace_setup_glob(char *buff, int len, char **search, int *not) { - char str[KSYM_SYMBOL_LEN]; - char *search = NULL; - struct ftrace_page *pg; - struct dyn_ftrace *rec; int type = MATCH_FULL; - unsigned long flag = enable ? FTRACE_FL_FILTER : FTRACE_FL_NOTRACE; - unsigned i, match = 0, search_len = 0; - int not = 0; + int i; if (buff[0] == '!') { - not = 1; + *not = 1; buff++; len--; - } + } else + *not = 0; + + *search = buff; for (i = 0; i < len; i++) { if (buff[i] == '*') { if (!i) { - search = buff + i + 1; + *search = buff + 1; type = MATCH_END_ONLY; - search_len = len - (i + 1); } else { - if (type == MATCH_END_ONLY) { + if (type == MATCH_END_ONLY) type = MATCH_MIDDLE_ONLY; - } else { - match = i; + else type = MATCH_FRONT_ONLY; - } buff[i] = 0; break; } } } - /* should not be called from interrupt context */ - spin_lock(&ftrace_lock); - if (enable) - ftrace_filtered = 1; - pg = ftrace_pages_start; - while (pg) { - for (i = 0; i < pg->index; i++) { - int matched = 0; - char *ptr; + return type; +} + +static int ftrace_match(char *str, char *regex, int len, int type) +{ + int matched = 0; + char *ptr; + + switch (type) { + case MATCH_FULL: + if (strcmp(str, regex) == 0) + matched = 1; + break; + case MATCH_FRONT_ONLY: + if (strncmp(str, regex, len) == 0) + matched = 1; + break; + case MATCH_MIDDLE_ONLY: + if (strstr(str, regex)) + matched = 1; + break; + case MATCH_END_ONLY: + ptr = strstr(str, regex); + if (ptr && (ptr[len] == 0)) + matched = 1; + break; + } + + return matched; +} + +static int +ftrace_match_record(struct dyn_ftrace *rec, char *regex, int len, int type) +{ + char str[KSYM_SYMBOL_LEN]; + + kallsyms_lookup(rec->ip, NULL, NULL, NULL, str); + return ftrace_match(str, regex, len, type); +} + +static void ftrace_match_records(char *buff, int len, int enable) +{ + unsigned int search_len; + struct ftrace_page *pg; + struct dyn_ftrace *rec; + unsigned long flag; + char *search; + int type; + int not; + + flag = enable ? FTRACE_FL_FILTER : FTRACE_FL_NOTRACE; + type = ftrace_setup_glob(buff, len, &search, ¬); + + search_len = strlen(search); + + mutex_lock(&ftrace_lock); + do_for_each_ftrace_rec(pg, rec) { + + if (rec->flags & FTRACE_FL_FAILED) + continue; + + if (ftrace_match_record(rec, search, search_len, type)) { + if (not) + rec->flags &= ~flag; + else + rec->flags |= flag; + } + /* + * Only enable filtering if we have a function that + * is filtered on. + */ + if (enable && (rec->flags & FTRACE_FL_FILTER)) + ftrace_filtered = 1; + } while_for_each_ftrace_rec(); + mutex_unlock(&ftrace_lock); +} + +static int +ftrace_match_module_record(struct dyn_ftrace *rec, char *mod, + char *regex, int len, int type) +{ + char str[KSYM_SYMBOL_LEN]; + char *modname; + + kallsyms_lookup(rec->ip, NULL, NULL, &modname, str); + + if (!modname || strcmp(modname, mod)) + return 0; + + /* blank search means to match all funcs in the mod */ + if (len) + return ftrace_match(str, regex, len, type); + else + return 1; +} + +static void ftrace_match_module_records(char *buff, char *mod, int enable) +{ + unsigned search_len = 0; + struct ftrace_page *pg; + struct dyn_ftrace *rec; + int type = MATCH_FULL; + char *search = buff; + unsigned long flag; + int not = 0; + + flag = enable ? FTRACE_FL_FILTER : FTRACE_FL_NOTRACE; + + /* blank or '*' mean the same */ + if (strcmp(buff, "*") == 0) + buff[0] = 0; + + /* handle the case of 'dont filter this module' */ + if (strcmp(buff, "!") == 0 || strcmp(buff, "!*") == 0) { + buff[0] = 0; + not = 1; + } + + if (strlen(buff)) { + type = ftrace_setup_glob(buff, strlen(buff), &search, ¬); + search_len = strlen(search); + } + + mutex_lock(&ftrace_lock); + do_for_each_ftrace_rec(pg, rec) { + + if (rec->flags & FTRACE_FL_FAILED) + continue; + + if (ftrace_match_module_record(rec, mod, + search, search_len, type)) { + if (not) + rec->flags &= ~flag; + else + rec->flags |= flag; + } + if (enable && (rec->flags & FTRACE_FL_FILTER)) + ftrace_filtered = 1; + + } while_for_each_ftrace_rec(); + mutex_unlock(&ftrace_lock); +} + +/* + * We register the module command as a template to show others how + * to register the a command as well. + */ + +static int +ftrace_mod_callback(char *func, char *cmd, char *param, int enable) +{ + char *mod; + + /* + * cmd == 'mod' because we only registered this func + * for the 'mod' ftrace_func_command. + * But if you register one func with multiple commands, + * you can tell which command was used by the cmd + * parameter. + */ + + /* we must have a module name */ + if (!param) + return -EINVAL; + + mod = strsep(¶m, ":"); + if (!strlen(mod)) + return -EINVAL; + + ftrace_match_module_records(func, mod, enable); + return 0; +} + +static struct ftrace_func_command ftrace_mod_cmd = { + .name = "mod", + .func = ftrace_mod_callback, +}; + +static int __init ftrace_mod_cmd_init(void) +{ + return register_ftrace_command(&ftrace_mod_cmd); +} +device_initcall(ftrace_mod_cmd_init); + +static void +function_trace_probe_call(unsigned long ip, unsigned long parent_ip) +{ + struct ftrace_func_probe *entry; + struct hlist_head *hhd; + struct hlist_node *n; + unsigned long key; + int resched; + + key = hash_long(ip, FTRACE_HASH_BITS); + + hhd = &ftrace_func_hash[key]; + + if (hlist_empty(hhd)) + return; + + /* + * Disable preemption for these calls to prevent a RCU grace + * period. This syncs the hash iteration and freeing of items + * on the hash. rcu_read_lock is too dangerous here. + */ + resched = ftrace_preempt_disable(); + hlist_for_each_entry_rcu(entry, n, hhd, node) { + if (entry->ip == ip) + entry->ops->func(ip, parent_ip, &entry->data); + } + ftrace_preempt_enable(resched); +} + +static struct ftrace_ops trace_probe_ops __read_mostly = +{ + .func = function_trace_probe_call, +}; + +static int ftrace_probe_registered; + +static void __enable_ftrace_function_probe(void) +{ + int i; + + if (ftrace_probe_registered) + return; + + for (i = 0; i < FTRACE_FUNC_HASHSIZE; i++) { + struct hlist_head *hhd = &ftrace_func_hash[i]; + if (hhd->first) + break; + } + /* Nothing registered? */ + if (i == FTRACE_FUNC_HASHSIZE) + return; + + __register_ftrace_function(&trace_probe_ops); + ftrace_startup(0); + ftrace_probe_registered = 1; +} + +static void __disable_ftrace_function_probe(void) +{ + int i; + + if (!ftrace_probe_registered) + return; + + for (i = 0; i < FTRACE_FUNC_HASHSIZE; i++) { + struct hlist_head *hhd = &ftrace_func_hash[i]; + if (hhd->first) + return; + } + + /* no more funcs left */ + __unregister_ftrace_function(&trace_probe_ops); + ftrace_shutdown(0); + ftrace_probe_registered = 0; +} + + +static void ftrace_free_entry_rcu(struct rcu_head *rhp) +{ + struct ftrace_func_probe *entry = + container_of(rhp, struct ftrace_func_probe, rcu); + + if (entry->ops->free) + entry->ops->free(&entry->data); + kfree(entry); +} + + +int +register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops, + void *data) +{ + struct ftrace_func_probe *entry; + struct ftrace_page *pg; + struct dyn_ftrace *rec; + int type, len, not; + unsigned long key; + int count = 0; + char *search; + + type = ftrace_setup_glob(glob, strlen(glob), &search, ¬); + len = strlen(search); + + /* we do not support '!' for function probes */ + if (WARN_ON(not)) + return -EINVAL; + + mutex_lock(&ftrace_lock); + do_for_each_ftrace_rec(pg, rec) { + + if (rec->flags & FTRACE_FL_FAILED) + continue; + + if (!ftrace_match_record(rec, search, len, type)) + continue; + + entry = kmalloc(sizeof(*entry), GFP_KERNEL); + if (!entry) { + /* If we did not process any, then return error */ + if (!count) + count = -ENOMEM; + goto out_unlock; + } - rec = &pg->records[i]; - if (rec->flags & FTRACE_FL_FAILED) + count++; + + entry->data = data; + + /* + * The caller might want to do something special + * for each function we find. We call the callback + * to give the caller an opportunity to do so. + */ + if (ops->callback) { + if (ops->callback(rec->ip, &entry->data) < 0) { + /* caller does not like this func */ + kfree(entry); continue; - kallsyms_lookup(rec->ip, NULL, NULL, NULL, str); - switch (type) { - case MATCH_FULL: - if (strcmp(str, buff) == 0) - matched = 1; - break; - case MATCH_FRONT_ONLY: - if (memcmp(str, buff, match) == 0) - matched = 1; - break; - case MATCH_MIDDLE_ONLY: - if (strstr(str, search)) - matched = 1; - break; - case MATCH_END_ONLY: - ptr = strstr(str, search); - if (ptr && (ptr[search_len] == 0)) - matched = 1; - break; } - if (matched) { - if (not) - rec->flags &= ~flag; - else - rec->flags |= flag; + } + + entry->ops = ops; + entry->ip = rec->ip; + + key = hash_long(entry->ip, FTRACE_HASH_BITS); + hlist_add_head_rcu(&entry->node, &ftrace_func_hash[key]); + + } while_for_each_ftrace_rec(); + __enable_ftrace_function_probe(); + + out_unlock: + mutex_unlock(&ftrace_lock); + + return count; +} + +enum { + PROBE_TEST_FUNC = 1, + PROBE_TEST_DATA = 2 +}; + +static void +__unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops, + void *data, int flags) +{ + struct ftrace_func_probe *entry; + struct hlist_node *n, *tmp; + char str[KSYM_SYMBOL_LEN]; + int type = MATCH_FULL; + int i, len = 0; + char *search; + + if (glob && (strcmp(glob, "*") || !strlen(glob))) + glob = NULL; + else { + int not; + + type = ftrace_setup_glob(glob, strlen(glob), &search, ¬); + len = strlen(search); + + /* we do not support '!' for function probes */ + if (WARN_ON(not)) + return; + } + + mutex_lock(&ftrace_lock); + for (i = 0; i < FTRACE_FUNC_HASHSIZE; i++) { + struct hlist_head *hhd = &ftrace_func_hash[i]; + + hlist_for_each_entry_safe(entry, n, tmp, hhd, node) { + + /* break up if statements for readability */ + if ((flags & PROBE_TEST_FUNC) && entry->ops != ops) + continue; + + if ((flags & PROBE_TEST_DATA) && entry->data != data) + continue; + + /* do this last, since it is the most expensive */ + if (glob) { + kallsyms_lookup(entry->ip, NULL, NULL, + NULL, str); + if (!ftrace_match(str, glob, len, type)) + continue; } + + hlist_del(&entry->node); + call_rcu(&entry->rcu, ftrace_free_entry_rcu); } - pg = pg->next; } - spin_unlock(&ftrace_lock); + __disable_ftrace_function_probe(); + mutex_unlock(&ftrace_lock); +} + +void +unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops, + void *data) +{ + __unregister_ftrace_function_probe(glob, ops, data, + PROBE_TEST_FUNC | PROBE_TEST_DATA); +} + +void +unregister_ftrace_function_probe_func(char *glob, struct ftrace_probe_ops *ops) +{ + __unregister_ftrace_function_probe(glob, ops, NULL, PROBE_TEST_FUNC); +} + +void unregister_ftrace_function_probe_all(char *glob) +{ + __unregister_ftrace_function_probe(glob, NULL, NULL, 0); +} + +static LIST_HEAD(ftrace_commands); +static DEFINE_MUTEX(ftrace_cmd_mutex); + +int register_ftrace_command(struct ftrace_func_command *cmd) +{ + struct ftrace_func_command *p; + int ret = 0; + + mutex_lock(&ftrace_cmd_mutex); + list_for_each_entry(p, &ftrace_commands, list) { + if (strcmp(cmd->name, p->name) == 0) { + ret = -EBUSY; + goto out_unlock; + } + } + list_add(&cmd->list, &ftrace_commands); + out_unlock: + mutex_unlock(&ftrace_cmd_mutex); + + return ret; +} + +int unregister_ftrace_command(struct ftrace_func_command *cmd) +{ + struct ftrace_func_command *p, *n; + int ret = -ENODEV; + + mutex_lock(&ftrace_cmd_mutex); + list_for_each_entry_safe(p, n, &ftrace_commands, list) { + if (strcmp(cmd->name, p->name) == 0) { + ret = 0; + list_del_init(&p->list); + goto out_unlock; + } + } + out_unlock: + mutex_unlock(&ftrace_cmd_mutex); + + return ret; +} + +static int ftrace_process_regex(char *buff, int len, int enable) +{ + char *func, *command, *next = buff; + struct ftrace_func_command *p; + int ret = -EINVAL; + + func = strsep(&next, ":"); + + if (!next) { + ftrace_match_records(func, len, enable); + return 0; + } + + /* command found */ + + command = strsep(&next, ":"); + + mutex_lock(&ftrace_cmd_mutex); + list_for_each_entry(p, &ftrace_commands, list) { + if (strcmp(p->name, command) == 0) { + ret = p->func(func, command, next, enable); + goto out_unlock; + } + } + out_unlock: + mutex_unlock(&ftrace_cmd_mutex); + + return ret; } static ssize_t @@ -1187,7 +1752,10 @@ ftrace_regex_write(struct file *file, co if (isspace(ch)) { iter->filtered++; iter->buffer[iter->buffer_idx] = 0; - ftrace_match(iter->buffer, iter->buffer_idx, enable); + ret = ftrace_process_regex(iter->buffer, + iter->buffer_idx, enable); + if (ret) + goto out; iter->buffer_idx = 0; } else iter->flags |= FTRACE_ITER_CONT; @@ -1226,7 +1794,7 @@ ftrace_set_regex(unsigned char *buf, int if (reset) ftrace_filter_reset(enable); if (buf) - ftrace_match(buf, len, enable); + ftrace_match_records(buf, len, enable); mutex_unlock(&ftrace_regex_lock); } @@ -1276,15 +1844,13 @@ ftrace_regex_release(struct inode *inode if (iter->buffer_idx) { iter->filtered++; iter->buffer[iter->buffer_idx] = 0; - ftrace_match(iter->buffer, iter->buffer_idx, enable); + ftrace_match_records(iter->buffer, iter->buffer_idx, enable); } - mutex_lock(&ftrace_sysctl_lock); - mutex_lock(&ftrace_start_lock); + mutex_lock(&ftrace_lock); if (ftrace_start_up && ftrace_enabled) ftrace_run_update_code(FTRACE_ENABLE_CALLS); - mutex_unlock(&ftrace_start_lock); - mutex_unlock(&ftrace_sysctl_lock); + mutex_unlock(&ftrace_lock); kfree(iter); mutex_unlock(&ftrace_regex_lock); @@ -1360,6 +1926,10 @@ static void *g_start(struct seq_file *m, mutex_lock(&graph_lock); + /* Nothing, tell g_show to print all functions are enabled */ + if (!ftrace_graph_count && !*pos) + return (void *)1; + p = g_next(m, p, pos); return p; @@ -1378,6 +1948,11 @@ static int g_show(struct seq_file *m, vo if (!ptr) return 0; + if (ptr == (unsigned long *)1) { + seq_printf(m, "#### all functions enabled ####\n"); + return 0; + } + kallsyms_lookup(*ptr, NULL, NULL, NULL, str); seq_printf(m, "%s\n", str); @@ -1431,42 +2006,52 @@ ftrace_graph_read(struct file *file, cha } static int -ftrace_set_func(unsigned long *array, int idx, char *buffer) +ftrace_set_func(unsigned long *array, int *idx, char *buffer) { - char str[KSYM_SYMBOL_LEN]; struct dyn_ftrace *rec; struct ftrace_page *pg; + int search_len; int found = 0; - int i, j; + int type, not; + char *search; + bool exists; + int i; if (ftrace_disabled) return -ENODEV; - /* should not be called from interrupt context */ - spin_lock(&ftrace_lock); + /* decode regex */ + type = ftrace_setup_glob(buffer, strlen(buffer), &search, ¬); + if (not) + return -EINVAL; - for (pg = ftrace_pages_start; pg; pg = pg->next) { - for (i = 0; i < pg->index; i++) { - rec = &pg->records[i]; + search_len = strlen(search); - if (rec->flags & (FTRACE_FL_FAILED | FTRACE_FL_FREE)) - continue; + mutex_lock(&ftrace_lock); + do_for_each_ftrace_rec(pg, rec) { + + if (*idx >= FTRACE_GRAPH_MAX_FUNCS) + break; + + if (rec->flags & (FTRACE_FL_FAILED | FTRACE_FL_FREE)) + continue; - kallsyms_lookup(rec->ip, NULL, NULL, NULL, str); - if (strcmp(str, buffer) == 0) { + if (ftrace_match_record(rec, search, search_len, type)) { + /* ensure it is not already in the array */ + exists = false; + for (i = 0; i < *idx; i++) + if (array[i] == rec->ip) { + exists = true; + break; + } + if (!exists) { + array[(*idx)++] = rec->ip; found = 1; - for (j = 0; j < idx; j++) - if (array[j] == rec->ip) { - found = 0; - break; - } - if (found) - array[idx] = rec->ip; - break; } } - } - spin_unlock(&ftrace_lock); + } while_for_each_ftrace_rec(); + + mutex_unlock(&ftrace_lock); return found ? 0 : -EINVAL; } @@ -1534,13 +2119,11 @@ ftrace_graph_write(struct file *file, co } buffer[index] = 0; - /* we allow only one at a time */ - ret = ftrace_set_func(array, ftrace_graph_count, buffer); + /* we allow only one expression at a time */ + ret = ftrace_set_func(array, &ftrace_graph_count, buffer); if (ret) goto out; - ftrace_graph_count++; - file->f_pos += read; ret = read; @@ -1604,7 +2187,7 @@ static int ftrace_convert_nops(struct mo unsigned long addr; unsigned long flags; - mutex_lock(&ftrace_start_lock); + mutex_lock(&ftrace_lock); p = start; while (p < end) { addr = ftrace_call_adjust(*p++); @@ -1623,7 +2206,7 @@ static int ftrace_convert_nops(struct mo local_irq_save(flags); ftrace_update_code(mod); local_irq_restore(flags); - mutex_unlock(&ftrace_start_lock); + mutex_unlock(&ftrace_lock); return 0; } @@ -1796,7 +2379,7 @@ ftrace_pid_write(struct file *filp, cons if (ret < 0) return ret; - mutex_lock(&ftrace_start_lock); + mutex_lock(&ftrace_lock); if (val < 0) { /* disable pid tracing */ if (!ftrace_pid_trace) @@ -1835,7 +2418,7 @@ ftrace_pid_write(struct file *filp, cons ftrace_startup_enable(0); out: - mutex_unlock(&ftrace_start_lock); + mutex_unlock(&ftrace_lock); return cnt; } @@ -1863,7 +2446,6 @@ static __init int ftrace_init_debugfs(vo "'set_ftrace_pid' entry\n"); return 0; } - fs_initcall(ftrace_init_debugfs); /** @@ -1898,17 +2480,17 @@ int register_ftrace_function(struct ftra if (unlikely(ftrace_disabled)) return -1; - mutex_lock(&ftrace_sysctl_lock); + mutex_lock(&ftrace_lock); ret = __register_ftrace_function(ops); ftrace_startup(0); - mutex_unlock(&ftrace_sysctl_lock); + mutex_unlock(&ftrace_lock); return ret; } /** - * unregister_ftrace_function - unresgister a function for profiling. + * unregister_ftrace_function - unregister a function for profiling. * @ops - ops structure that holds the function to unregister * * Unregister a function that was added to be called by ftrace profiling. @@ -1917,10 +2499,10 @@ int unregister_ftrace_function(struct ft { int ret; - mutex_lock(&ftrace_sysctl_lock); + mutex_lock(&ftrace_lock); ret = __unregister_ftrace_function(ops); ftrace_shutdown(0); - mutex_unlock(&ftrace_sysctl_lock); + mutex_unlock(&ftrace_lock); return ret; } @@ -1935,7 +2517,7 @@ ftrace_enable_sysctl(struct ctl_table *t if (unlikely(ftrace_disabled)) return -ENODEV; - mutex_lock(&ftrace_sysctl_lock); + mutex_lock(&ftrace_lock); ret = proc_dointvec(table, write, file, buffer, lenp, ppos); @@ -1964,7 +2546,7 @@ ftrace_enable_sysctl(struct ctl_table *t } out: - mutex_unlock(&ftrace_sysctl_lock); + mutex_unlock(&ftrace_lock); return ret; } @@ -2080,7 +2662,7 @@ int register_ftrace_graph(trace_func_gra { int ret = 0; - mutex_lock(&ftrace_sysctl_lock); + mutex_lock(&ftrace_lock); ftrace_suspend_notifier.notifier_call = ftrace_suspend_notifier_call; register_pm_notifier(&ftrace_suspend_notifier); @@ -2098,13 +2680,13 @@ int register_ftrace_graph(trace_func_gra ftrace_startup(FTRACE_START_FUNC_RET); out: - mutex_unlock(&ftrace_sysctl_lock); + mutex_unlock(&ftrace_lock); return ret; } void unregister_ftrace_graph(void) { - mutex_lock(&ftrace_sysctl_lock); + mutex_lock(&ftrace_lock); atomic_dec(&ftrace_graph_active); ftrace_graph_return = (trace_func_graph_ret_t)ftrace_stub; @@ -2112,7 +2694,7 @@ void unregister_ftrace_graph(void) ftrace_shutdown(FTRACE_STOP_FUNC_RET); unregister_pm_notifier(&ftrace_suspend_notifier); - mutex_unlock(&ftrace_sysctl_lock); + mutex_unlock(&ftrace_lock); } /* Allocate a return stack for newly created task */ Index: linux-2.6-tip/kernel/trace/kmemtrace.c =================================================================== --- /dev/null +++ linux-2.6-tip/kernel/trace/kmemtrace.c @@ -0,0 +1,339 @@ +/* + * Memory allocator tracing + * + * Copyright (C) 2008 Eduard - Gabriel Munteanu + * Copyright (C) 2008 Pekka Enberg + * Copyright (C) 2008 Frederic Weisbecker + */ + +#include +#include +#include +#include +#include + +#include "trace.h" +#include "trace_output.h" + +/* Select an alternative, minimalistic output than the original one */ +#define TRACE_KMEM_OPT_MINIMAL 0x1 + +static struct tracer_opt kmem_opts[] = { + /* Default disable the minimalistic output */ + { TRACER_OPT(kmem_minimalistic, TRACE_KMEM_OPT_MINIMAL) }, + { } +}; + +static struct tracer_flags kmem_tracer_flags = { + .val = 0, + .opts = kmem_opts +}; + + +static bool kmem_tracing_enabled __read_mostly; +static struct trace_array *kmemtrace_array; + +static int kmem_trace_init(struct trace_array *tr) +{ + int cpu; + kmemtrace_array = tr; + + for_each_cpu_mask(cpu, cpu_possible_map) + tracing_reset(tr, cpu); + + kmem_tracing_enabled = true; + + return 0; +} + +static void kmem_trace_reset(struct trace_array *tr) +{ + kmem_tracing_enabled = false; +} + +static void kmemtrace_headers(struct seq_file *s) +{ + /* Don't need headers for the original kmemtrace output */ + if (!(kmem_tracer_flags.val & TRACE_KMEM_OPT_MINIMAL)) + return; + + seq_printf(s, "#\n"); + seq_printf(s, "# ALLOC TYPE REQ GIVEN FLAGS " + " POINTER NODE CALLER\n"); + seq_printf(s, "# FREE | | | | " + " | | | |\n"); + seq_printf(s, "# |\n\n"); +} + +/* + * The two following functions give the original output from kmemtrace, + * or something close to....perhaps they need some missing things + */ +static enum print_line_t +kmemtrace_print_alloc_original(struct trace_iterator *iter, + struct kmemtrace_alloc_entry *entry) +{ + struct trace_seq *s = &iter->seq; + int ret; + + /* Taken from the old linux/kmemtrace.h */ + ret = trace_seq_printf(s, "type_id %d call_site %lu ptr %lu " + "bytes_req %lu bytes_alloc %lu gfp_flags %lu node %d\n", + entry->type_id, entry->call_site, (unsigned long) entry->ptr, + (unsigned long) entry->bytes_req, (unsigned long) entry->bytes_alloc, + (unsigned long) entry->gfp_flags, entry->node); + + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + return TRACE_TYPE_HANDLED; +} + +static enum print_line_t +kmemtrace_print_free_original(struct trace_iterator *iter, + struct kmemtrace_free_entry *entry) +{ + struct trace_seq *s = &iter->seq; + int ret; + + /* Taken from the old linux/kmemtrace.h */ + ret = trace_seq_printf(s, "type_id %d call_site %lu ptr %lu\n", + entry->type_id, entry->call_site, (unsigned long) entry->ptr); + + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + return TRACE_TYPE_HANDLED; +} + + +/* The two other following provide a more minimalistic output */ +static enum print_line_t +kmemtrace_print_alloc_compress(struct trace_iterator *iter, + struct kmemtrace_alloc_entry *entry) +{ + struct trace_seq *s = &iter->seq; + int ret; + + /* Alloc entry */ + ret = trace_seq_printf(s, " + "); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + /* Type */ + switch (entry->type_id) { + case KMEMTRACE_TYPE_KMALLOC: + ret = trace_seq_printf(s, "K "); + break; + case KMEMTRACE_TYPE_CACHE: + ret = trace_seq_printf(s, "C "); + break; + case KMEMTRACE_TYPE_PAGES: + ret = trace_seq_printf(s, "P "); + break; + default: + ret = trace_seq_printf(s, "? "); + } + + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + /* Requested */ + ret = trace_seq_printf(s, "%4zu ", entry->bytes_req); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + /* Allocated */ + ret = trace_seq_printf(s, "%4zu ", entry->bytes_alloc); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + /* Flags + * TODO: would be better to see the name of the GFP flag names + */ + ret = trace_seq_printf(s, "%08x ", entry->gfp_flags); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + /* Pointer to allocated */ + ret = trace_seq_printf(s, "0x%tx ", (ptrdiff_t)entry->ptr); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + /* Node */ + ret = trace_seq_printf(s, "%4d ", entry->node); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + /* Call site */ + ret = seq_print_ip_sym(s, entry->call_site, 0); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + if (!trace_seq_printf(s, "\n")) + return TRACE_TYPE_PARTIAL_LINE; + + return TRACE_TYPE_HANDLED; +} + +static enum print_line_t +kmemtrace_print_free_compress(struct trace_iterator *iter, + struct kmemtrace_free_entry *entry) +{ + struct trace_seq *s = &iter->seq; + int ret; + + /* Free entry */ + ret = trace_seq_printf(s, " - "); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + /* Type */ + switch (entry->type_id) { + case KMEMTRACE_TYPE_KMALLOC: + ret = trace_seq_printf(s, "K "); + break; + case KMEMTRACE_TYPE_CACHE: + ret = trace_seq_printf(s, "C "); + break; + case KMEMTRACE_TYPE_PAGES: + ret = trace_seq_printf(s, "P "); + break; + default: + ret = trace_seq_printf(s, "? "); + } + + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + /* Skip requested/allocated/flags */ + ret = trace_seq_printf(s, " "); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + /* Pointer to allocated */ + ret = trace_seq_printf(s, "0x%tx ", (ptrdiff_t)entry->ptr); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + /* Skip node */ + ret = trace_seq_printf(s, " "); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + /* Call site */ + ret = seq_print_ip_sym(s, entry->call_site, 0); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + if (!trace_seq_printf(s, "\n")) + return TRACE_TYPE_PARTIAL_LINE; + + return TRACE_TYPE_HANDLED; +} + +static enum print_line_t kmemtrace_print_line(struct trace_iterator *iter) +{ + struct trace_entry *entry = iter->ent; + + switch (entry->type) { + case TRACE_KMEM_ALLOC: { + struct kmemtrace_alloc_entry *field; + trace_assign_type(field, entry); + if (kmem_tracer_flags.val & TRACE_KMEM_OPT_MINIMAL) + return kmemtrace_print_alloc_compress(iter, field); + else + return kmemtrace_print_alloc_original(iter, field); + } + + case TRACE_KMEM_FREE: { + struct kmemtrace_free_entry *field; + trace_assign_type(field, entry); + if (kmem_tracer_flags.val & TRACE_KMEM_OPT_MINIMAL) + return kmemtrace_print_free_compress(iter, field); + else + return kmemtrace_print_free_original(iter, field); + } + + default: + return TRACE_TYPE_UNHANDLED; + } +} + +/* Trace allocations */ +void kmemtrace_mark_alloc_node(enum kmemtrace_type_id type_id, + unsigned long call_site, + const void *ptr, + size_t bytes_req, + size_t bytes_alloc, + gfp_t gfp_flags, + int node) +{ + struct ring_buffer_event *event; + struct kmemtrace_alloc_entry *entry; + struct trace_array *tr = kmemtrace_array; + + if (!kmem_tracing_enabled) + return; + + event = trace_buffer_lock_reserve(tr, TRACE_KMEM_ALLOC, + sizeof(*entry), 0, 0); + if (!event) + return; + entry = ring_buffer_event_data(event); + + entry->call_site = call_site; + entry->ptr = ptr; + entry->bytes_req = bytes_req; + entry->bytes_alloc = bytes_alloc; + entry->gfp_flags = gfp_flags; + entry->node = node; + + trace_buffer_unlock_commit(tr, event, 0, 0); +} +EXPORT_SYMBOL(kmemtrace_mark_alloc_node); + +void kmemtrace_mark_free(enum kmemtrace_type_id type_id, + unsigned long call_site, + const void *ptr) +{ + struct ring_buffer_event *event; + struct kmemtrace_free_entry *entry; + struct trace_array *tr = kmemtrace_array; + + if (!kmem_tracing_enabled) + return; + + event = trace_buffer_lock_reserve(tr, TRACE_KMEM_FREE, + sizeof(*entry), 0, 0); + if (!event) + return; + entry = ring_buffer_event_data(event); + entry->type_id = type_id; + entry->call_site = call_site; + entry->ptr = ptr; + + trace_buffer_unlock_commit(tr, event, 0, 0); +} +EXPORT_SYMBOL(kmemtrace_mark_free); + +static struct tracer kmem_tracer __read_mostly = { + .name = "kmemtrace", + .init = kmem_trace_init, + .reset = kmem_trace_reset, + .print_line = kmemtrace_print_line, + .print_header = kmemtrace_headers, + .flags = &kmem_tracer_flags +}; + +void kmemtrace_init(void) +{ + /* earliest opportunity to start kmem tracing */ +} + +static int __init init_kmem_tracer(void) +{ + return register_tracer(&kmem_tracer); +} + +device_initcall(init_kmem_tracer); Index: linux-2.6-tip/kernel/trace/ring_buffer.c =================================================================== --- linux-2.6-tip.orig/kernel/trace/ring_buffer.c +++ linux-2.6-tip/kernel/trace/ring_buffer.c @@ -4,9 +4,11 @@ * Copyright (C) 2008 Steven Rostedt */ #include +#include #include #include #include +#include #include #include #include @@ -57,7 +59,7 @@ enum { RB_BUFFERS_DISABLED = 1 << RB_BUFFERS_DISABLED_BIT, }; -static long ring_buffer_flags __read_mostly = RB_BUFFERS_ON; +static unsigned long ring_buffer_flags __read_mostly = RB_BUFFERS_ON; /** * tracing_on - enable all tracing buffers @@ -89,13 +91,22 @@ EXPORT_SYMBOL_GPL(tracing_off); * tracing_off_permanent - permanently disable ring buffers * * This function, once called, will disable all ring buffers - * permanenty. + * permanently. */ void tracing_off_permanent(void) { set_bit(RB_BUFFERS_DISABLED_BIT, &ring_buffer_flags); } +/** + * tracing_is_on - show state of ring buffers enabled + */ +int tracing_is_on(void) +{ + return ring_buffer_flags == RB_BUFFERS_ON; +} +EXPORT_SYMBOL_GPL(tracing_is_on); + #include "trace.h" /* Up this if you want to test the TIME_EXTENTS and normalization */ @@ -123,8 +134,7 @@ void ring_buffer_normalize_time_stamp(in EXPORT_SYMBOL_GPL(ring_buffer_normalize_time_stamp); #define RB_EVNT_HDR_SIZE (sizeof(struct ring_buffer_event)) -#define RB_ALIGNMENT_SHIFT 2 -#define RB_ALIGNMENT (1 << RB_ALIGNMENT_SHIFT) +#define RB_ALIGNMENT 4U #define RB_MAX_SMALL_DATA 28 enum { @@ -133,7 +143,7 @@ enum { }; /* inline for ring buffer fast paths */ -static inline unsigned +static unsigned rb_event_length(struct ring_buffer_event *event) { unsigned length; @@ -151,7 +161,7 @@ rb_event_length(struct ring_buffer_event case RINGBUF_TYPE_DATA: if (event->len) - length = event->len << RB_ALIGNMENT_SHIFT; + length = event->len * RB_ALIGNMENT; else length = event->array[0]; return length + RB_EVNT_HDR_SIZE; @@ -179,7 +189,7 @@ unsigned ring_buffer_event_length(struct EXPORT_SYMBOL_GPL(ring_buffer_event_length); /* inline for ring buffer fast paths */ -static inline void * +static void * rb_event_data(struct ring_buffer_event *event) { BUG_ON(event->type != RINGBUF_TYPE_DATA); @@ -209,7 +219,7 @@ EXPORT_SYMBOL_GPL(ring_buffer_event_data struct buffer_data_page { u64 time_stamp; /* page time stamp */ - local_t commit; /* write commited index */ + local_t commit; /* write committed index */ unsigned char data[]; /* data of buffer page */ }; @@ -229,10 +239,9 @@ static void rb_init_page(struct buffer_d * Also stolen from mm/slob.c. Thanks to Mathieu Desnoyers for pointing * this issue out. */ -static inline void free_buffer_page(struct buffer_page *bpage) +static void free_buffer_page(struct buffer_page *bpage) { - if (bpage->page) - free_page((unsigned long)bpage->page); + free_page((unsigned long)bpage->page); kfree(bpage); } @@ -254,13 +263,13 @@ static inline int test_time_stamp(u64 de struct ring_buffer_per_cpu { int cpu; struct ring_buffer *buffer; - spinlock_t reader_lock; /* serialize readers */ - raw_spinlock_t lock; + raw_spinlock_t reader_lock; /* serialize readers */ + __raw_spinlock_t lock; struct lock_class_key lock_key; struct list_head pages; struct buffer_page *head_page; /* read from head */ struct buffer_page *tail_page; /* write to tail */ - struct buffer_page *commit_page; /* commited pages */ + struct buffer_page *commit_page; /* committed pages */ struct buffer_page *reader_page; unsigned long overrun; unsigned long entries; @@ -273,8 +282,8 @@ struct ring_buffer { unsigned pages; unsigned flags; int cpus; - cpumask_var_t cpumask; atomic_t record_disabled; + cpumask_var_t cpumask; struct mutex mutex; @@ -303,7 +312,7 @@ struct ring_buffer_iter { * check_pages - integrity check of buffer pages * @cpu_buffer: CPU buffer with pages to test * - * As a safty measure we check to make sure the data pages have not + * As a safety measure we check to make sure the data pages have not * been corrupted. */ static int rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer) @@ -381,7 +390,7 @@ rb_allocate_cpu_buffer(struct ring_buffe cpu_buffer->cpu = cpu; cpu_buffer->buffer = buffer; spin_lock_init(&cpu_buffer->reader_lock); - cpu_buffer->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; + cpu_buffer->lock = (__raw_spinlock_t) __RAW_SPIN_LOCK_UNLOCKED; INIT_LIST_HEAD(&cpu_buffer->pages); bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()), @@ -811,7 +820,7 @@ rb_event_index(struct ring_buffer_event return (addr & ~PAGE_MASK) - (PAGE_SIZE - BUF_PAGE_SIZE); } -static inline int +static int rb_is_commit(struct ring_buffer_per_cpu *cpu_buffer, struct ring_buffer_event *event) { @@ -825,7 +834,7 @@ rb_is_commit(struct ring_buffer_per_cpu rb_commit_index(cpu_buffer) == index; } -static inline void +static void rb_set_commit_event(struct ring_buffer_per_cpu *cpu_buffer, struct ring_buffer_event *event) { @@ -850,7 +859,7 @@ rb_set_commit_event(struct ring_buffer_p local_set(&cpu_buffer->commit_page->page->commit, index); } -static inline void +static void rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer) { /* @@ -896,7 +905,7 @@ static void rb_reset_reader_page(struct cpu_buffer->reader_page->read = 0; } -static inline void rb_inc_iter(struct ring_buffer_iter *iter) +static void rb_inc_iter(struct ring_buffer_iter *iter) { struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer; @@ -926,7 +935,7 @@ static inline void rb_inc_iter(struct ri * and with this, we can determine what to place into the * data field. */ -static inline void +static void rb_update_event(struct ring_buffer_event *event, unsigned type, unsigned length) { @@ -938,15 +947,11 @@ rb_update_event(struct ring_buffer_event break; case RINGBUF_TYPE_TIME_EXTEND: - event->len = - (RB_LEN_TIME_EXTEND + (RB_ALIGNMENT-1)) - >> RB_ALIGNMENT_SHIFT; + event->len = DIV_ROUND_UP(RB_LEN_TIME_EXTEND, RB_ALIGNMENT); break; case RINGBUF_TYPE_TIME_STAMP: - event->len = - (RB_LEN_TIME_STAMP + (RB_ALIGNMENT-1)) - >> RB_ALIGNMENT_SHIFT; + event->len = DIV_ROUND_UP(RB_LEN_TIME_STAMP, RB_ALIGNMENT); break; case RINGBUF_TYPE_DATA: @@ -955,16 +960,14 @@ rb_update_event(struct ring_buffer_event event->len = 0; event->array[0] = length; } else - event->len = - (length + (RB_ALIGNMENT-1)) - >> RB_ALIGNMENT_SHIFT; + event->len = DIV_ROUND_UP(length, RB_ALIGNMENT); break; default: BUG(); } } -static inline unsigned rb_calculate_event_length(unsigned length) +static unsigned rb_calculate_event_length(unsigned length) { struct ring_buffer_event event; /* Used only for sizeof array */ @@ -990,6 +993,7 @@ __rb_reserve_next(struct ring_buffer_per struct ring_buffer *buffer = cpu_buffer->buffer; struct ring_buffer_event *event; unsigned long flags; + bool lock_taken = false; commit_page = cpu_buffer->commit_page; /* we just need to protect against interrupts */ @@ -1003,7 +1007,30 @@ __rb_reserve_next(struct ring_buffer_per struct buffer_page *next_page = tail_page; local_irq_save(flags); - __raw_spin_lock(&cpu_buffer->lock); + /* + * Since the write to the buffer is still not + * fully lockless, we must be careful with NMIs. + * The locks in the writers are taken when a write + * crosses to a new page. The locks protect against + * races with the readers (this will soon be fixed + * with a lockless solution). + * + * Because we can not protect against NMIs, and we + * want to keep traces reentrant, we need to manage + * what happens when we are in an NMI. + * + * NMIs can happen after we take the lock. + * If we are in an NMI, only take the lock + * if it is not already taken. Otherwise + * simply fail. + */ + if (unlikely(in_nmi())) { + if (!__raw_spin_trylock(&cpu_buffer->lock)) + goto out_reset; + } else + __raw_spin_lock(&cpu_buffer->lock); + + lock_taken = true; rb_inc_page(cpu_buffer, &next_page); @@ -1012,7 +1039,7 @@ __rb_reserve_next(struct ring_buffer_per /* we grabbed the lock before incrementing */ if (RB_WARN_ON(cpu_buffer, next_page == reader_page)) - goto out_unlock; + goto out_reset; /* * If for some reason, we had an interrupt storm that made @@ -1021,12 +1048,12 @@ __rb_reserve_next(struct ring_buffer_per */ if (unlikely(next_page == commit_page)) { WARN_ON_ONCE(1); - goto out_unlock; + goto out_reset; } if (next_page == head_page) { if (!(buffer->flags & RB_FL_OVERWRITE)) - goto out_unlock; + goto out_reset; /* tail_page has not moved yet? */ if (tail_page == cpu_buffer->tail_page) { @@ -1100,12 +1127,13 @@ __rb_reserve_next(struct ring_buffer_per return event; - out_unlock: + out_reset: /* reset write */ if (tail <= BUF_PAGE_SIZE) local_set(&tail_page->write, tail); - __raw_spin_unlock(&cpu_buffer->lock); + if (likely(lock_taken)) + __raw_spin_unlock(&cpu_buffer->lock); local_irq_restore(flags); return NULL; } @@ -1265,7 +1293,6 @@ static DEFINE_PER_CPU(int, rb_need_resch * ring_buffer_lock_reserve - reserve a part of the buffer * @buffer: the ring buffer to reserve from * @length: the length of the data to reserve (excluding event header) - * @flags: a pointer to save the interrupt flags * * Returns a reseverd event on the ring buffer to copy directly to. * The user of this interface will need to get the body to write into @@ -1278,9 +1305,7 @@ static DEFINE_PER_CPU(int, rb_need_resch * If NULL is returned, then nothing has been allocated or locked. */ struct ring_buffer_event * -ring_buffer_lock_reserve(struct ring_buffer *buffer, - unsigned long length, - unsigned long *flags) +ring_buffer_lock_reserve(struct ring_buffer *buffer, unsigned long length) { struct ring_buffer_per_cpu *cpu_buffer; struct ring_buffer_event *event; @@ -1347,15 +1372,13 @@ static void rb_commit(struct ring_buffer * ring_buffer_unlock_commit - commit a reserved * @buffer: The buffer to commit to * @event: The event pointer to commit. - * @flags: the interrupt flags received from ring_buffer_lock_reserve. * * This commits the data to the ring buffer, and releases any locks held. * * Must be paired with ring_buffer_lock_reserve. */ int ring_buffer_unlock_commit(struct ring_buffer *buffer, - struct ring_buffer_event *event, - unsigned long flags) + struct ring_buffer_event *event) { struct ring_buffer_per_cpu *cpu_buffer; int cpu = raw_smp_processor_id(); @@ -1438,7 +1461,7 @@ int ring_buffer_write(struct ring_buffer } EXPORT_SYMBOL_GPL(ring_buffer_write); -static inline int rb_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer) +static int rb_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer) { struct buffer_page *reader = cpu_buffer->reader_page; struct buffer_page *head = cpu_buffer->head_page; @@ -2277,9 +2300,24 @@ int ring_buffer_swap_cpu(struct ring_buf if (buffer_a->pages != buffer_b->pages) return -EINVAL; + if (ring_buffer_flags != RB_BUFFERS_ON) + return -EAGAIN; + + if (atomic_read(&buffer_a->record_disabled)) + return -EAGAIN; + + if (atomic_read(&buffer_b->record_disabled)) + return -EAGAIN; + cpu_buffer_a = buffer_a->buffers[cpu]; cpu_buffer_b = buffer_b->buffers[cpu]; + if (atomic_read(&cpu_buffer_a->record_disabled)) + return -EAGAIN; + + if (atomic_read(&cpu_buffer_b->record_disabled)) + return -EAGAIN; + /* * We can't do a synchronize_sched here because this * function can be called in atomic context. @@ -2303,13 +2341,14 @@ int ring_buffer_swap_cpu(struct ring_buf EXPORT_SYMBOL_GPL(ring_buffer_swap_cpu); static void rb_remove_entries(struct ring_buffer_per_cpu *cpu_buffer, - struct buffer_data_page *bpage) + struct buffer_data_page *bpage, + unsigned int offset) { struct ring_buffer_event *event; unsigned long head; __raw_spin_lock(&cpu_buffer->lock); - for (head = 0; head < local_read(&bpage->commit); + for (head = offset; head < local_read(&bpage->commit); head += rb_event_length(event)) { event = __rb_data_page_index(bpage, head); @@ -2377,12 +2416,12 @@ void ring_buffer_free_read_page(struct r * to swap with a page in the ring buffer. * * for example: - * rpage = ring_buffer_alloc_page(buffer); + * rpage = ring_buffer_alloc_read_page(buffer); * if (!rpage) * return error; * ret = ring_buffer_read_page(buffer, &rpage, cpu, 0); - * if (ret) - * process_page(rpage); + * if (ret >= 0) + * process_page(rpage, ret); * * When @full is set, the function will not return true unless * the writer is off the reader page. @@ -2393,8 +2432,8 @@ void ring_buffer_free_read_page(struct r * responsible for that. * * Returns: - * 1 if data has been transferred - * 0 if no data has been transferred. + * >=0 if data has been transferred, returns the offset of consumed data. + * <0 if no data has been transferred. */ int ring_buffer_read_page(struct ring_buffer *buffer, void **data_page, int cpu, int full) @@ -2403,7 +2442,8 @@ int ring_buffer_read_page(struct ring_bu struct ring_buffer_event *event; struct buffer_data_page *bpage; unsigned long flags; - int ret = 0; + unsigned int read; + int ret = -1; if (!data_page) return 0; @@ -2425,25 +2465,29 @@ int ring_buffer_read_page(struct ring_bu /* check for data */ if (!local_read(&cpu_buffer->reader_page->page->commit)) goto out; + + read = cpu_buffer->reader_page->read; /* * If the writer is already off of the read page, then simply * switch the read page with the given page. Otherwise * we need to copy the data from the reader to the writer. */ if (cpu_buffer->reader_page == cpu_buffer->commit_page) { - unsigned int read = cpu_buffer->reader_page->read; + unsigned int commit = rb_page_commit(cpu_buffer->reader_page); + struct buffer_data_page *rpage = cpu_buffer->reader_page->page; if (full) goto out; /* The writer is still on the reader page, we must copy */ - bpage = cpu_buffer->reader_page->page; - memcpy(bpage->data, - cpu_buffer->reader_page->page->data + read, - local_read(&bpage->commit) - read); + memcpy(bpage->data + read, rpage->data + read, commit - read); /* consume what was read */ - cpu_buffer->reader_page += read; + cpu_buffer->reader_page->read = commit; + /* update bpage */ + local_set(&bpage->commit, commit); + if (!read) + bpage->time_stamp = rpage->time_stamp; } else { /* swap the pages */ rb_init_page(bpage); @@ -2452,10 +2496,10 @@ int ring_buffer_read_page(struct ring_bu cpu_buffer->reader_page->read = 0; *data_page = bpage; } - ret = 1; + ret = read; /* update the entry counter */ - rb_remove_entries(cpu_buffer, bpage); + rb_remove_entries(cpu_buffer, bpage, read); out: spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); @@ -2466,7 +2510,7 @@ static ssize_t rb_simple_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos) { - long *p = filp->private_data; + unsigned long *p = filp->private_data; char buf[64]; int r; @@ -2482,9 +2526,9 @@ static ssize_t rb_simple_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos) { - long *p = filp->private_data; + unsigned long *p = filp->private_data; char buf[64]; - long val; + unsigned long val; int ret; if (cnt >= sizeof(buf)) Index: linux-2.6-tip/kernel/trace/trace.c =================================================================== --- linux-2.6-tip.orig/kernel/trace/trace.c +++ linux-2.6-tip/kernel/trace/trace.c @@ -31,12 +31,14 @@ #include #include #include +#include #include #include #include #include "trace.h" +#include "trace_output.h" #define TRACE_BUFFER_FLAGS (RB_FL_OVERWRITE) @@ -52,6 +54,11 @@ unsigned long __read_mostly tracing_thre */ static bool __read_mostly tracing_selftest_running; +/* + * If a tracer is running, we do not want to run SELFTEST. + */ +static bool __read_mostly tracing_selftest_disabled; + /* For tracers that don't implement custom flags */ static struct tracer_opt dummy_tracer_opt[] = { { } @@ -73,7 +80,7 @@ static int dummy_set_flag(u32 old_flags, * of the tracer is successful. But that is the only place that sets * this back to zero. */ -int tracing_disabled = 1; +static int tracing_disabled = 1; static DEFINE_PER_CPU(local_t, ftrace_cpu_disabled); @@ -109,14 +116,19 @@ static cpumask_var_t __read_mostly traci */ int ftrace_dump_on_oops; -static int tracing_set_tracer(char *buf); +static int tracing_set_tracer(const char *buf); + +#define BOOTUP_TRACER_SIZE 100 +static char bootup_tracer_buf[BOOTUP_TRACER_SIZE] __initdata; +static char *default_bootup_tracer; static int __init set_ftrace(char *str) { - tracing_set_tracer(str); + strncpy(bootup_tracer_buf, str, BOOTUP_TRACER_SIZE); + default_bootup_tracer = bootup_tracer_buf; return 1; } -__setup("ftrace", set_ftrace); +__setup("ftrace=", set_ftrace); static int __init set_ftrace_dump_on_oops(char *str) { @@ -186,9 +198,6 @@ int tracing_is_enabled(void) return tracer_enabled; } -/* function tracing enabled */ -int ftrace_function_enabled; - /* * trace_buf_size is the size in bytes that is allocated * for a buffer. Note, the number of bytes is always rounded @@ -229,7 +238,7 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wai /* trace_flags holds trace_options default values */ unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK | - TRACE_ITER_ANNOTATE; + TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO; /** * trace_wake_up - wake up tasks waiting for trace input @@ -239,6 +248,10 @@ unsigned long trace_flags = TRACE_ITER_P */ void trace_wake_up(void) { +#ifdef CONFIG_PREEMPT_RT + if (in_atomic() || irqs_disabled()) + return; +#endif /* * The runqueue_is_locked() can fail, but this is the best we * have for now: @@ -287,6 +300,7 @@ static const char *trace_options[] = { "userstacktrace", "sym-userobj", "printk-msg-only", + "context-info", NULL }; @@ -299,8 +313,7 @@ static const char *trace_options[] = { * This is defined as a raw_spinlock_t in order to help * with performance when lockdep debugging is enabled. */ -static raw_spinlock_t ftrace_max_lock = - (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; +static __raw_spinlock_t ftrace_max_lock = __RAW_SPIN_LOCK_UNLOCKED; /* * Copy the new maximum trace into the separate maximum-trace @@ -326,133 +339,7 @@ __update_max_tr(struct trace_array *tr, data->rt_priority = tsk->rt_priority; /* record this tasks comm */ - tracing_record_cmdline(current); -} - -/** - * trace_seq_printf - sequence printing of trace information - * @s: trace sequence descriptor - * @fmt: printf format string - * - * The tracer may use either sequence operations or its own - * copy to user routines. To simplify formating of a trace - * trace_seq_printf is used to store strings into a special - * buffer (@s). Then the output may be either used by - * the sequencer or pulled into another buffer. - */ -int -trace_seq_printf(struct trace_seq *s, const char *fmt, ...) -{ - int len = (PAGE_SIZE - 1) - s->len; - va_list ap; - int ret; - - if (!len) - return 0; - - va_start(ap, fmt); - ret = vsnprintf(s->buffer + s->len, len, fmt, ap); - va_end(ap); - - /* If we can't write it all, don't bother writing anything */ - if (ret >= len) - return 0; - - s->len += ret; - - return len; -} - -/** - * trace_seq_puts - trace sequence printing of simple string - * @s: trace sequence descriptor - * @str: simple string to record - * - * The tracer may use either the sequence operations or its own - * copy to user routines. This function records a simple string - * into a special buffer (@s) for later retrieval by a sequencer - * or other mechanism. - */ -static int -trace_seq_puts(struct trace_seq *s, const char *str) -{ - int len = strlen(str); - - if (len > ((PAGE_SIZE - 1) - s->len)) - return 0; - - memcpy(s->buffer + s->len, str, len); - s->len += len; - - return len; -} - -static int -trace_seq_putc(struct trace_seq *s, unsigned char c) -{ - if (s->len >= (PAGE_SIZE - 1)) - return 0; - - s->buffer[s->len++] = c; - - return 1; -} - -static int -trace_seq_putmem(struct trace_seq *s, void *mem, size_t len) -{ - if (len > ((PAGE_SIZE - 1) - s->len)) - return 0; - - memcpy(s->buffer + s->len, mem, len); - s->len += len; - - return len; -} - -#define MAX_MEMHEX_BYTES 8 -#define HEX_CHARS (MAX_MEMHEX_BYTES*2 + 1) - -static int -trace_seq_putmem_hex(struct trace_seq *s, void *mem, size_t len) -{ - unsigned char hex[HEX_CHARS]; - unsigned char *data = mem; - int i, j; - -#ifdef __BIG_ENDIAN - for (i = 0, j = 0; i < len; i++) { -#else - for (i = len-1, j = 0; i >= 0; i--) { -#endif - hex[j++] = hex_asc_hi(data[i]); - hex[j++] = hex_asc_lo(data[i]); - } - hex[j++] = ' '; - - return trace_seq_putmem(s, hex, j); -} - -static int -trace_seq_path(struct trace_seq *s, struct path *path) -{ - unsigned char *p; - - if (s->len >= (PAGE_SIZE - 1)) - return 0; - p = d_path(path, s->buffer + s->len, PAGE_SIZE - s->len); - if (!IS_ERR(p)) { - p = mangle_path(s->buffer + s->len, p, "\n"); - if (p) { - s->len = p - s->buffer; - return 1; - } - } else { - s->buffer[s->len++] = '?'; - return 1; - } - - return 0; + tracing_record_cmdline(tsk); } static void @@ -481,6 +368,25 @@ ssize_t trace_seq_to_user(struct trace_s return cnt; } +ssize_t trace_seq_to_buffer(struct trace_seq *s, void *buf, size_t cnt) +{ + int len; + void *ret; + + if (s->len <= s->readpos) + return -EBUSY; + + len = s->len - s->readpos; + if (cnt > len) + cnt = len; + ret = memcpy(buf, s->buffer + s->readpos, cnt); + if (!ret) + return -EFAULT; + + s->readpos += len; + return cnt; +} + static void trace_print_seq(struct seq_file *m, struct trace_seq *s) { @@ -543,7 +449,7 @@ update_max_tr_single(struct trace_array ftrace_enable_cpu(); - WARN_ON_ONCE(ret); + WARN_ON_ONCE(ret && ret != -EAGAIN); __update_max_tr(tr, tsk, cpu); __raw_spin_unlock(&ftrace_max_lock); @@ -556,6 +462,8 @@ update_max_tr_single(struct trace_array * Register a new plugin tracer. */ int register_tracer(struct tracer *type) +__releases(kernel_lock) +__acquires(kernel_lock) { struct tracer *t; int len; @@ -594,9 +502,12 @@ int register_tracer(struct tracer *type) else if (!type->flags->opts) type->flags->opts = dummy_tracer_opt; + if (!type->wait_pipe) + type->wait_pipe = default_wait_pipe; + #ifdef CONFIG_FTRACE_STARTUP_TEST - if (type->selftest) { + if (type->selftest && !tracing_selftest_disabled) { struct tracer *saved_tracer = current_trace; struct trace_array *tr = &global_trace; int i; @@ -638,8 +549,26 @@ int register_tracer(struct tracer *type) out: tracing_selftest_running = false; mutex_unlock(&trace_types_lock); - lock_kernel(); + if (ret || !default_bootup_tracer) + goto out_unlock; + + if (strncmp(default_bootup_tracer, type->name, BOOTUP_TRACER_SIZE)) + goto out_unlock; + + printk(KERN_INFO "Starting tracer '%s'\n", type->name); + /* Do we want this tracer to start on bootup? */ + tracing_set_tracer(type->name); + default_bootup_tracer = NULL; + /* disable other selftests, since this will break it. */ + tracing_selftest_disabled = 1; +#ifdef CONFIG_FTRACE_STARTUP_TEST + printk(KERN_INFO "Disabling FTRACE selftests due to running tracer '%s'\n", + type->name); +#endif + + out_unlock: + lock_kernel(); return ret; } @@ -658,6 +587,15 @@ void unregister_tracer(struct tracer *ty found: *t = (*t)->next; + + if (type == current_trace && tracer_enabled) { + tracer_enabled = 0; + tracing_stop(); + if (current_trace->stop) + current_trace->stop(&global_trace); + current_trace = &nop_trace; + } + if (strlen(type->name) != max_tracer_type_len) goto out; @@ -696,7 +634,7 @@ static int cmdline_idx; static DEFINE_SPINLOCK(trace_cmdline_lock); /* temporary disable recording */ -atomic_t trace_record_cmdline_disabled __read_mostly; +static atomic_t trace_record_cmdline_disabled __read_mostly; static void trace_init_cmdlines(void) { @@ -706,7 +644,7 @@ static void trace_init_cmdlines(void) } static int trace_stop_count; -static DEFINE_SPINLOCK(tracing_start_lock); +static DEFINE_RAW_SPINLOCK(tracing_start_lock); /** * ftrace_off_permanent - disable all ftrace code permanently @@ -738,13 +676,12 @@ void tracing_start(void) return; spin_lock_irqsave(&tracing_start_lock, flags); - if (--trace_stop_count) - goto out; - - if (trace_stop_count < 0) { - /* Someone screwed up their debugging */ - WARN_ON_ONCE(1); - trace_stop_count = 0; + if (--trace_stop_count) { + if (trace_stop_count < 0) { + /* Someone screwed up their debugging */ + WARN_ON_ONCE(1); + trace_stop_count = 0; + } goto out; } @@ -876,78 +813,100 @@ tracing_generic_entry_update(struct trac (need_resched() ? TRACE_FLAG_NEED_RESCHED : 0); } +struct ring_buffer_event *trace_buffer_lock_reserve(struct trace_array *tr, + unsigned char type, + unsigned long len, + unsigned long flags, int pc) +{ + struct ring_buffer_event *event; + + event = ring_buffer_lock_reserve(tr->buffer, len); + if (event != NULL) { + struct trace_entry *ent = ring_buffer_event_data(event); + + tracing_generic_entry_update(ent, flags, pc); + ent->type = type; + } + + return event; +} +static void ftrace_trace_stack(struct trace_array *tr, + unsigned long flags, int skip, int pc); +static void ftrace_trace_userstack(struct trace_array *tr, + unsigned long flags, int pc); + +void trace_buffer_unlock_commit(struct trace_array *tr, + struct ring_buffer_event *event, + unsigned long flags, int pc) +{ + ring_buffer_unlock_commit(tr->buffer, event); + + ftrace_trace_stack(tr, flags, 6, pc); + ftrace_trace_userstack(tr, flags, pc); + trace_wake_up(); +} + void -trace_function(struct trace_array *tr, struct trace_array_cpu *data, +trace_function(struct trace_array *tr, unsigned long ip, unsigned long parent_ip, unsigned long flags, int pc) { struct ring_buffer_event *event; struct ftrace_entry *entry; - unsigned long irq_flags; /* If we are reading the ring buffer, don't trace */ if (unlikely(local_read(&__get_cpu_var(ftrace_cpu_disabled)))) return; - event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), - &irq_flags); + event = trace_buffer_lock_reserve(tr, TRACE_FN, sizeof(*entry), + flags, pc); if (!event) return; entry = ring_buffer_event_data(event); - tracing_generic_entry_update(&entry->ent, flags, pc); - entry->ent.type = TRACE_FN; entry->ip = ip; entry->parent_ip = parent_ip; - ring_buffer_unlock_commit(tr->buffer, event, irq_flags); + ring_buffer_unlock_commit(tr->buffer, event); } #ifdef CONFIG_FUNCTION_GRAPH_TRACER static void __trace_graph_entry(struct trace_array *tr, - struct trace_array_cpu *data, struct ftrace_graph_ent *trace, unsigned long flags, int pc) { struct ring_buffer_event *event; struct ftrace_graph_ent_entry *entry; - unsigned long irq_flags; if (unlikely(local_read(&__get_cpu_var(ftrace_cpu_disabled)))) return; - event = ring_buffer_lock_reserve(global_trace.buffer, sizeof(*entry), - &irq_flags); + event = trace_buffer_lock_reserve(&global_trace, TRACE_GRAPH_ENT, + sizeof(*entry), flags, pc); if (!event) return; entry = ring_buffer_event_data(event); - tracing_generic_entry_update(&entry->ent, flags, pc); - entry->ent.type = TRACE_GRAPH_ENT; entry->graph_ent = *trace; - ring_buffer_unlock_commit(global_trace.buffer, event, irq_flags); + ring_buffer_unlock_commit(global_trace.buffer, event); } static void __trace_graph_return(struct trace_array *tr, - struct trace_array_cpu *data, struct ftrace_graph_ret *trace, unsigned long flags, int pc) { struct ring_buffer_event *event; struct ftrace_graph_ret_entry *entry; - unsigned long irq_flags; if (unlikely(local_read(&__get_cpu_var(ftrace_cpu_disabled)))) return; - event = ring_buffer_lock_reserve(global_trace.buffer, sizeof(*entry), - &irq_flags); + event = trace_buffer_lock_reserve(&global_trace, TRACE_GRAPH_RET, + sizeof(*entry), flags, pc); if (!event) return; entry = ring_buffer_event_data(event); - tracing_generic_entry_update(&entry->ent, flags, pc); - entry->ent.type = TRACE_GRAPH_RET; entry->ret = *trace; - ring_buffer_unlock_commit(global_trace.buffer, event, irq_flags); + ring_buffer_unlock_commit(global_trace.buffer, event); } #endif @@ -957,31 +916,23 @@ ftrace(struct trace_array *tr, struct tr int pc) { if (likely(!atomic_read(&data->disabled))) - trace_function(tr, data, ip, parent_ip, flags, pc); + trace_function(tr, ip, parent_ip, flags, pc); } -static void ftrace_trace_stack(struct trace_array *tr, - struct trace_array_cpu *data, - unsigned long flags, - int skip, int pc) +static void __ftrace_trace_stack(struct trace_array *tr, + unsigned long flags, + int skip, int pc) { #ifdef CONFIG_STACKTRACE struct ring_buffer_event *event; struct stack_entry *entry; struct stack_trace trace; - unsigned long irq_flags; - - if (!(trace_flags & TRACE_ITER_STACKTRACE)) - return; - event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), - &irq_flags); + event = trace_buffer_lock_reserve(tr, TRACE_STACK, + sizeof(*entry), flags, pc); if (!event) return; entry = ring_buffer_event_data(event); - tracing_generic_entry_update(&entry->ent, flags, pc); - entry->ent.type = TRACE_STACK; - memset(&entry->caller, 0, sizeof(entry->caller)); trace.nr_entries = 0; @@ -990,38 +941,43 @@ static void ftrace_trace_stack(struct tr trace.entries = entry->caller; save_stack_trace(&trace); - ring_buffer_unlock_commit(tr->buffer, event, irq_flags); + ring_buffer_unlock_commit(tr->buffer, event); #endif } +static void ftrace_trace_stack(struct trace_array *tr, + unsigned long flags, + int skip, int pc) +{ + if (!(trace_flags & TRACE_ITER_STACKTRACE)) + return; + + __ftrace_trace_stack(tr, flags, skip, pc); +} + void __trace_stack(struct trace_array *tr, - struct trace_array_cpu *data, unsigned long flags, - int skip) + int skip, int pc) { - ftrace_trace_stack(tr, data, flags, skip, preempt_count()); + __ftrace_trace_stack(tr, flags, skip, pc); } static void ftrace_trace_userstack(struct trace_array *tr, - struct trace_array_cpu *data, - unsigned long flags, int pc) + unsigned long flags, int pc) { #ifdef CONFIG_STACKTRACE struct ring_buffer_event *event; struct userstack_entry *entry; struct stack_trace trace; - unsigned long irq_flags; if (!(trace_flags & TRACE_ITER_USERSTACKTRACE)) return; - event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), - &irq_flags); + event = trace_buffer_lock_reserve(tr, TRACE_USER_STACK, + sizeof(*entry), flags, pc); if (!event) return; entry = ring_buffer_event_data(event); - tracing_generic_entry_update(&entry->ent, flags, pc); - entry->ent.type = TRACE_USER_STACK; memset(&entry->caller, 0, sizeof(entry->caller)); @@ -1031,70 +987,58 @@ static void ftrace_trace_userstack(struc trace.entries = entry->caller; save_stack_trace_user(&trace); - ring_buffer_unlock_commit(tr->buffer, event, irq_flags); + ring_buffer_unlock_commit(tr->buffer, event); #endif } -void __trace_userstack(struct trace_array *tr, - struct trace_array_cpu *data, - unsigned long flags) +#ifdef UNUSED +static void __trace_userstack(struct trace_array *tr, unsigned long flags) { - ftrace_trace_userstack(tr, data, flags, preempt_count()); + ftrace_trace_userstack(tr, flags, preempt_count()); } +#endif /* UNUSED */ static void -ftrace_trace_special(void *__tr, void *__data, +ftrace_trace_special(void *__tr, unsigned long arg1, unsigned long arg2, unsigned long arg3, int pc) { struct ring_buffer_event *event; - struct trace_array_cpu *data = __data; struct trace_array *tr = __tr; struct special_entry *entry; - unsigned long irq_flags; - event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), - &irq_flags); + event = trace_buffer_lock_reserve(tr, TRACE_SPECIAL, + sizeof(*entry), 0, pc); if (!event) return; entry = ring_buffer_event_data(event); - tracing_generic_entry_update(&entry->ent, 0, pc); - entry->ent.type = TRACE_SPECIAL; entry->arg1 = arg1; entry->arg2 = arg2; entry->arg3 = arg3; - ring_buffer_unlock_commit(tr->buffer, event, irq_flags); - ftrace_trace_stack(tr, data, irq_flags, 4, pc); - ftrace_trace_userstack(tr, data, irq_flags, pc); - - trace_wake_up(); + trace_buffer_unlock_commit(tr, event, 0, pc); } void __trace_special(void *__tr, void *__data, unsigned long arg1, unsigned long arg2, unsigned long arg3) { - ftrace_trace_special(__tr, __data, arg1, arg2, arg3, preempt_count()); + ftrace_trace_special(__tr, arg1, arg2, arg3, preempt_count()); } void tracing_sched_switch_trace(struct trace_array *tr, - struct trace_array_cpu *data, struct task_struct *prev, struct task_struct *next, unsigned long flags, int pc) { struct ring_buffer_event *event; struct ctx_switch_entry *entry; - unsigned long irq_flags; - event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), - &irq_flags); + event = trace_buffer_lock_reserve(tr, TRACE_CTX, + sizeof(*entry), flags, pc); if (!event) return; entry = ring_buffer_event_data(event); - tracing_generic_entry_update(&entry->ent, flags, pc); - entry->ent.type = TRACE_CTX; entry->prev_pid = prev->pid; entry->prev_prio = prev->prio; entry->prev_state = prev->state; @@ -1102,29 +1046,23 @@ tracing_sched_switch_trace(struct trace_ entry->next_prio = next->prio; entry->next_state = next->state; entry->next_cpu = task_cpu(next); - ring_buffer_unlock_commit(tr->buffer, event, irq_flags); - ftrace_trace_stack(tr, data, flags, 5, pc); - ftrace_trace_userstack(tr, data, flags, pc); + trace_buffer_unlock_commit(tr, event, flags, pc); } void tracing_sched_wakeup_trace(struct trace_array *tr, - struct trace_array_cpu *data, struct task_struct *wakee, struct task_struct *curr, unsigned long flags, int pc) { struct ring_buffer_event *event; struct ctx_switch_entry *entry; - unsigned long irq_flags; - event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), - &irq_flags); + event = trace_buffer_lock_reserve(tr, TRACE_WAKE, + sizeof(*entry), flags, pc); if (!event) return; entry = ring_buffer_event_data(event); - tracing_generic_entry_update(&entry->ent, flags, pc); - entry->ent.type = TRACE_WAKE; entry->prev_pid = curr->pid; entry->prev_prio = curr->prio; entry->prev_state = curr->state; @@ -1132,11 +1070,10 @@ tracing_sched_wakeup_trace(struct trace_ entry->next_prio = wakee->prio; entry->next_state = wakee->state; entry->next_cpu = task_cpu(wakee); - ring_buffer_unlock_commit(tr->buffer, event, irq_flags); - ftrace_trace_stack(tr, data, flags, 6, pc); - ftrace_trace_userstack(tr, data, flags, pc); - trace_wake_up(); + ring_buffer_unlock_commit(tr->buffer, event); + ftrace_trace_stack(tr, flags, 6, pc); + ftrace_trace_userstack(tr, flags, pc); } void @@ -1157,66 +1094,7 @@ ftrace_special(unsigned long arg1, unsig data = tr->data[cpu]; if (likely(atomic_inc_return(&data->disabled) == 1)) - ftrace_trace_special(tr, data, arg1, arg2, arg3, pc); - - atomic_dec(&data->disabled); - local_irq_restore(flags); -} - -#ifdef CONFIG_FUNCTION_TRACER -static void -function_trace_call_preempt_only(unsigned long ip, unsigned long parent_ip) -{ - struct trace_array *tr = &global_trace; - struct trace_array_cpu *data; - unsigned long flags; - long disabled; - int cpu, resched; - int pc; - - if (unlikely(!ftrace_function_enabled)) - return; - - pc = preempt_count(); - resched = ftrace_preempt_disable(); - local_save_flags(flags); - cpu = raw_smp_processor_id(); - data = tr->data[cpu]; - disabled = atomic_inc_return(&data->disabled); - - if (likely(disabled == 1)) - trace_function(tr, data, ip, parent_ip, flags, pc); - - atomic_dec(&data->disabled); - ftrace_preempt_enable(resched); -} - -static void -function_trace_call(unsigned long ip, unsigned long parent_ip) -{ - struct trace_array *tr = &global_trace; - struct trace_array_cpu *data; - unsigned long flags; - long disabled; - int cpu; - int pc; - - if (unlikely(!ftrace_function_enabled)) - return; - - /* - * Need to use raw, since this must be called before the - * recursive protection is performed. - */ - local_irq_save(flags); - cpu = raw_smp_processor_id(); - data = tr->data[cpu]; - disabled = atomic_inc_return(&data->disabled); - - if (likely(disabled == 1)) { - pc = preempt_count(); - trace_function(tr, data, ip, parent_ip, flags, pc); - } + ftrace_trace_special(tr, arg1, arg2, arg3, pc); atomic_dec(&data->disabled); local_irq_restore(flags); @@ -1244,7 +1122,7 @@ int trace_graph_entry(struct ftrace_grap disabled = atomic_inc_return(&data->disabled); if (likely(disabled == 1)) { pc = preempt_count(); - __trace_graph_entry(tr, data, trace, flags, pc); + __trace_graph_entry(tr, trace, flags, pc); } /* Only do the atomic if it is not already set */ if (!test_tsk_trace_graph(current)) @@ -1270,7 +1148,7 @@ void trace_graph_return(struct ftrace_gr disabled = atomic_inc_return(&data->disabled); if (likely(disabled == 1)) { pc = preempt_count(); - __trace_graph_return(tr, data, trace, flags, pc); + __trace_graph_return(tr, trace, flags, pc); } if (!trace->depth) clear_tsk_trace_graph(current); @@ -1279,31 +1157,6 @@ void trace_graph_return(struct ftrace_gr } #endif /* CONFIG_FUNCTION_GRAPH_TRACER */ -static struct ftrace_ops trace_ops __read_mostly = -{ - .func = function_trace_call, -}; - -void tracing_start_function_trace(void) -{ - ftrace_function_enabled = 0; - - if (trace_flags & TRACE_ITER_PREEMPTONLY) - trace_ops.func = function_trace_call_preempt_only; - else - trace_ops.func = function_trace_call; - - register_ftrace_function(&trace_ops); - ftrace_function_enabled = 1; -} - -void tracing_stop_function_trace(void) -{ - ftrace_function_enabled = 0; - unregister_ftrace_function(&trace_ops); -} -#endif - enum trace_file_type { TRACE_FILE_LAT_FMT = 1, TRACE_FILE_ANNOTATE = 2, @@ -1376,8 +1229,8 @@ __find_next_entry(struct trace_iterator } /* Find the next real entry, without updating the iterator itself */ -static struct trace_entry * -find_next_entry(struct trace_iterator *iter, int *ent_cpu, u64 *ent_ts) +struct trace_entry *trace_find_next_entry(struct trace_iterator *iter, + int *ent_cpu, u64 *ent_ts) { return __find_next_entry(iter, ent_cpu, ent_ts); } @@ -1472,154 +1325,6 @@ static void s_stop(struct seq_file *m, v mutex_unlock(&trace_types_lock); } -#ifdef CONFIG_KRETPROBES -static inline const char *kretprobed(const char *name) -{ - static const char tramp_name[] = "kretprobe_trampoline"; - int size = sizeof(tramp_name); - - if (strncmp(tramp_name, name, size) == 0) - return "[unknown/kretprobe'd]"; - return name; -} -#else -static inline const char *kretprobed(const char *name) -{ - return name; -} -#endif /* CONFIG_KRETPROBES */ - -static int -seq_print_sym_short(struct trace_seq *s, const char *fmt, unsigned long address) -{ -#ifdef CONFIG_KALLSYMS - char str[KSYM_SYMBOL_LEN]; - const char *name; - - kallsyms_lookup(address, NULL, NULL, NULL, str); - - name = kretprobed(str); - - return trace_seq_printf(s, fmt, name); -#endif - return 1; -} - -static int -seq_print_sym_offset(struct trace_seq *s, const char *fmt, - unsigned long address) -{ -#ifdef CONFIG_KALLSYMS - char str[KSYM_SYMBOL_LEN]; - const char *name; - - sprint_symbol(str, address); - name = kretprobed(str); - - return trace_seq_printf(s, fmt, name); -#endif - return 1; -} - -#ifndef CONFIG_64BIT -# define IP_FMT "%08lx" -#else -# define IP_FMT "%016lx" -#endif - -int -seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags) -{ - int ret; - - if (!ip) - return trace_seq_printf(s, "0"); - - if (sym_flags & TRACE_ITER_SYM_OFFSET) - ret = seq_print_sym_offset(s, "%s", ip); - else - ret = seq_print_sym_short(s, "%s", ip); - - if (!ret) - return 0; - - if (sym_flags & TRACE_ITER_SYM_ADDR) - ret = trace_seq_printf(s, " <" IP_FMT ">", ip); - return ret; -} - -static inline int seq_print_user_ip(struct trace_seq *s, struct mm_struct *mm, - unsigned long ip, unsigned long sym_flags) -{ - struct file *file = NULL; - unsigned long vmstart = 0; - int ret = 1; - - if (mm) { - const struct vm_area_struct *vma; - - down_read(&mm->mmap_sem); - vma = find_vma(mm, ip); - if (vma) { - file = vma->vm_file; - vmstart = vma->vm_start; - } - if (file) { - ret = trace_seq_path(s, &file->f_path); - if (ret) - ret = trace_seq_printf(s, "[+0x%lx]", ip - vmstart); - } - up_read(&mm->mmap_sem); - } - if (ret && ((sym_flags & TRACE_ITER_SYM_ADDR) || !file)) - ret = trace_seq_printf(s, " <" IP_FMT ">", ip); - return ret; -} - -static int -seq_print_userip_objs(const struct userstack_entry *entry, struct trace_seq *s, - unsigned long sym_flags) -{ - struct mm_struct *mm = NULL; - int ret = 1; - unsigned int i; - - if (trace_flags & TRACE_ITER_SYM_USEROBJ) { - struct task_struct *task; - /* - * we do the lookup on the thread group leader, - * since individual threads might have already quit! - */ - rcu_read_lock(); - task = find_task_by_vpid(entry->ent.tgid); - if (task) - mm = get_task_mm(task); - rcu_read_unlock(); - } - - for (i = 0; i < FTRACE_STACK_ENTRIES; i++) { - unsigned long ip = entry->caller[i]; - - if (ip == ULONG_MAX || !ret) - break; - if (i && ret) - ret = trace_seq_puts(s, " <- "); - if (!ip) { - if (ret) - ret = trace_seq_puts(s, "??"); - continue; - } - if (!ret) - break; - if (ret) - ret = seq_print_user_ip(s, mm, ip, sym_flags); - } - - if (mm) - mmput(mm); - return ret; -} - static void print_lat_help_header(struct seq_file *m) { seq_puts(m, "# _------=> CPU# \n"); @@ -1704,103 +1409,6 @@ print_trace_header(struct seq_file *m, s seq_puts(m, "\n"); } -static void -lat_print_generic(struct trace_seq *s, struct trace_entry *entry, int cpu) -{ - int hardirq, softirq; - char *comm; - - comm = trace_find_cmdline(entry->pid); - - trace_seq_printf(s, "%8.8s-%-5d ", comm, entry->pid); - trace_seq_printf(s, "%3d", cpu); - trace_seq_printf(s, "%c%c", - (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' : - (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' : '.', - ((entry->flags & TRACE_FLAG_NEED_RESCHED) ? 'N' : '.')); - - hardirq = entry->flags & TRACE_FLAG_HARDIRQ; - softirq = entry->flags & TRACE_FLAG_SOFTIRQ; - if (hardirq && softirq) { - trace_seq_putc(s, 'H'); - } else { - if (hardirq) { - trace_seq_putc(s, 'h'); - } else { - if (softirq) - trace_seq_putc(s, 's'); - else - trace_seq_putc(s, '.'); - } - } - - if (entry->preempt_count) - trace_seq_printf(s, "%x", entry->preempt_count); - else - trace_seq_puts(s, "."); -} - -unsigned long preempt_mark_thresh = 100; - -static void -lat_print_timestamp(struct trace_seq *s, u64 abs_usecs, - unsigned long rel_usecs) -{ - trace_seq_printf(s, " %4lldus", abs_usecs); - if (rel_usecs > preempt_mark_thresh) - trace_seq_puts(s, "!: "); - else if (rel_usecs > 1) - trace_seq_puts(s, "+: "); - else - trace_seq_puts(s, " : "); -} - -static const char state_to_char[] = TASK_STATE_TO_CHAR_STR; - -static int task_state_char(unsigned long state) -{ - int bit = state ? __ffs(state) + 1 : 0; - - return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?'; -} - -/* - * The message is supposed to contain an ending newline. - * If the printing stops prematurely, try to add a newline of our own. - */ -void trace_seq_print_cont(struct trace_seq *s, struct trace_iterator *iter) -{ - struct trace_entry *ent; - struct trace_field_cont *cont; - bool ok = true; - - ent = peek_next_entry(iter, iter->cpu, NULL); - if (!ent || ent->type != TRACE_CONT) { - trace_seq_putc(s, '\n'); - return; - } - - do { - cont = (struct trace_field_cont *)ent; - if (ok) - ok = (trace_seq_printf(s, "%s", cont->buf) > 0); - - ftrace_disable_cpu(); - - if (iter->buffer_iter[iter->cpu]) - ring_buffer_read(iter->buffer_iter[iter->cpu], NULL); - else - ring_buffer_consume(iter->tr->buffer, iter->cpu, NULL); - - ftrace_enable_cpu(); - - ent = peek_next_entry(iter, iter->cpu, NULL); - } while (ent && ent->type == TRACE_CONT); - - if (!ok) - trace_seq_putc(s, '\n'); -} - static void test_cpu_buff_start(struct trace_iterator *iter) { struct trace_seq *s = &iter->seq; @@ -1811,145 +1419,38 @@ static void test_cpu_buff_start(struct t if (!(iter->iter_flags & TRACE_FILE_ANNOTATE)) return; - if (cpumask_test_cpu(iter->cpu, iter->started)) - return; - - cpumask_set_cpu(iter->cpu, iter->started); - trace_seq_printf(s, "##### CPU %u buffer started ####\n", iter->cpu); -} - -static enum print_line_t -print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu) -{ - struct trace_seq *s = &iter->seq; - unsigned long sym_flags = (trace_flags & TRACE_ITER_SYM_MASK); - struct trace_entry *next_entry; - unsigned long verbose = (trace_flags & TRACE_ITER_VERBOSE); - struct trace_entry *entry = iter->ent; - unsigned long abs_usecs; - unsigned long rel_usecs; - u64 next_ts; - char *comm; - int S, T; - int i; - - if (entry->type == TRACE_CONT) - return TRACE_TYPE_HANDLED; - - test_cpu_buff_start(iter); - - next_entry = find_next_entry(iter, NULL, &next_ts); - if (!next_entry) - next_ts = iter->ts; - rel_usecs = ns2usecs(next_ts - iter->ts); - abs_usecs = ns2usecs(iter->ts - iter->tr->time_start); - - if (verbose) { - comm = trace_find_cmdline(entry->pid); - trace_seq_printf(s, "%16s %5d %3d %d %08x %08x [%08lx]" - " %ld.%03ldms (+%ld.%03ldms): ", - comm, - entry->pid, cpu, entry->flags, - entry->preempt_count, trace_idx, - ns2usecs(iter->ts), - abs_usecs/1000, - abs_usecs % 1000, rel_usecs/1000, - rel_usecs % 1000); - } else { - lat_print_generic(s, entry, cpu); - lat_print_timestamp(s, abs_usecs, rel_usecs); - } - switch (entry->type) { - case TRACE_FN: { - struct ftrace_entry *field; - - trace_assign_type(field, entry); - - seq_print_ip_sym(s, field->ip, sym_flags); - trace_seq_puts(s, " ("); - seq_print_ip_sym(s, field->parent_ip, sym_flags); - trace_seq_puts(s, ")\n"); - break; - } - case TRACE_CTX: - case TRACE_WAKE: { - struct ctx_switch_entry *field; - - trace_assign_type(field, entry); - - T = task_state_char(field->next_state); - S = task_state_char(field->prev_state); - comm = trace_find_cmdline(field->next_pid); - trace_seq_printf(s, " %5d:%3d:%c %s [%03d] %5d:%3d:%c %s\n", - field->prev_pid, - field->prev_prio, - S, entry->type == TRACE_CTX ? "==>" : " +", - field->next_cpu, - field->next_pid, - field->next_prio, - T, comm); - break; - } - case TRACE_SPECIAL: { - struct special_entry *field; - - trace_assign_type(field, entry); + if (cpumask_test_cpu(iter->cpu, iter->started)) + return; - trace_seq_printf(s, "# %ld %ld %ld\n", - field->arg1, - field->arg2, - field->arg3); - break; - } - case TRACE_STACK: { - struct stack_entry *field; + cpumask_set_cpu(iter->cpu, iter->started); + trace_seq_printf(s, "##### CPU %u buffer started ####\n", iter->cpu); +} - trace_assign_type(field, entry); +static enum print_line_t print_lat_fmt(struct trace_iterator *iter) +{ + struct trace_seq *s = &iter->seq; + unsigned long sym_flags = (trace_flags & TRACE_ITER_SYM_MASK); + struct trace_event *event; + struct trace_entry *entry = iter->ent; - for (i = 0; i < FTRACE_STACK_ENTRIES; i++) { - if (i) - trace_seq_puts(s, " <= "); - seq_print_ip_sym(s, field->caller[i], sym_flags); - } - trace_seq_puts(s, "\n"); - break; - } - case TRACE_PRINT: { - struct print_entry *field; + test_cpu_buff_start(iter); - trace_assign_type(field, entry); + event = ftrace_find_event(entry->type); - seq_print_ip_sym(s, field->ip, sym_flags); - trace_seq_printf(s, ": %s", field->buf); - if (entry->flags & TRACE_FLAG_CONT) - trace_seq_print_cont(s, iter); - break; + if (trace_flags & TRACE_ITER_CONTEXT_INFO) { + if (!trace_print_lat_context(iter)) + goto partial; } - case TRACE_BRANCH: { - struct trace_branch *field; - trace_assign_type(field, entry); + if (event) + return event->latency_trace(iter, sym_flags); - trace_seq_printf(s, "[%s] %s:%s:%d\n", - field->correct ? " ok " : " MISS ", - field->func, - field->file, - field->line); - break; - } - case TRACE_USER_STACK: { - struct userstack_entry *field; + if (!trace_seq_printf(s, "Unknown type %d\n", entry->type)) + goto partial; - trace_assign_type(field, entry); - - seq_print_userip_objs(field, s, sym_flags); - trace_seq_putc(s, '\n'); - break; - } - default: - trace_seq_printf(s, "Unknown type %d\n", entry->type); - } return TRACE_TYPE_HANDLED; +partial: + return TRACE_TYPE_PARTIAL_LINE; } static enum print_line_t print_trace_fmt(struct trace_iterator *iter) @@ -1957,313 +1458,78 @@ static enum print_line_t print_trace_fmt struct trace_seq *s = &iter->seq; unsigned long sym_flags = (trace_flags & TRACE_ITER_SYM_MASK); struct trace_entry *entry; - unsigned long usec_rem; - unsigned long long t; - unsigned long secs; - char *comm; - int ret; - int S, T; - int i; + struct trace_event *event; entry = iter->ent; - if (entry->type == TRACE_CONT) - return TRACE_TYPE_HANDLED; - test_cpu_buff_start(iter); - comm = trace_find_cmdline(iter->ent->pid); - - t = ns2usecs(iter->ts); - usec_rem = do_div(t, 1000000ULL); - secs = (unsigned long)t; - - ret = trace_seq_printf(s, "%16s-%-5d ", comm, entry->pid); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - ret = trace_seq_printf(s, "[%03d] ", iter->cpu); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - ret = trace_seq_printf(s, "%5lu.%06lu: ", secs, usec_rem); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - - switch (entry->type) { - case TRACE_FN: { - struct ftrace_entry *field; - - trace_assign_type(field, entry); - - ret = seq_print_ip_sym(s, field->ip, sym_flags); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - if ((sym_flags & TRACE_ITER_PRINT_PARENT) && - field->parent_ip) { - ret = trace_seq_printf(s, " <-"); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - ret = seq_print_ip_sym(s, - field->parent_ip, - sym_flags); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - } - ret = trace_seq_printf(s, "\n"); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - break; - } - case TRACE_CTX: - case TRACE_WAKE: { - struct ctx_switch_entry *field; - - trace_assign_type(field, entry); - - T = task_state_char(field->next_state); - S = task_state_char(field->prev_state); - ret = trace_seq_printf(s, " %5d:%3d:%c %s [%03d] %5d:%3d:%c\n", - field->prev_pid, - field->prev_prio, - S, - entry->type == TRACE_CTX ? "==>" : " +", - field->next_cpu, - field->next_pid, - field->next_prio, - T); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - break; - } - case TRACE_SPECIAL: { - struct special_entry *field; - - trace_assign_type(field, entry); - - ret = trace_seq_printf(s, "# %ld %ld %ld\n", - field->arg1, - field->arg2, - field->arg3); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - break; - } - case TRACE_STACK: { - struct stack_entry *field; - - trace_assign_type(field, entry); - - for (i = 0; i < FTRACE_STACK_ENTRIES; i++) { - if (i) { - ret = trace_seq_puts(s, " <= "); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - } - ret = seq_print_ip_sym(s, field->caller[i], - sym_flags); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - } - ret = trace_seq_puts(s, "\n"); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - break; - } - case TRACE_PRINT: { - struct print_entry *field; - - trace_assign_type(field, entry); + event = ftrace_find_event(entry->type); - seq_print_ip_sym(s, field->ip, sym_flags); - trace_seq_printf(s, ": %s", field->buf); - if (entry->flags & TRACE_FLAG_CONT) - trace_seq_print_cont(s, iter); - break; - } - case TRACE_GRAPH_RET: { - return print_graph_function(iter); - } - case TRACE_GRAPH_ENT: { - return print_graph_function(iter); + if (trace_flags & TRACE_ITER_CONTEXT_INFO) { + if (!trace_print_context(iter)) + goto partial; } - case TRACE_BRANCH: { - struct trace_branch *field; - trace_assign_type(field, entry); - - trace_seq_printf(s, "[%s] %s:%s:%d\n", - field->correct ? " ok " : " MISS ", - field->func, - field->file, - field->line); - break; - } - case TRACE_USER_STACK: { - struct userstack_entry *field; + if (event) + return event->trace(iter, sym_flags); - trace_assign_type(field, entry); + if (!trace_seq_printf(s, "Unknown type %d\n", entry->type)) + goto partial; - ret = seq_print_userip_objs(field, s, sym_flags); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - ret = trace_seq_putc(s, '\n'); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - break; - } - } return TRACE_TYPE_HANDLED; +partial: + return TRACE_TYPE_PARTIAL_LINE; } static enum print_line_t print_raw_fmt(struct trace_iterator *iter) { struct trace_seq *s = &iter->seq; struct trace_entry *entry; - int ret; - int S, T; + struct trace_event *event; entry = iter->ent; - if (entry->type == TRACE_CONT) - return TRACE_TYPE_HANDLED; - - ret = trace_seq_printf(s, "%d %d %llu ", - entry->pid, iter->cpu, iter->ts); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - - switch (entry->type) { - case TRACE_FN: { - struct ftrace_entry *field; - - trace_assign_type(field, entry); - - ret = trace_seq_printf(s, "%x %x\n", - field->ip, - field->parent_ip); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - break; - } - case TRACE_CTX: - case TRACE_WAKE: { - struct ctx_switch_entry *field; - - trace_assign_type(field, entry); - - T = task_state_char(field->next_state); - S = entry->type == TRACE_WAKE ? '+' : - task_state_char(field->prev_state); - ret = trace_seq_printf(s, "%d %d %c %d %d %d %c\n", - field->prev_pid, - field->prev_prio, - S, - field->next_cpu, - field->next_pid, - field->next_prio, - T); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - break; - } - case TRACE_SPECIAL: - case TRACE_USER_STACK: - case TRACE_STACK: { - struct special_entry *field; - - trace_assign_type(field, entry); - - ret = trace_seq_printf(s, "# %ld %ld %ld\n", - field->arg1, - field->arg2, - field->arg3); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - break; + if (trace_flags & TRACE_ITER_CONTEXT_INFO) { + if (!trace_seq_printf(s, "%d %d %llu ", + entry->pid, iter->cpu, iter->ts)) + goto partial; } - case TRACE_PRINT: { - struct print_entry *field; - trace_assign_type(field, entry); + event = ftrace_find_event(entry->type); + if (event) + return event->raw(iter, 0); + + if (!trace_seq_printf(s, "%d ?\n", entry->type)) + goto partial; - trace_seq_printf(s, "# %lx %s", field->ip, field->buf); - if (entry->flags & TRACE_FLAG_CONT) - trace_seq_print_cont(s, iter); - break; - } - } return TRACE_TYPE_HANDLED; +partial: + return TRACE_TYPE_PARTIAL_LINE; } -#define SEQ_PUT_FIELD_RET(s, x) \ -do { \ - if (!trace_seq_putmem(s, &(x), sizeof(x))) \ - return 0; \ -} while (0) - -#define SEQ_PUT_HEX_FIELD_RET(s, x) \ -do { \ - BUILD_BUG_ON(sizeof(x) > MAX_MEMHEX_BYTES); \ - if (!trace_seq_putmem_hex(s, &(x), sizeof(x))) \ - return 0; \ -} while (0) - static enum print_line_t print_hex_fmt(struct trace_iterator *iter) { struct trace_seq *s = &iter->seq; unsigned char newline = '\n'; struct trace_entry *entry; - int S, T; + struct trace_event *event; entry = iter->ent; - if (entry->type == TRACE_CONT) - return TRACE_TYPE_HANDLED; - - SEQ_PUT_HEX_FIELD_RET(s, entry->pid); - SEQ_PUT_HEX_FIELD_RET(s, iter->cpu); - SEQ_PUT_HEX_FIELD_RET(s, iter->ts); - - switch (entry->type) { - case TRACE_FN: { - struct ftrace_entry *field; - - trace_assign_type(field, entry); - - SEQ_PUT_HEX_FIELD_RET(s, field->ip); - SEQ_PUT_HEX_FIELD_RET(s, field->parent_ip); - break; - } - case TRACE_CTX: - case TRACE_WAKE: { - struct ctx_switch_entry *field; - - trace_assign_type(field, entry); - - T = task_state_char(field->next_state); - S = entry->type == TRACE_WAKE ? '+' : - task_state_char(field->prev_state); - SEQ_PUT_HEX_FIELD_RET(s, field->prev_pid); - SEQ_PUT_HEX_FIELD_RET(s, field->prev_prio); - SEQ_PUT_HEX_FIELD_RET(s, S); - SEQ_PUT_HEX_FIELD_RET(s, field->next_cpu); - SEQ_PUT_HEX_FIELD_RET(s, field->next_pid); - SEQ_PUT_HEX_FIELD_RET(s, field->next_prio); - SEQ_PUT_HEX_FIELD_RET(s, T); - break; - } - case TRACE_SPECIAL: - case TRACE_USER_STACK: - case TRACE_STACK: { - struct special_entry *field; - - trace_assign_type(field, entry); - - SEQ_PUT_HEX_FIELD_RET(s, field->arg1); - SEQ_PUT_HEX_FIELD_RET(s, field->arg2); - SEQ_PUT_HEX_FIELD_RET(s, field->arg3); - break; + if (trace_flags & TRACE_ITER_CONTEXT_INFO) { + SEQ_PUT_HEX_FIELD_RET(s, entry->pid); + SEQ_PUT_HEX_FIELD_RET(s, iter->cpu); + SEQ_PUT_HEX_FIELD_RET(s, iter->ts); } + + event = ftrace_find_event(entry->type); + if (event) { + enum print_line_t ret = event->hex(iter, 0); + if (ret != TRACE_TYPE_HANDLED) + return ret; } + SEQ_PUT_FIELD_RET(s, newline); return TRACE_TYPE_HANDLED; @@ -2278,13 +1544,10 @@ static enum print_line_t print_printk_ms trace_assign_type(field, entry); - ret = trace_seq_printf(s, field->buf); + ret = trace_seq_printf(s, "%s", field->buf); if (!ret) return TRACE_TYPE_PARTIAL_LINE; - if (entry->flags & TRACE_FLAG_CONT) - trace_seq_print_cont(s, iter); - return TRACE_TYPE_HANDLED; } @@ -2292,53 +1555,18 @@ static enum print_line_t print_bin_fmt(s { struct trace_seq *s = &iter->seq; struct trace_entry *entry; + struct trace_event *event; entry = iter->ent; - if (entry->type == TRACE_CONT) - return TRACE_TYPE_HANDLED; - - SEQ_PUT_FIELD_RET(s, entry->pid); - SEQ_PUT_FIELD_RET(s, entry->cpu); - SEQ_PUT_FIELD_RET(s, iter->ts); - - switch (entry->type) { - case TRACE_FN: { - struct ftrace_entry *field; - - trace_assign_type(field, entry); - - SEQ_PUT_FIELD_RET(s, field->ip); - SEQ_PUT_FIELD_RET(s, field->parent_ip); - break; + if (trace_flags & TRACE_ITER_CONTEXT_INFO) { + SEQ_PUT_FIELD_RET(s, entry->pid); + SEQ_PUT_FIELD_RET(s, iter->cpu); + SEQ_PUT_FIELD_RET(s, iter->ts); } - case TRACE_CTX: { - struct ctx_switch_entry *field; - trace_assign_type(field, entry); - - SEQ_PUT_FIELD_RET(s, field->prev_pid); - SEQ_PUT_FIELD_RET(s, field->prev_prio); - SEQ_PUT_FIELD_RET(s, field->prev_state); - SEQ_PUT_FIELD_RET(s, field->next_pid); - SEQ_PUT_FIELD_RET(s, field->next_prio); - SEQ_PUT_FIELD_RET(s, field->next_state); - break; - } - case TRACE_SPECIAL: - case TRACE_USER_STACK: - case TRACE_STACK: { - struct special_entry *field; - - trace_assign_type(field, entry); - - SEQ_PUT_FIELD_RET(s, field->arg1); - SEQ_PUT_FIELD_RET(s, field->arg2); - SEQ_PUT_FIELD_RET(s, field->arg3); - break; - } - } - return 1; + event = ftrace_find_event(entry->type); + return event ? event->binary(iter, 0) : TRACE_TYPE_HANDLED; } static int trace_empty(struct trace_iterator *iter) @@ -2383,7 +1611,7 @@ static enum print_line_t print_trace_lin return print_raw_fmt(iter); if (iter->iter_flags & TRACE_FILE_LAT_FMT) - return print_lat_fmt(iter, iter->idx, iter->cpu); + return print_lat_fmt(iter); return print_trace_fmt(iter); } @@ -2505,7 +1733,7 @@ int tracing_open_generic(struct inode *i return 0; } -int tracing_release(struct inode *inode, struct file *file) +static int tracing_release(struct inode *inode, struct file *file) { struct seq_file *m = (struct seq_file *)file->private_data; struct trace_iterator *iter = m->private; @@ -2748,7 +1976,7 @@ tracing_trace_options_read(struct file * struct tracer_opt *trace_opts = current_trace->flags->opts; - /* calulate max size */ + /* calculate max size */ for (i = 0; trace_options[i]; i++) { len += strlen(trace_options[i]); len += 3; /* "no" and space */ @@ -2930,7 +2158,7 @@ tracing_ctrl_write(struct file *filp, co { struct trace_array *tr = filp->private_data; char buf[64]; - long val; + unsigned long val; int ret; if (cnt >= sizeof(buf)) @@ -2985,7 +2213,13 @@ tracing_set_trace_read(struct file *filp return simple_read_from_buffer(ubuf, cnt, ppos, buf, r); } -static int tracing_set_tracer(char *buf) +int tracer_init(struct tracer *t, struct trace_array *tr) +{ + tracing_reset_online_cpus(tr); + return t->init(tr); +} + +static int tracing_set_tracer(const char *buf) { struct trace_array *tr = &global_trace; struct tracer *t; @@ -3009,7 +2243,7 @@ static int tracing_set_tracer(char *buf) current_trace = t; if (t->init) { - ret = t->init(tr); + ret = tracer_init(t, tr); if (ret) goto out; } @@ -3072,9 +2306,9 @@ static ssize_t tracing_max_lat_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos) { - long *ptr = filp->private_data; + unsigned long *ptr = filp->private_data; char buf[64]; - long val; + unsigned long val; int ret; if (cnt >= sizeof(buf)) @@ -3167,67 +2401,60 @@ tracing_poll_pipe(struct file *filp, pol } } -/* - * Consumer reader. - */ -static ssize_t -tracing_read_pipe(struct file *filp, char __user *ubuf, - size_t cnt, loff_t *ppos) + +void default_wait_pipe(struct trace_iterator *iter) { - struct trace_iterator *iter = filp->private_data; - ssize_t sret; + DEFINE_WAIT(wait); - /* return any leftover data */ - sret = trace_seq_to_user(&iter->seq, ubuf, cnt); - if (sret != -EBUSY) - return sret; + prepare_to_wait(&trace_wait, &wait, TASK_INTERRUPTIBLE); - trace_seq_reset(&iter->seq); + if (trace_empty(iter)) + schedule(); - mutex_lock(&trace_types_lock); - if (iter->trace->read) { - sret = iter->trace->read(iter, filp, ubuf, cnt, ppos); - if (sret) - goto out; - } + finish_wait(&trace_wait, &wait); +} + +/* + * This is a make-shift waitqueue. + * A tracer might use this callback on some rare cases: + * + * 1) the current tracer might hold the runqueue lock when it wakes up + * a reader, hence a deadlock (sched, function, and function graph tracers) + * 2) the function tracers, trace all functions, we don't want + * the overhead of calling wake_up and friends + * (and tracing them too) + * + * Anyway, this is really very primitive wakeup. + */ +void poll_wait_pipe(struct trace_iterator *iter) +{ + set_current_state(TASK_INTERRUPTIBLE); + /* sleep for 100 msecs, and try again. */ + schedule_timeout(HZ / 10); +} + +/* Must be called with trace_types_lock mutex held. */ +static int tracing_wait_pipe(struct file *filp) +{ + struct trace_iterator *iter = filp->private_data; -waitagain: - sret = 0; while (trace_empty(iter)) { if ((filp->f_flags & O_NONBLOCK)) { - sret = -EAGAIN; - goto out; + return -EAGAIN; } - /* - * This is a make-shift waitqueue. The reason we don't use - * an actual wait queue is because: - * 1) we only ever have one waiter - * 2) the tracing, traces all functions, we don't want - * the overhead of calling wake_up and friends - * (and tracing them too) - * Anyway, this is really very primitive wakeup. - */ - set_current_state(TASK_INTERRUPTIBLE); - iter->tr->waiter = current; - mutex_unlock(&trace_types_lock); - /* sleep for 100 msecs, and try again. */ - schedule_timeout(HZ/10); + iter->trace->wait_pipe(iter); mutex_lock(&trace_types_lock); - iter->tr->waiter = NULL; - - if (signal_pending(current)) { - sret = -EINTR; - goto out; - } + if (signal_pending(current)) + return -EINTR; if (iter->trace != current_trace) - goto out; + return 0; /* * We block until we read something and tracing is disabled. @@ -3240,13 +2467,45 @@ waitagain: */ if (!tracer_enabled && iter->pos) break; + } + + return 1; +} + +/* + * Consumer reader. + */ +static ssize_t +tracing_read_pipe(struct file *filp, char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + struct trace_iterator *iter = filp->private_data; + ssize_t sret; + + /* return any leftover data */ + sret = trace_seq_to_user(&iter->seq, ubuf, cnt); + if (sret != -EBUSY) + return sret; + + trace_seq_reset(&iter->seq); - continue; + mutex_lock(&trace_types_lock); + if (iter->trace->read) { + sret = iter->trace->read(iter, filp, ubuf, cnt, ppos); + if (sret) + goto out; } +waitagain: + sret = tracing_wait_pipe(filp); + if (sret <= 0) + goto out; + /* stop when tracing is finished */ - if (trace_empty(iter)) + if (trace_empty(iter)) { + sret = 0; goto out; + } if (cnt >= PAGE_SIZE) cnt = PAGE_SIZE - 1; @@ -3267,8 +2526,8 @@ waitagain: iter->seq.len = len; break; } - - trace_consume(iter); + if (ret != TRACE_TYPE_NO_CONSUME) + trace_consume(iter); if (iter->seq.len >= cnt) break; @@ -3292,6 +2551,134 @@ out: return sret; } +static void tracing_pipe_buf_release(struct pipe_inode_info *pipe, + struct pipe_buffer *buf) +{ + __free_page(buf->page); +} + +static void tracing_spd_release_pipe(struct splice_pipe_desc *spd, + unsigned int idx) +{ + __free_page(spd->pages[idx]); +} + +static struct pipe_buf_operations tracing_pipe_buf_ops = { + .can_merge = 0, + .map = generic_pipe_buf_map, + .unmap = generic_pipe_buf_unmap, + .confirm = generic_pipe_buf_confirm, + .release = tracing_pipe_buf_release, + .steal = generic_pipe_buf_steal, + .get = generic_pipe_buf_get, +}; + +static size_t +tracing_fill_pipe_page(size_t rem, struct trace_iterator *iter) +{ + size_t count; + int ret; + + /* Seq buffer is page-sized, exactly what we need. */ + for (;;) { + count = iter->seq.len; + ret = print_trace_line(iter); + count = iter->seq.len - count; + if (rem < count) { + rem = 0; + iter->seq.len -= count; + break; + } + if (ret == TRACE_TYPE_PARTIAL_LINE) { + iter->seq.len -= count; + break; + } + + trace_consume(iter); + rem -= count; + if (!find_next_entry_inc(iter)) { + rem = 0; + iter->ent = NULL; + break; + } + } + + return rem; +} + +static ssize_t tracing_splice_read_pipe(struct file *filp, + loff_t *ppos, + struct pipe_inode_info *pipe, + size_t len, + unsigned int flags) +{ + struct page *pages[PIPE_BUFFERS]; + struct partial_page partial[PIPE_BUFFERS]; + struct trace_iterator *iter = filp->private_data; + struct splice_pipe_desc spd = { + .pages = pages, + .partial = partial, + .nr_pages = 0, /* This gets updated below. */ + .flags = flags, + .ops = &tracing_pipe_buf_ops, + .spd_release = tracing_spd_release_pipe, + }; + ssize_t ret; + size_t rem; + unsigned int i; + + mutex_lock(&trace_types_lock); + + if (iter->trace->splice_read) { + ret = iter->trace->splice_read(iter, filp, + ppos, pipe, len, flags); + if (ret) + goto out_err; + } + + ret = tracing_wait_pipe(filp); + if (ret <= 0) + goto out_err; + + if (!iter->ent && !find_next_entry_inc(iter)) { + ret = -EFAULT; + goto out_err; + } + + /* Fill as many pages as possible. */ + for (i = 0, rem = len; i < PIPE_BUFFERS && rem; i++) { + pages[i] = alloc_page(GFP_KERNEL); + if (!pages[i]) + break; + + rem = tracing_fill_pipe_page(rem, iter); + + /* Copy the data into the page, so we can start over. */ + ret = trace_seq_to_buffer(&iter->seq, + page_address(pages[i]), + iter->seq.len); + if (ret < 0) { + __free_page(pages[i]); + break; + } + partial[i].offset = 0; + partial[i].len = iter->seq.len; + + trace_seq_reset(&iter->seq); + } + + mutex_unlock(&trace_types_lock); + + spd.nr_pages = i; + + return splice_to_pipe(pipe, &spd); + +out_err: + mutex_unlock(&trace_types_lock); + + return ret; +} + static ssize_t tracing_entries_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos) @@ -3455,6 +2842,7 @@ static struct file_operations tracing_pi .open = tracing_open_pipe, .poll = tracing_poll_pipe, .read = tracing_read_pipe, + .splice_read = tracing_splice_read_pipe, .release = tracing_release_pipe, }; @@ -3624,7 +3012,7 @@ static __init int tracer_init_debugfs(vo int trace_vprintk(unsigned long ip, int depth, const char *fmt, va_list args) { - static DEFINE_SPINLOCK(trace_buf_lock); + static DEFINE_RAW_SPINLOCK(trace_buf_lock); static char trace_buf[TRACE_BUF_SIZE]; struct ring_buffer_event *event; @@ -3653,18 +3041,16 @@ int trace_vprintk(unsigned long ip, int trace_buf[len] = 0; size = sizeof(*entry) + len + 1; - event = ring_buffer_lock_reserve(tr->buffer, size, &irq_flags); + event = trace_buffer_lock_reserve(tr, TRACE_PRINT, size, irq_flags, pc); if (!event) goto out_unlock; entry = ring_buffer_event_data(event); - tracing_generic_entry_update(&entry->ent, irq_flags, pc); - entry->ent.type = TRACE_PRINT; entry->ip = ip; entry->depth = depth; memcpy(&entry->buf, trace_buf, len); entry->buf[len] = 0; - ring_buffer_unlock_commit(tr->buffer, event, irq_flags); + ring_buffer_unlock_commit(tr->buffer, event); out_unlock: spin_unlock_irqrestore(&trace_buf_lock, irq_flags); @@ -3691,6 +3077,15 @@ int __ftrace_printk(unsigned long ip, co } EXPORT_SYMBOL_GPL(__ftrace_printk); +int __ftrace_vprintk(unsigned long ip, const char *fmt, va_list ap) +{ + if (!(trace_flags & TRACE_ITER_PRINTK)) + return 0; + + return trace_vprintk(ip, task_curr_ret_stack(current), fmt, ap); +} +EXPORT_SYMBOL_GPL(__ftrace_vprintk); + static int trace_panic_handler(struct notifier_block *this, unsigned long event, void *unused) { @@ -3755,7 +3150,7 @@ trace_printk_seq(struct trace_seq *s) void ftrace_dump(void) { - static DEFINE_SPINLOCK(ftrace_dump_lock); + static DEFINE_RAW_SPINLOCK(ftrace_dump_lock); /* use static because iter can be a bit big for the stack */ static struct trace_iterator iter; static int dump_ran; @@ -3871,14 +3266,10 @@ __init static int tracer_alloc_buffers(v trace_init_cmdlines(); register_tracer(&nop_trace); + current_trace = &nop_trace; #ifdef CONFIG_BOOT_TRACER register_tracer(&boot_tracer); - current_trace = &boot_tracer; - current_trace->init(&global_trace); -#else - current_trace = &nop_trace; #endif - /* All seems OK, enable tracing */ tracing_disabled = 0; @@ -3895,5 +3286,26 @@ out_free_buffer_mask: out: return ret; } + +__init static int clear_boot_tracer(void) +{ + /* + * The default tracer at boot buffer is an init section. + * This function is called in lateinit. If we did not + * find the boot tracer, then clear it out, to prevent + * later registration from accessing the buffer that is + * about to be freed. + */ + if (!default_bootup_tracer) + return 0; + + printk(KERN_INFO "ftrace bootup tracer '%s' not registered.\n", + default_bootup_tracer); + default_bootup_tracer = NULL; + + return 0; +} + early_initcall(tracer_alloc_buffers); fs_initcall(tracer_init_debugfs); +late_initcall(clear_boot_tracer); Index: linux-2.6-tip/kernel/trace/trace.h =================================================================== --- linux-2.6-tip.orig/kernel/trace/trace.h +++ linux-2.6-tip/kernel/trace/trace.h @@ -9,6 +9,8 @@ #include #include #include +#include +#include enum trace_type { __TRACE_FIRST_TYPE = 0, @@ -16,7 +18,6 @@ enum trace_type { TRACE_FN, TRACE_CTX, TRACE_WAKE, - TRACE_CONT, TRACE_STACK, TRACE_PRINT, TRACE_SPECIAL, @@ -29,9 +30,12 @@ enum trace_type { TRACE_GRAPH_ENT, TRACE_USER_STACK, TRACE_HW_BRANCHES, + TRACE_KMEM_ALLOC, + TRACE_KMEM_FREE, TRACE_POWER, + TRACE_BLK, - __TRACE_LAST_TYPE + __TRACE_LAST_TYPE, }; /* @@ -42,7 +46,6 @@ enum trace_type { */ struct trace_entry { unsigned char type; - unsigned char cpu; unsigned char flags; unsigned char preempt_count; int pid; @@ -60,13 +63,13 @@ struct ftrace_entry { /* Function call entry */ struct ftrace_graph_ent_entry { - struct trace_entry ent; + struct trace_entry ent; struct ftrace_graph_ent graph_ent; }; /* Function return entry */ struct ftrace_graph_ret_entry { - struct trace_entry ent; + struct trace_entry ent; struct ftrace_graph_ret ret; }; extern struct tracer boot_tracer; @@ -170,6 +173,24 @@ struct trace_power { struct power_trace state_data; }; +struct kmemtrace_alloc_entry { + struct trace_entry ent; + enum kmemtrace_type_id type_id; + unsigned long call_site; + const void *ptr; + size_t bytes_req; + size_t bytes_alloc; + gfp_t gfp_flags; + int node; +}; + +struct kmemtrace_free_entry { + struct trace_entry ent; + enum kmemtrace_type_id type_id; + unsigned long call_site; + const void *ptr; +}; + /* * trace_flag_type is an enumeration that holds different * states when a trace occurs. These are: @@ -178,7 +199,6 @@ struct trace_power { * NEED_RESCED - reschedule is requested * HARDIRQ - inside an interrupt handler * SOFTIRQ - inside a softirq handler - * CONT - multiple entries hold the trace item */ enum trace_flag_type { TRACE_FLAG_IRQS_OFF = 0x01, @@ -186,7 +206,6 @@ enum trace_flag_type { TRACE_FLAG_NEED_RESCHED = 0x04, TRACE_FLAG_HARDIRQ = 0x08, TRACE_FLAG_SOFTIRQ = 0x10, - TRACE_FLAG_CONT = 0x20, }; #define TRACE_BUF_SIZE 1024 @@ -262,7 +281,6 @@ extern void __ftrace_bad_type(void); do { \ IF_ASSIGN(var, ent, struct ftrace_entry, TRACE_FN); \ IF_ASSIGN(var, ent, struct ctx_switch_entry, 0); \ - IF_ASSIGN(var, ent, struct trace_field_cont, TRACE_CONT); \ IF_ASSIGN(var, ent, struct stack_entry, TRACE_STACK); \ IF_ASSIGN(var, ent, struct userstack_entry, TRACE_USER_STACK);\ IF_ASSIGN(var, ent, struct print_entry, TRACE_PRINT); \ @@ -280,6 +298,10 @@ extern void __ftrace_bad_type(void); TRACE_GRAPH_RET); \ IF_ASSIGN(var, ent, struct hw_branch_entry, TRACE_HW_BRANCHES);\ IF_ASSIGN(var, ent, struct trace_power, TRACE_POWER); \ + IF_ASSIGN(var, ent, struct kmemtrace_alloc_entry, \ + TRACE_KMEM_ALLOC); \ + IF_ASSIGN(var, ent, struct kmemtrace_free_entry, \ + TRACE_KMEM_FREE); \ __ftrace_bad_type(); \ } while (0) @@ -287,7 +309,8 @@ extern void __ftrace_bad_type(void); enum print_line_t { TRACE_TYPE_PARTIAL_LINE = 0, /* Retry after flushing the seq */ TRACE_TYPE_HANDLED = 1, - TRACE_TYPE_UNHANDLED = 2 /* Relay to other output functions */ + TRACE_TYPE_UNHANDLED = 2, /* Relay to other output functions */ + TRACE_TYPE_NO_CONSUME = 3 /* Handled but ask to not consume */ }; @@ -313,22 +336,45 @@ struct tracer_flags { /* Makes more easy to define a tracer opt */ #define TRACER_OPT(s, b) .name = #s, .bit = b -/* - * A specific tracer, represented by methods that operate on a trace array: + +/** + * struct tracer - a specific tracer and its callbacks to interact with debugfs + * @name: the name chosen to select it on the available_tracers file + * @init: called when one switches to this tracer (echo name > current_tracer) + * @reset: called when one switches to another tracer + * @start: called when tracing is unpaused (echo 1 > tracing_enabled) + * @stop: called when tracing is paused (echo 0 > tracing_enabled) + * @open: called when the trace file is opened + * @pipe_open: called when the trace_pipe file is opened + * @wait_pipe: override how the user waits for traces on trace_pipe + * @close: called when the trace file is released + * @read: override the default read callback on trace_pipe + * @splice_read: override the default splice_read callback on trace_pipe + * @selftest: selftest to run on boot (see trace_selftest.c) + * @print_headers: override the first lines that describe your columns + * @print_line: callback that prints a trace + * @set_flag: signals one of your private flags changed (trace_options file) + * @flags: your private flags */ struct tracer { const char *name; - /* Your tracer should raise a warning if init fails */ int (*init)(struct trace_array *tr); void (*reset)(struct trace_array *tr); void (*start)(struct trace_array *tr); void (*stop)(struct trace_array *tr); void (*open)(struct trace_iterator *iter); void (*pipe_open)(struct trace_iterator *iter); + void (*wait_pipe)(struct trace_iterator *iter); void (*close)(struct trace_iterator *iter); ssize_t (*read)(struct trace_iterator *iter, struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos); + ssize_t (*splice_read)(struct trace_iterator *iter, + struct file *filp, + loff_t *ppos, + struct pipe_inode_info *pipe, + size_t len, + unsigned int flags); #ifdef CONFIG_FTRACE_STARTUP_TEST int (*selftest)(struct tracer *trace, struct trace_array *tr); @@ -340,6 +386,7 @@ struct tracer { struct tracer *next; int print_max; struct tracer_flags *flags; + struct tracer_stat *stats; }; struct trace_seq { @@ -371,6 +418,7 @@ struct trace_iterator { cpumask_var_t started; }; +int tracer_init(struct tracer *t, struct trace_array *tr); int tracing_is_enabled(void); void trace_wake_up(void); void tracing_reset(struct trace_array *tr, int cpu); @@ -379,26 +427,42 @@ int tracing_open_generic(struct inode *i struct dentry *tracing_init_dentry(void); void init_tracer_sysprof_debugfs(struct dentry *d_tracer); +struct ring_buffer_event; + +struct ring_buffer_event *trace_buffer_lock_reserve(struct trace_array *tr, + unsigned char type, + unsigned long len, + unsigned long flags, + int pc); +void trace_buffer_unlock_commit(struct trace_array *tr, + struct ring_buffer_event *event, + unsigned long flags, int pc); + struct trace_entry *tracing_get_trace_entry(struct trace_array *tr, struct trace_array_cpu *data); + +struct trace_entry *trace_find_next_entry(struct trace_iterator *iter, + int *ent_cpu, u64 *ent_ts); + void tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags, int pc); +void default_wait_pipe(struct trace_iterator *iter); +void poll_wait_pipe(struct trace_iterator *iter); + void ftrace(struct trace_array *tr, struct trace_array_cpu *data, unsigned long ip, unsigned long parent_ip, unsigned long flags, int pc); void tracing_sched_switch_trace(struct trace_array *tr, - struct trace_array_cpu *data, struct task_struct *prev, struct task_struct *next, unsigned long flags, int pc); void tracing_record_cmdline(struct task_struct *tsk); void tracing_sched_wakeup_trace(struct trace_array *tr, - struct trace_array_cpu *data, struct task_struct *wakee, struct task_struct *cur, unsigned long flags, int pc); @@ -408,14 +472,12 @@ void trace_special(struct trace_array *t unsigned long arg2, unsigned long arg3, int pc); void trace_function(struct trace_array *tr, - struct trace_array_cpu *data, unsigned long ip, unsigned long parent_ip, unsigned long flags, int pc); void trace_graph_return(struct ftrace_graph_ret *trace); int trace_graph_entry(struct ftrace_graph_ent *trace); -void trace_hw_branch(struct trace_array *tr, u64 from, u64 to); void tracing_start_cmdline_record(void); void tracing_stop_cmdline_record(void); @@ -434,15 +496,11 @@ void update_max_tr(struct trace_array *t void update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu); -extern cycle_t ftrace_now(int cpu); +void __trace_stack(struct trace_array *tr, + unsigned long flags, + int skip, int pc); -#ifdef CONFIG_FUNCTION_TRACER -void tracing_start_function_trace(void); -void tracing_stop_function_trace(void); -#else -# define tracing_start_function_trace() do { } while (0) -# define tracing_stop_function_trace() do { } while (0) -#endif +extern cycle_t ftrace_now(int cpu); #ifdef CONFIG_CONTEXT_SWITCH_TRACER typedef void @@ -456,10 +514,10 @@ struct tracer_switch_ops { void *private; struct tracer_switch_ops *next; }; - -char *trace_find_cmdline(int pid); #endif /* CONFIG_CONTEXT_SWITCH_TRACER */ +extern char *trace_find_cmdline(int pid); + #ifdef CONFIG_DYNAMIC_FTRACE extern unsigned long ftrace_update_tot_cnt; #define DYN_FTRACE_TEST_NAME trace_selftest_dynamic_test_func @@ -469,6 +527,8 @@ extern int DYN_FTRACE_TEST_NAME(void); #ifdef CONFIG_FTRACE_STARTUP_TEST extern int trace_selftest_startup_function(struct tracer *trace, struct trace_array *tr); +extern int trace_selftest_startup_function_graph(struct tracer *trace, + struct trace_array *tr); extern int trace_selftest_startup_irqsoff(struct tracer *trace, struct trace_array *tr); extern int trace_selftest_startup_preemptoff(struct tracer *trace, @@ -488,15 +548,6 @@ extern int trace_selftest_startup_branch #endif /* CONFIG_FTRACE_STARTUP_TEST */ extern void *head_page(struct trace_array_cpu *data); -extern int trace_seq_printf(struct trace_seq *s, const char *fmt, ...); -extern void trace_seq_print_cont(struct trace_seq *s, - struct trace_iterator *iter); - -extern int -seq_print_ip_sym(struct trace_seq *s, unsigned long ip, - unsigned long sym_flags); -extern ssize_t trace_seq_to_user(struct trace_seq *s, char __user *ubuf, - size_t cnt); extern long ns2usecs(cycle_t nsec); extern int trace_vprintk(unsigned long ip, int depth, const char *fmt, va_list args); @@ -580,7 +631,8 @@ enum trace_iterator_flags { TRACE_ITER_ANNOTATE = 0x2000, TRACE_ITER_USERSTACKTRACE = 0x4000, TRACE_ITER_SYM_USEROBJ = 0x8000, - TRACE_ITER_PRINTK_MSGONLY = 0x10000 + TRACE_ITER_PRINTK_MSGONLY = 0x10000, + TRACE_ITER_CONTEXT_INFO = 0x20000 /* Print pid/cpu/time */ }; /* @@ -601,12 +653,12 @@ extern struct tracer nop_trace; * preempt_enable (after a disable), a schedule might take place * causing an infinite recursion. * - * To prevent this, we read the need_recshed flag before + * To prevent this, we read the need_resched flag before * disabling preemption. When we want to enable preemption we * check the flag, if it is set, then we call preempt_enable_no_resched. * Otherwise, we call preempt_enable. * - * The rational for doing the above is that if need resched is set + * The rational for doing the above is that if need_resched is set * and we have yet to reschedule, we are either in an atomic location * (where we do not need to check for scheduling) or we are inside * the scheduler and do not want to resched. @@ -627,7 +679,7 @@ static inline int ftrace_preempt_disable * * This is a scheduler safe way to enable preemption and not miss * any preemption checks. The disabled saved the state of preemption. - * If resched is set, then we were either inside an atomic or + * If resched is set, then we are either inside an atomic or * are inside the scheduler (we would have already scheduled * otherwise). In this case, we do not want to call normal * preempt_enable, but preempt_enable_no_resched instead. Index: linux-2.6-tip/kernel/trace/trace_boot.c =================================================================== --- linux-2.6-tip.orig/kernel/trace/trace_boot.c +++ linux-2.6-tip/kernel/trace/trace_boot.c @@ -11,6 +11,7 @@ #include #include "trace.h" +#include "trace_output.h" static struct trace_array *boot_trace; static bool pre_initcalls_finished; @@ -27,13 +28,13 @@ void start_boot_trace(void) void enable_boot_trace(void) { - if (pre_initcalls_finished) + if (boot_trace && pre_initcalls_finished) tracing_start_sched_switch_record(); } void disable_boot_trace(void) { - if (pre_initcalls_finished) + if (boot_trace && pre_initcalls_finished) tracing_stop_sched_switch_record(); } @@ -42,6 +43,9 @@ static int boot_trace_init(struct trace_ int cpu; boot_trace = tr; + if (!tr) + return 0; + for_each_cpu(cpu, cpu_possible_mask) tracing_reset(tr, cpu); @@ -128,10 +132,9 @@ void trace_boot_call(struct boot_trace_c { struct ring_buffer_event *event; struct trace_boot_call *entry; - unsigned long irq_flags; struct trace_array *tr = boot_trace; - if (!pre_initcalls_finished) + if (!tr || !pre_initcalls_finished) return; /* Get its name now since this function could @@ -140,18 +143,13 @@ void trace_boot_call(struct boot_trace_c sprint_symbol(bt->func, (unsigned long)fn); preempt_disable(); - event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), - &irq_flags); + event = trace_buffer_lock_reserve(tr, TRACE_BOOT_CALL, + sizeof(*entry), 0, 0); if (!event) goto out; entry = ring_buffer_event_data(event); - tracing_generic_entry_update(&entry->ent, 0, 0); - entry->ent.type = TRACE_BOOT_CALL; entry->boot_call = *bt; - ring_buffer_unlock_commit(tr->buffer, event, irq_flags); - - trace_wake_up(); - + trace_buffer_unlock_commit(tr, event, 0, 0); out: preempt_enable(); } @@ -160,27 +158,21 @@ void trace_boot_ret(struct boot_trace_re { struct ring_buffer_event *event; struct trace_boot_ret *entry; - unsigned long irq_flags; struct trace_array *tr = boot_trace; - if (!pre_initcalls_finished) + if (!tr || !pre_initcalls_finished) return; sprint_symbol(bt->func, (unsigned long)fn); preempt_disable(); - event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), - &irq_flags); + event = trace_buffer_lock_reserve(tr, TRACE_BOOT_RET, + sizeof(*entry), 0, 0); if (!event) goto out; entry = ring_buffer_event_data(event); - tracing_generic_entry_update(&entry->ent, 0, 0); - entry->ent.type = TRACE_BOOT_RET; entry->boot_ret = *bt; - ring_buffer_unlock_commit(tr->buffer, event, irq_flags); - - trace_wake_up(); - + trace_buffer_unlock_commit(tr, event, 0, 0); out: preempt_enable(); } Index: linux-2.6-tip/kernel/trace/trace_branch.c =================================================================== --- linux-2.6-tip.orig/kernel/trace/trace_branch.c +++ linux-2.6-tip/kernel/trace/trace_branch.c @@ -14,12 +14,17 @@ #include #include #include + #include "trace.h" +#include "trace_stat.h" +#include "trace_output.h" #ifdef CONFIG_BRANCH_TRACER +static struct tracer branch_trace; static int branch_tracing_enabled __read_mostly; static DEFINE_MUTEX(branch_tracing_mutex); + static struct trace_array *branch_tracer; static void @@ -28,7 +33,7 @@ probe_likely_condition(struct ftrace_bra struct trace_array *tr = branch_tracer; struct ring_buffer_event *event; struct trace_branch *entry; - unsigned long flags, irq_flags; + unsigned long flags; int cpu, pc; const char *p; @@ -47,15 +52,13 @@ probe_likely_condition(struct ftrace_bra if (atomic_inc_return(&tr->data[cpu]->disabled) != 1) goto out; - event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), - &irq_flags); + pc = preempt_count(); + event = trace_buffer_lock_reserve(tr, TRACE_BRANCH, + sizeof(*entry), flags, pc); if (!event) goto out; - pc = preempt_count(); entry = ring_buffer_event_data(event); - tracing_generic_entry_update(&entry->ent, flags, pc); - entry->ent.type = TRACE_BRANCH; /* Strip off the path, only save the file */ p = f->file + strlen(f->file); @@ -70,7 +73,7 @@ probe_likely_condition(struct ftrace_bra entry->line = f->line; entry->correct = val == expect; - ring_buffer_unlock_commit(tr->buffer, event, irq_flags); + ring_buffer_unlock_commit(tr->buffer, event); out: atomic_dec(&tr->data[cpu]->disabled); @@ -88,8 +91,6 @@ void trace_likely_condition(struct ftrac int enable_branch_tracing(struct trace_array *tr) { - int ret = 0; - mutex_lock(&branch_tracing_mutex); branch_tracer = tr; /* @@ -100,7 +101,7 @@ int enable_branch_tracing(struct trace_a branch_tracing_enabled++; mutex_unlock(&branch_tracing_mutex); - return ret; + return 0; } void disable_branch_tracing(void) @@ -128,11 +129,6 @@ static void stop_branch_trace(struct tra static int branch_trace_init(struct trace_array *tr) { - int cpu; - - for_each_online_cpu(cpu) - tracing_reset(tr, cpu); - start_branch_trace(tr); return 0; } @@ -142,22 +138,54 @@ static void branch_trace_reset(struct tr stop_branch_trace(tr); } -struct tracer branch_trace __read_mostly = +static enum print_line_t trace_branch_print(struct trace_iterator *iter, + int flags) +{ + struct trace_branch *field; + + trace_assign_type(field, iter->ent); + + if (trace_seq_printf(&iter->seq, "[%s] %s:%s:%d\n", + field->correct ? " ok " : " MISS ", + field->func, + field->file, + field->line)) + return TRACE_TYPE_PARTIAL_LINE; + + return TRACE_TYPE_HANDLED; +} + + +static struct trace_event trace_branch_event = { + .type = TRACE_BRANCH, + .trace = trace_branch_print, + .latency_trace = trace_branch_print, +}; + +static struct tracer branch_trace __read_mostly = { .name = "branch", .init = branch_trace_init, .reset = branch_trace_reset, #ifdef CONFIG_FTRACE_SELFTEST .selftest = trace_selftest_startup_branch, -#endif +#endif /* CONFIG_FTRACE_SELFTEST */ }; -__init static int init_branch_trace(void) +__init static int init_branch_tracer(void) { + int ret; + + ret = register_ftrace_event(&trace_branch_event); + if (!ret) { + printk(KERN_WARNING "Warning: could not register " + "branch events\n"); + return 1; + } return register_tracer(&branch_trace); } +device_initcall(init_branch_tracer); -device_initcall(init_branch_trace); #else static inline void trace_likely_condition(struct ftrace_branch_data *f, int val, int expect) @@ -183,66 +211,39 @@ void ftrace_likely_update(struct ftrace_ } EXPORT_SYMBOL(ftrace_likely_update); -struct ftrace_pointer { - void *start; - void *stop; - int hit; -}; +extern unsigned long __start_annotated_branch_profile[]; +extern unsigned long __stop_annotated_branch_profile[]; -static void * -t_next(struct seq_file *m, void *v, loff_t *pos) +static int annotated_branch_stat_headers(struct seq_file *m) { - const struct ftrace_pointer *f = m->private; - struct ftrace_branch_data *p = v; - - (*pos)++; - - if (v == (void *)1) - return f->start; - - ++p; - - if ((void *)p >= (void *)f->stop) - return NULL; - - return p; + seq_printf(m, " correct incorrect %% "); + seq_printf(m, " Function " + " File Line\n" + " ------- --------- - " + " -------- " + " ---- ----\n"); + return 0; } -static void *t_start(struct seq_file *m, loff_t *pos) +static inline long get_incorrect_percent(struct ftrace_branch_data *p) { - void *t = (void *)1; - loff_t l = 0; - - for (; t && l < *pos; t = t_next(m, t, &l)) - ; + long percent; - return t; -} + if (p->correct) { + percent = p->incorrect * 100; + percent /= p->correct + p->incorrect; + } else + percent = p->incorrect ? 100 : -1; -static void t_stop(struct seq_file *m, void *p) -{ + return percent; } -static int t_show(struct seq_file *m, void *v) +static int branch_stat_show(struct seq_file *m, void *v) { - const struct ftrace_pointer *fp = m->private; struct ftrace_branch_data *p = v; const char *f; long percent; - if (v == (void *)1) { - if (fp->hit) - seq_printf(m, " miss hit %% "); - else - seq_printf(m, " correct incorrect %% "); - seq_printf(m, " Function " - " File Line\n" - " ------- --------- - " - " -------- " - " ---- ----\n"); - return 0; - } - /* Only print the file, not the path */ f = p->file + strlen(p->file); while (f >= p->file && *f != '/') @@ -252,11 +253,7 @@ static int t_show(struct seq_file *m, vo /* * The miss is overlayed on correct, and hit on incorrect. */ - if (p->correct) { - percent = p->incorrect * 100; - percent /= p->correct + p->incorrect; - } else - percent = p->incorrect ? 100 : -1; + percent = get_incorrect_percent(p); seq_printf(m, "%8lu %8lu ", p->correct, p->incorrect); if (percent < 0) @@ -267,76 +264,118 @@ static int t_show(struct seq_file *m, vo return 0; } -static struct seq_operations tracing_likely_seq_ops = { - .start = t_start, - .next = t_next, - .stop = t_stop, - .show = t_show, +static void *annotated_branch_stat_start(void) +{ + return __start_annotated_branch_profile; +} + +static void * +annotated_branch_stat_next(void *v, int idx) +{ + struct ftrace_branch_data *p = v; + + ++p; + + if ((void *)p >= (void *)__stop_annotated_branch_profile) + return NULL; + + return p; +} + +static int annotated_branch_stat_cmp(void *p1, void *p2) +{ + struct ftrace_branch_data *a = p1; + struct ftrace_branch_data *b = p2; + + long percent_a, percent_b; + + percent_a = get_incorrect_percent(a); + percent_b = get_incorrect_percent(b); + + if (percent_a < percent_b) + return -1; + if (percent_a > percent_b) + return 1; + else + return 0; +} + +static struct tracer_stat annotated_branch_stats = { + .name = "branch_annotated", + .stat_start = annotated_branch_stat_start, + .stat_next = annotated_branch_stat_next, + .stat_cmp = annotated_branch_stat_cmp, + .stat_headers = annotated_branch_stat_headers, + .stat_show = branch_stat_show }; -static int tracing_branch_open(struct inode *inode, struct file *file) +__init static int init_annotated_branch_stats(void) { int ret; - ret = seq_open(file, &tracing_likely_seq_ops); + ret = register_stat_tracer(&annotated_branch_stats); if (!ret) { - struct seq_file *m = file->private_data; - m->private = (void *)inode->i_private; + printk(KERN_WARNING "Warning: could not register " + "annotated branches stats\n"); + return 1; } - - return ret; + return 0; } - -static const struct file_operations tracing_branch_fops = { - .open = tracing_branch_open, - .read = seq_read, - .llseek = seq_lseek, -}; +fs_initcall(init_annotated_branch_stats); #ifdef CONFIG_PROFILE_ALL_BRANCHES + extern unsigned long __start_branch_profile[]; extern unsigned long __stop_branch_profile[]; -static const struct ftrace_pointer ftrace_branch_pos = { - .start = __start_branch_profile, - .stop = __stop_branch_profile, - .hit = 1, -}; +static int all_branch_stat_headers(struct seq_file *m) +{ + seq_printf(m, " miss hit %% "); + seq_printf(m, " Function " + " File Line\n" + " ------- --------- - " + " -------- " + " ---- ----\n"); + return 0; +} -#endif /* CONFIG_PROFILE_ALL_BRANCHES */ +static void *all_branch_stat_start(void) +{ + return __start_branch_profile; +} -extern unsigned long __start_annotated_branch_profile[]; -extern unsigned long __stop_annotated_branch_profile[]; +static void * +all_branch_stat_next(void *v, int idx) +{ + struct ftrace_branch_data *p = v; -static const struct ftrace_pointer ftrace_annotated_branch_pos = { - .start = __start_annotated_branch_profile, - .stop = __stop_annotated_branch_profile, -}; + ++p; -static __init int ftrace_branch_init(void) -{ - struct dentry *d_tracer; - struct dentry *entry; + if ((void *)p >= (void *)__stop_branch_profile) + return NULL; - d_tracer = tracing_init_dentry(); + return p; +} - entry = debugfs_create_file("profile_annotated_branch", 0444, d_tracer, - (void *)&ftrace_annotated_branch_pos, - &tracing_branch_fops); - if (!entry) - pr_warning("Could not create debugfs " - "'profile_annotatet_branch' entry\n"); +static struct tracer_stat all_branch_stats = { + .name = "branch_all", + .stat_start = all_branch_stat_start, + .stat_next = all_branch_stat_next, + .stat_headers = all_branch_stat_headers, + .stat_show = branch_stat_show +}; -#ifdef CONFIG_PROFILE_ALL_BRANCHES - entry = debugfs_create_file("profile_branch", 0444, d_tracer, - (void *)&ftrace_branch_pos, - &tracing_branch_fops); - if (!entry) - pr_warning("Could not create debugfs" - " 'profile_branch' entry\n"); -#endif +__init static int all_annotated_branch_stats(void) +{ + int ret; + ret = register_stat_tracer(&all_branch_stats); + if (!ret) { + printk(KERN_WARNING "Warning: could not register " + "all branches stats\n"); + return 1; + } return 0; } - -device_initcall(ftrace_branch_init); +fs_initcall(all_annotated_branch_stats); +#endif /* CONFIG_PROFILE_ALL_BRANCHES */ Index: linux-2.6-tip/kernel/trace/trace_functions.c =================================================================== --- linux-2.6-tip.orig/kernel/trace/trace_functions.c +++ linux-2.6-tip/kernel/trace/trace_functions.c @@ -9,6 +9,7 @@ * Copyright (C) 2004-2006 Ingo Molnar * Copyright (C) 2004 William Lee Irwin III */ +#include #include #include #include @@ -16,52 +17,388 @@ #include "trace.h" -static void start_function_trace(struct trace_array *tr) +/* function tracing enabled */ +static int ftrace_function_enabled; + +static struct trace_array *func_trace; + +static void tracing_start_function_trace(void); +static void tracing_stop_function_trace(void); + +static int function_trace_init(struct trace_array *tr) { + func_trace = tr; tr->cpu = get_cpu(); - tracing_reset_online_cpus(tr); put_cpu(); tracing_start_cmdline_record(); tracing_start_function_trace(); + return 0; } -static void stop_function_trace(struct trace_array *tr) +static void function_trace_reset(struct trace_array *tr) { tracing_stop_function_trace(); tracing_stop_cmdline_record(); } -static int function_trace_init(struct trace_array *tr) +static void function_trace_start(struct trace_array *tr) { - start_function_trace(tr); - return 0; + tracing_reset_online_cpus(tr); } -static void function_trace_reset(struct trace_array *tr) +static void +function_trace_call_preempt_only(unsigned long ip, unsigned long parent_ip) +{ + struct trace_array *tr = func_trace; + struct trace_array_cpu *data; + unsigned long flags; + long disabled; + int cpu, resched; + int pc; + + if (unlikely(!ftrace_function_enabled)) + return; + + pc = preempt_count(); + resched = ftrace_preempt_disable(); + local_save_flags(flags); + cpu = raw_smp_processor_id(); + data = tr->data[cpu]; + disabled = atomic_inc_return(&data->disabled); + + if (likely(disabled == 1)) + trace_function(tr, ip, parent_ip, flags, pc); + + atomic_dec(&data->disabled); + ftrace_preempt_enable(resched); +} + +static void +function_trace_call(unsigned long ip, unsigned long parent_ip) { - stop_function_trace(tr); + struct trace_array *tr = func_trace; + struct trace_array_cpu *data; + unsigned long flags; + long disabled; + int cpu; + int pc; + + if (unlikely(!ftrace_function_enabled)) + return; + + /* + * Need to use raw, since this must be called before the + * recursive protection is performed. + */ + local_irq_save(flags); + cpu = raw_smp_processor_id(); + data = tr->data[cpu]; + disabled = atomic_inc_return(&data->disabled); + + if (likely(disabled == 1)) { + pc = preempt_count(); + trace_function(tr, ip, parent_ip, flags, pc); + } + + atomic_dec(&data->disabled); + local_irq_restore(flags); } -static void function_trace_start(struct trace_array *tr) +static void +function_stack_trace_call(unsigned long ip, unsigned long parent_ip) { - tracing_reset_online_cpus(tr); + struct trace_array *tr = func_trace; + struct trace_array_cpu *data; + unsigned long flags; + long disabled; + int cpu; + int pc; + + if (unlikely(!ftrace_function_enabled)) + return; + + /* + * Need to use raw, since this must be called before the + * recursive protection is performed. + */ + local_irq_save(flags); + cpu = raw_smp_processor_id(); + data = tr->data[cpu]; + disabled = atomic_inc_return(&data->disabled); + + if (likely(disabled == 1)) { + pc = preempt_count(); + trace_function(tr, ip, parent_ip, flags, pc); + /* + * skip over 5 funcs: + * __ftrace_trace_stack, + * __trace_stack, + * function_stack_trace_call + * ftrace_list_func + * ftrace_call + */ + __trace_stack(tr, flags, 5, pc); + } + + atomic_dec(&data->disabled); + local_irq_restore(flags); +} + + +static struct ftrace_ops trace_ops __read_mostly = +{ + .func = function_trace_call, +}; + +static struct ftrace_ops trace_stack_ops __read_mostly = +{ + .func = function_stack_trace_call, +}; + +/* Our two options */ +enum { + TRACE_FUNC_OPT_STACK = 0x1, +}; + +static struct tracer_opt func_opts[] = { +#ifdef CONFIG_STACKTRACE + { TRACER_OPT(func_stack_trace, TRACE_FUNC_OPT_STACK) }, +#endif + { } /* Always set a last empty entry */ +}; + +static struct tracer_flags func_flags = { + .val = 0, /* By default: all flags disabled */ + .opts = func_opts +}; + +static void tracing_start_function_trace(void) +{ + ftrace_function_enabled = 0; + + if (trace_flags & TRACE_ITER_PREEMPTONLY) + trace_ops.func = function_trace_call_preempt_only; + else + trace_ops.func = function_trace_call; + + if (func_flags.val & TRACE_FUNC_OPT_STACK) + register_ftrace_function(&trace_stack_ops); + else + register_ftrace_function(&trace_ops); + + ftrace_function_enabled = 1; +} + +static void tracing_stop_function_trace(void) +{ + ftrace_function_enabled = 0; + /* OK if they are not registered */ + unregister_ftrace_function(&trace_stack_ops); + unregister_ftrace_function(&trace_ops); +} + +static int func_set_flag(u32 old_flags, u32 bit, int set) +{ + if (bit == TRACE_FUNC_OPT_STACK) { + /* do nothing if already set */ + if (!!set == !!(func_flags.val & TRACE_FUNC_OPT_STACK)) + return 0; + + if (set) { + unregister_ftrace_function(&trace_ops); + register_ftrace_function(&trace_stack_ops); + } else { + unregister_ftrace_function(&trace_stack_ops); + register_ftrace_function(&trace_ops); + } + + return 0; + } + + return -EINVAL; } static struct tracer function_trace __read_mostly = { - .name = "function", - .init = function_trace_init, - .reset = function_trace_reset, - .start = function_trace_start, + .name = "function", + .init = function_trace_init, + .reset = function_trace_reset, + .start = function_trace_start, + .wait_pipe = poll_wait_pipe, + .flags = &func_flags, + .set_flag = func_set_flag, #ifdef CONFIG_FTRACE_SELFTEST - .selftest = trace_selftest_startup_function, + .selftest = trace_selftest_startup_function, #endif }; +#ifdef CONFIG_DYNAMIC_FTRACE +static void +ftrace_traceon(unsigned long ip, unsigned long parent_ip, void **data) +{ + long *count = (long *)data; + + if (tracing_is_on()) + return; + + if (!*count) + return; + + if (*count != -1) + (*count)--; + + tracing_on(); +} + +static void +ftrace_traceoff(unsigned long ip, unsigned long parent_ip, void **data) +{ + long *count = (long *)data; + + if (!tracing_is_on()) + return; + + if (!*count) + return; + + if (*count != -1) + (*count)--; + + tracing_off(); +} + +static int +ftrace_trace_onoff_print(struct seq_file *m, unsigned long ip, + struct ftrace_probe_ops *ops, void *data); + +static struct ftrace_probe_ops traceon_probe_ops = { + .func = ftrace_traceon, + .print = ftrace_trace_onoff_print, +}; + +static struct ftrace_probe_ops traceoff_probe_ops = { + .func = ftrace_traceoff, + .print = ftrace_trace_onoff_print, +}; + +static int +ftrace_trace_onoff_print(struct seq_file *m, unsigned long ip, + struct ftrace_probe_ops *ops, void *data) +{ + char str[KSYM_SYMBOL_LEN]; + long count = (long)data; + + kallsyms_lookup(ip, NULL, NULL, NULL, str); + seq_printf(m, "%s:", str); + + if (ops == &traceon_probe_ops) + seq_printf(m, "traceon"); + else + seq_printf(m, "traceoff"); + + if (count == -1) + seq_printf(m, ":unlimited\n"); + else + seq_printf(m, ":count=%ld", count); + seq_putc(m, '\n'); + + return 0; +} + +static int +ftrace_trace_onoff_unreg(char *glob, char *cmd, char *param) +{ + struct ftrace_probe_ops *ops; + + /* we register both traceon and traceoff to this callback */ + if (strcmp(cmd, "traceon") == 0) + ops = &traceon_probe_ops; + else + ops = &traceoff_probe_ops; + + unregister_ftrace_function_probe_func(glob, ops); + + return 0; +} + +static int +ftrace_trace_onoff_callback(char *glob, char *cmd, char *param, int enable) +{ + struct ftrace_probe_ops *ops; + void *count = (void *)-1; + char *number; + int ret; + + /* hash funcs only work with set_ftrace_filter */ + if (!enable) + return -EINVAL; + + if (glob[0] == '!') + return ftrace_trace_onoff_unreg(glob+1, cmd, param); + + /* we register both traceon and traceoff to this callback */ + if (strcmp(cmd, "traceon") == 0) + ops = &traceon_probe_ops; + else + ops = &traceoff_probe_ops; + + if (!param) + goto out_reg; + + number = strsep(¶m, ":"); + + if (!strlen(number)) + goto out_reg; + + /* + * We use the callback data field (which is a pointer) + * as our counter. + */ + ret = strict_strtoul(number, 0, (unsigned long *)&count); + if (ret) + return ret; + + out_reg: + ret = register_ftrace_function_probe(glob, ops, count); + + return ret; +} + +static struct ftrace_func_command ftrace_traceon_cmd = { + .name = "traceon", + .func = ftrace_trace_onoff_callback, +}; + +static struct ftrace_func_command ftrace_traceoff_cmd = { + .name = "traceoff", + .func = ftrace_trace_onoff_callback, +}; + +static int __init init_func_cmd_traceon(void) +{ + int ret; + + ret = register_ftrace_command(&ftrace_traceoff_cmd); + if (ret) + return ret; + + ret = register_ftrace_command(&ftrace_traceon_cmd); + if (ret) + unregister_ftrace_command(&ftrace_traceoff_cmd); + return ret; +} +#else +static inline int init_func_cmd_traceon(void) +{ + return 0; +} +#endif /* CONFIG_DYNAMIC_FTRACE */ + static __init int init_function_trace(void) { + init_func_cmd_traceon(); return register_tracer(&function_trace); } - device_initcall(init_function_trace); + Index: linux-2.6-tip/kernel/trace/trace_functions_graph.c =================================================================== --- linux-2.6-tip.orig/kernel/trace/trace_functions_graph.c +++ linux-2.6-tip/kernel/trace/trace_functions_graph.c @@ -1,7 +1,7 @@ /* * * Function graph tracer. - * Copyright (c) 2008 Frederic Weisbecker + * Copyright (c) 2008-2009 Frederic Weisbecker * Mostly borrowed from function tracer which * is Copyright (c) Steven Rostedt * @@ -12,6 +12,7 @@ #include #include "trace.h" +#include "trace_output.h" #define TRACE_GRAPH_INDENT 2 @@ -20,9 +21,11 @@ #define TRACE_GRAPH_PRINT_CPU 0x2 #define TRACE_GRAPH_PRINT_OVERHEAD 0x4 #define TRACE_GRAPH_PRINT_PROC 0x8 +#define TRACE_GRAPH_PRINT_DURATION 0x10 +#define TRACE_GRAPH_PRINT_ABS_TIME 0X20 static struct tracer_opt trace_opts[] = { - /* Display overruns ? */ + /* Display overruns? (for self-debug purpose) */ { TRACER_OPT(funcgraph-overrun, TRACE_GRAPH_PRINT_OVERRUN) }, /* Display CPU ? */ { TRACER_OPT(funcgraph-cpu, TRACE_GRAPH_PRINT_CPU) }, @@ -30,26 +33,101 @@ static struct tracer_opt trace_opts[] = { TRACER_OPT(funcgraph-overhead, TRACE_GRAPH_PRINT_OVERHEAD) }, /* Display proc name/pid */ { TRACER_OPT(funcgraph-proc, TRACE_GRAPH_PRINT_PROC) }, + /* Display duration of execution */ + { TRACER_OPT(funcgraph-duration, TRACE_GRAPH_PRINT_DURATION) }, + /* Display absolute time of an entry */ + { TRACER_OPT(funcgraph-abstime, TRACE_GRAPH_PRINT_ABS_TIME) }, { } /* Empty entry */ }; static struct tracer_flags tracer_flags = { /* Don't display overruns and proc by default */ - .val = TRACE_GRAPH_PRINT_CPU | TRACE_GRAPH_PRINT_OVERHEAD, + .val = TRACE_GRAPH_PRINT_CPU | TRACE_GRAPH_PRINT_OVERHEAD | + TRACE_GRAPH_PRINT_DURATION, .opts = trace_opts }; /* pid on the last trace processed */ -static pid_t last_pid[NR_CPUS] = { [0 ... NR_CPUS-1] = -1 }; -static int graph_trace_init(struct trace_array *tr) + +/* Add a function return address to the trace stack on thread info.*/ +int +ftrace_push_return_trace(unsigned long ret, unsigned long long time, + unsigned long func, int *depth) +{ + int index; + + if (!current->ret_stack) + return -EBUSY; + + /* The return trace stack is full */ + if (current->curr_ret_stack == FTRACE_RETFUNC_DEPTH - 1) { + atomic_inc(¤t->trace_overrun); + return -EBUSY; + } + + index = ++current->curr_ret_stack; + barrier(); + current->ret_stack[index].ret = ret; + current->ret_stack[index].func = func; + current->ret_stack[index].calltime = time; + *depth = index; + + return 0; +} + +/* Retrieve a function return address to the trace stack on thread info.*/ +void +ftrace_pop_return_trace(struct ftrace_graph_ret *trace, unsigned long *ret) +{ + int index; + + index = current->curr_ret_stack; + + if (unlikely(index < 0)) { + ftrace_graph_stop(); + WARN_ON(1); + /* Might as well panic, otherwise we have no where to go */ + *ret = (unsigned long)panic; + return; + } + + *ret = current->ret_stack[index].ret; + trace->func = current->ret_stack[index].func; + trace->calltime = current->ret_stack[index].calltime; + trace->overrun = atomic_read(¤t->trace_overrun); + trace->depth = index; + barrier(); + current->curr_ret_stack--; + +} + +/* + * Send the trace to the ring-buffer. + * @return the original return address. + */ +unsigned long ftrace_return_to_handler(void) { - int cpu, ret; + struct ftrace_graph_ret trace; + unsigned long ret; - for_each_online_cpu(cpu) - tracing_reset(tr, cpu); + ftrace_pop_return_trace(&trace, &ret); + trace.rettime = cpu_clock(raw_smp_processor_id()); + ftrace_graph_return(&trace); + + if (unlikely(!ret)) { + ftrace_graph_stop(); + WARN_ON(1); + /* Might as well panic. What else to do? */ + ret = (unsigned long)panic; + } + + return ret; +} - ret = register_ftrace_graph(&trace_graph_return, +static int graph_trace_init(struct trace_array *tr) +{ + int ret = register_ftrace_graph(&trace_graph_return, &trace_graph_entry); if (ret) return ret; @@ -153,17 +231,25 @@ print_graph_proc(struct trace_seq *s, pi /* If the pid changed since the last trace, output this event */ static enum print_line_t -verif_pid(struct trace_seq *s, pid_t pid, int cpu) +verif_pid(struct trace_seq *s, pid_t pid, int cpu, pid_t *last_pids_cpu) { pid_t prev_pid; + pid_t *last_pid; int ret; - if (last_pid[cpu] != -1 && last_pid[cpu] == pid) + if (!last_pids_cpu) return TRACE_TYPE_HANDLED; - prev_pid = last_pid[cpu]; - last_pid[cpu] = pid; + last_pid = per_cpu_ptr(last_pids_cpu, cpu); + if (*last_pid == pid) + return TRACE_TYPE_HANDLED; + + prev_pid = *last_pid; + *last_pid = pid; + + if (prev_pid == -1) + return TRACE_TYPE_HANDLED; /* * Context-switch trace line: @@ -175,34 +261,34 @@ verif_pid(struct trace_seq *s, pid_t pid ret = trace_seq_printf(s, " ------------------------------------------\n"); if (!ret) - TRACE_TYPE_PARTIAL_LINE; + return TRACE_TYPE_PARTIAL_LINE; ret = print_graph_cpu(s, cpu); if (ret == TRACE_TYPE_PARTIAL_LINE) - TRACE_TYPE_PARTIAL_LINE; + return TRACE_TYPE_PARTIAL_LINE; ret = print_graph_proc(s, prev_pid); if (ret == TRACE_TYPE_PARTIAL_LINE) - TRACE_TYPE_PARTIAL_LINE; + return TRACE_TYPE_PARTIAL_LINE; ret = trace_seq_printf(s, " => "); if (!ret) - TRACE_TYPE_PARTIAL_LINE; + return TRACE_TYPE_PARTIAL_LINE; ret = print_graph_proc(s, pid); if (ret == TRACE_TYPE_PARTIAL_LINE) - TRACE_TYPE_PARTIAL_LINE; + return TRACE_TYPE_PARTIAL_LINE; ret = trace_seq_printf(s, "\n ------------------------------------------\n\n"); if (!ret) - TRACE_TYPE_PARTIAL_LINE; + return TRACE_TYPE_PARTIAL_LINE; - return ret; + return TRACE_TYPE_HANDLED; } -static bool -trace_branch_is_leaf(struct trace_iterator *iter, +static struct ftrace_graph_ret_entry * +get_return_for_leaf(struct trace_iterator *iter, struct ftrace_graph_ent_entry *curr) { struct ring_buffer_iter *ring_iter; @@ -211,65 +297,123 @@ trace_branch_is_leaf(struct trace_iterat ring_iter = iter->buffer_iter[iter->cpu]; - if (!ring_iter) - return false; - - event = ring_buffer_iter_peek(ring_iter, NULL); + /* First peek to compare current entry and the next one */ + if (ring_iter) + event = ring_buffer_iter_peek(ring_iter, NULL); + else { + /* We need to consume the current entry to see the next one */ + ring_buffer_consume(iter->tr->buffer, iter->cpu, NULL); + event = ring_buffer_peek(iter->tr->buffer, iter->cpu, + NULL); + } if (!event) - return false; + return NULL; next = ring_buffer_event_data(event); if (next->ent.type != TRACE_GRAPH_RET) - return false; + return NULL; if (curr->ent.pid != next->ent.pid || curr->graph_ent.func != next->ret.func) - return false; + return NULL; + + /* this is a leaf, now advance the iterator */ + if (ring_iter) + ring_buffer_read(ring_iter, NULL); + + return next; +} + +/* Signal a overhead of time execution to the output */ +static int +print_graph_overhead(unsigned long long duration, struct trace_seq *s) +{ + /* If duration disappear, we don't need anything */ + if (!(tracer_flags.val & TRACE_GRAPH_PRINT_DURATION)) + return 1; + + /* Non nested entry or return */ + if (duration == -1) + return trace_seq_printf(s, " "); + + if (tracer_flags.val & TRACE_GRAPH_PRINT_OVERHEAD) { + /* Duration exceeded 100 msecs */ + if (duration > 100000ULL) + return trace_seq_printf(s, "! "); + + /* Duration exceeded 10 msecs */ + if (duration > 10000ULL) + return trace_seq_printf(s, "+ "); + } + + return trace_seq_printf(s, " "); +} + +static int print_graph_abs_time(u64 t, struct trace_seq *s) +{ + unsigned long usecs_rem; + + usecs_rem = do_div(t, NSEC_PER_SEC); + usecs_rem /= 1000; - return true; + return trace_seq_printf(s, "%5lu.%06lu | ", + (unsigned long)t, usecs_rem); } static enum print_line_t -print_graph_irq(struct trace_seq *s, unsigned long addr, - enum trace_type type, int cpu, pid_t pid) +print_graph_irq(struct trace_iterator *iter, unsigned long addr, + enum trace_type type, int cpu, pid_t pid) { int ret; + struct trace_seq *s = &iter->seq; if (addr < (unsigned long)__irqentry_text_start || addr >= (unsigned long)__irqentry_text_end) return TRACE_TYPE_UNHANDLED; - if (type == TRACE_GRAPH_ENT) { - ret = trace_seq_printf(s, "==========> | "); - } else { - /* Cpu */ - if (tracer_flags.val & TRACE_GRAPH_PRINT_CPU) { - ret = print_graph_cpu(s, cpu); - if (ret == TRACE_TYPE_PARTIAL_LINE) - return TRACE_TYPE_PARTIAL_LINE; - } - /* Proc */ - if (tracer_flags.val & TRACE_GRAPH_PRINT_PROC) { - ret = print_graph_proc(s, pid); - if (ret == TRACE_TYPE_PARTIAL_LINE) - return TRACE_TYPE_PARTIAL_LINE; + /* Absolute time */ + if (tracer_flags.val & TRACE_GRAPH_PRINT_ABS_TIME) { + ret = print_graph_abs_time(iter->ts, s); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + } - ret = trace_seq_printf(s, " | "); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - } + /* Cpu */ + if (tracer_flags.val & TRACE_GRAPH_PRINT_CPU) { + ret = print_graph_cpu(s, cpu); + if (ret == TRACE_TYPE_PARTIAL_LINE) + return TRACE_TYPE_PARTIAL_LINE; + } + /* Proc */ + if (tracer_flags.val & TRACE_GRAPH_PRINT_PROC) { + ret = print_graph_proc(s, pid); + if (ret == TRACE_TYPE_PARTIAL_LINE) + return TRACE_TYPE_PARTIAL_LINE; + ret = trace_seq_printf(s, " | "); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + } - /* No overhead */ - if (tracer_flags.val & TRACE_GRAPH_PRINT_OVERHEAD) { - ret = trace_seq_printf(s, " "); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - } + /* No overhead */ + ret = print_graph_overhead(-1, s); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + if (type == TRACE_GRAPH_ENT) + ret = trace_seq_printf(s, "==========>"); + else + ret = trace_seq_printf(s, "<=========="); + + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + /* Don't close the duration column if haven't one */ + if (tracer_flags.val & TRACE_GRAPH_PRINT_DURATION) + trace_seq_printf(s, " |"); + ret = trace_seq_printf(s, "\n"); - ret = trace_seq_printf(s, "<========== |\n"); - } if (!ret) return TRACE_TYPE_PARTIAL_LINE; return TRACE_TYPE_HANDLED; @@ -288,7 +432,7 @@ print_graph_duration(unsigned long long sprintf(msecs_str, "%lu", (unsigned long) duration); /* Print msecs */ - ret = trace_seq_printf(s, msecs_str); + ret = trace_seq_printf(s, "%s", msecs_str); if (!ret) return TRACE_TYPE_PARTIAL_LINE; @@ -321,51 +465,33 @@ print_graph_duration(unsigned long long } -/* Signal a overhead of time execution to the output */ -static int -print_graph_overhead(unsigned long long duration, struct trace_seq *s) -{ - /* Duration exceeded 100 msecs */ - if (duration > 100000ULL) - return trace_seq_printf(s, "! "); - - /* Duration exceeded 10 msecs */ - if (duration > 10000ULL) - return trace_seq_printf(s, "+ "); - - return trace_seq_printf(s, " "); -} - /* Case of a leaf function on its call entry */ static enum print_line_t print_graph_entry_leaf(struct trace_iterator *iter, - struct ftrace_graph_ent_entry *entry, struct trace_seq *s) + struct ftrace_graph_ent_entry *entry, + struct ftrace_graph_ret_entry *ret_entry, struct trace_seq *s) { - struct ftrace_graph_ret_entry *ret_entry; struct ftrace_graph_ret *graph_ret; - struct ring_buffer_event *event; struct ftrace_graph_ent *call; unsigned long long duration; int ret; int i; - event = ring_buffer_read(iter->buffer_iter[iter->cpu], NULL); - ret_entry = ring_buffer_event_data(event); graph_ret = &ret_entry->ret; call = &entry->graph_ent; duration = graph_ret->rettime - graph_ret->calltime; /* Overhead */ - if (tracer_flags.val & TRACE_GRAPH_PRINT_OVERHEAD) { - ret = print_graph_overhead(duration, s); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - } + ret = print_graph_overhead(duration, s); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; /* Duration */ - ret = print_graph_duration(duration, s); - if (ret == TRACE_TYPE_PARTIAL_LINE) - return TRACE_TYPE_PARTIAL_LINE; + if (tracer_flags.val & TRACE_GRAPH_PRINT_DURATION) { + ret = print_graph_duration(duration, s); + if (ret == TRACE_TYPE_PARTIAL_LINE) + return TRACE_TYPE_PARTIAL_LINE; + } /* Function */ for (i = 0; i < call->depth * TRACE_GRAPH_INDENT; i++) { @@ -394,25 +520,17 @@ print_graph_entry_nested(struct ftrace_g struct ftrace_graph_ent *call = &entry->graph_ent; /* No overhead */ - if (tracer_flags.val & TRACE_GRAPH_PRINT_OVERHEAD) { - ret = trace_seq_printf(s, " "); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - } + ret = print_graph_overhead(-1, s); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; - /* Interrupt */ - ret = print_graph_irq(s, call->func, TRACE_GRAPH_ENT, cpu, pid); - if (ret == TRACE_TYPE_UNHANDLED) { - /* No time */ + /* No time */ + if (tracer_flags.val & TRACE_GRAPH_PRINT_DURATION) { ret = trace_seq_printf(s, " | "); if (!ret) return TRACE_TYPE_PARTIAL_LINE; - } else { - if (ret == TRACE_TYPE_PARTIAL_LINE) - return TRACE_TYPE_PARTIAL_LINE; } - /* Function */ for (i = 0; i < call->depth * TRACE_GRAPH_INDENT; i++) { ret = trace_seq_printf(s, " "); @@ -428,20 +546,40 @@ print_graph_entry_nested(struct ftrace_g if (!ret) return TRACE_TYPE_PARTIAL_LINE; - return TRACE_TYPE_HANDLED; + /* + * we already consumed the current entry to check the next one + * and see if this is a leaf. + */ + return TRACE_TYPE_NO_CONSUME; } static enum print_line_t print_graph_entry(struct ftrace_graph_ent_entry *field, struct trace_seq *s, - struct trace_iterator *iter, int cpu) + struct trace_iterator *iter) { int ret; + int cpu = iter->cpu; + pid_t *last_entry = iter->private; struct trace_entry *ent = iter->ent; + struct ftrace_graph_ent *call = &field->graph_ent; + struct ftrace_graph_ret_entry *leaf_ret; /* Pid */ - if (verif_pid(s, ent->pid, cpu) == TRACE_TYPE_PARTIAL_LINE) + if (verif_pid(s, ent->pid, cpu, last_entry) == TRACE_TYPE_PARTIAL_LINE) return TRACE_TYPE_PARTIAL_LINE; + /* Interrupt */ + ret = print_graph_irq(iter, call->func, TRACE_GRAPH_ENT, cpu, ent->pid); + if (ret == TRACE_TYPE_PARTIAL_LINE) + return TRACE_TYPE_PARTIAL_LINE; + + /* Absolute time */ + if (tracer_flags.val & TRACE_GRAPH_PRINT_ABS_TIME) { + ret = print_graph_abs_time(iter->ts, s); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + } + /* Cpu */ if (tracer_flags.val & TRACE_GRAPH_PRINT_CPU) { ret = print_graph_cpu(s, cpu); @@ -460,8 +598,9 @@ print_graph_entry(struct ftrace_graph_en return TRACE_TYPE_PARTIAL_LINE; } - if (trace_branch_is_leaf(iter, field)) - return print_graph_entry_leaf(iter, field, s); + leaf_ret = get_return_for_leaf(iter, field); + if (leaf_ret) + return print_graph_entry_leaf(iter, field, leaf_ret, s); else return print_graph_entry_nested(field, s, iter->ent->pid, cpu); @@ -469,16 +608,25 @@ print_graph_entry(struct ftrace_graph_en static enum print_line_t print_graph_return(struct ftrace_graph_ret *trace, struct trace_seq *s, - struct trace_entry *ent, int cpu) + struct trace_entry *ent, struct trace_iterator *iter) { int i; int ret; + int cpu = iter->cpu; + pid_t *last_pid = iter->private, pid = ent->pid; unsigned long long duration = trace->rettime - trace->calltime; /* Pid */ - if (verif_pid(s, ent->pid, cpu) == TRACE_TYPE_PARTIAL_LINE) + if (verif_pid(s, pid, cpu, last_pid) == TRACE_TYPE_PARTIAL_LINE) return TRACE_TYPE_PARTIAL_LINE; + /* Absolute time */ + if (tracer_flags.val & TRACE_GRAPH_PRINT_ABS_TIME) { + ret = print_graph_abs_time(iter->ts, s); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + } + /* Cpu */ if (tracer_flags.val & TRACE_GRAPH_PRINT_CPU) { ret = print_graph_cpu(s, cpu); @@ -498,16 +646,16 @@ print_graph_return(struct ftrace_graph_r } /* Overhead */ - if (tracer_flags.val & TRACE_GRAPH_PRINT_OVERHEAD) { - ret = print_graph_overhead(duration, s); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - } + ret = print_graph_overhead(duration, s); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; /* Duration */ - ret = print_graph_duration(duration, s); - if (ret == TRACE_TYPE_PARTIAL_LINE) - return TRACE_TYPE_PARTIAL_LINE; + if (tracer_flags.val & TRACE_GRAPH_PRINT_DURATION) { + ret = print_graph_duration(duration, s); + if (ret == TRACE_TYPE_PARTIAL_LINE) + return TRACE_TYPE_PARTIAL_LINE; + } /* Closing brace */ for (i = 0; i < trace->depth * TRACE_GRAPH_INDENT; i++) { @@ -528,7 +676,7 @@ print_graph_return(struct ftrace_graph_r return TRACE_TYPE_PARTIAL_LINE; } - ret = print_graph_irq(s, trace->func, TRACE_GRAPH_RET, cpu, ent->pid); + ret = print_graph_irq(iter, trace->func, TRACE_GRAPH_RET, cpu, pid); if (ret == TRACE_TYPE_PARTIAL_LINE) return TRACE_TYPE_PARTIAL_LINE; @@ -541,14 +689,23 @@ print_graph_comment(struct print_entry * { int i; int ret; + int cpu = iter->cpu; + pid_t *last_pid = iter->private; /* Pid */ - if (verif_pid(s, ent->pid, iter->cpu) == TRACE_TYPE_PARTIAL_LINE) + if (verif_pid(s, ent->pid, cpu, last_pid) == TRACE_TYPE_PARTIAL_LINE) return TRACE_TYPE_PARTIAL_LINE; + /* Absolute time */ + if (tracer_flags.val & TRACE_GRAPH_PRINT_ABS_TIME) { + ret = print_graph_abs_time(iter->ts, s); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + } + /* Cpu */ if (tracer_flags.val & TRACE_GRAPH_PRINT_CPU) { - ret = print_graph_cpu(s, iter->cpu); + ret = print_graph_cpu(s, cpu); if (ret == TRACE_TYPE_PARTIAL_LINE) return TRACE_TYPE_PARTIAL_LINE; } @@ -565,17 +722,17 @@ print_graph_comment(struct print_entry * } /* No overhead */ - if (tracer_flags.val & TRACE_GRAPH_PRINT_OVERHEAD) { - ret = trace_seq_printf(s, " "); + ret = print_graph_overhead(-1, s); + if (!ret) + return TRACE_TYPE_PARTIAL_LINE; + + /* No time */ + if (tracer_flags.val & TRACE_GRAPH_PRINT_DURATION) { + ret = trace_seq_printf(s, " | "); if (!ret) return TRACE_TYPE_PARTIAL_LINE; } - /* No time */ - ret = trace_seq_printf(s, " | "); - if (!ret) - return TRACE_TYPE_PARTIAL_LINE; - /* Indentation */ if (trace->depth > 0) for (i = 0; i < (trace->depth + 1) * TRACE_GRAPH_INDENT; i++) { @@ -589,8 +746,11 @@ print_graph_comment(struct print_entry * if (!ret) return TRACE_TYPE_PARTIAL_LINE; - if (ent->flags & TRACE_FLAG_CONT) - trace_seq_print_cont(s, iter); + /* Strip ending newline */ + if (s->buffer[s->len - 1] == '\n') { + s->buffer[s->len - 1] = '\0'; + s->len--; + } ret = trace_seq_printf(s, " */\n"); if (!ret) @@ -610,13 +770,12 @@ print_graph_function(struct trace_iterat case TRACE_GRAPH_ENT: { struct ftrace_graph_ent_entry *field; trace_assign_type(field, entry); - return print_graph_entry(field, s, iter, - iter->cpu); + return print_graph_entry(field, s, iter); } case TRACE_GRAPH_RET: { struct ftrace_graph_ret_entry *field; trace_assign_type(field, entry); - return print_graph_return(&field->ret, s, entry, iter->cpu); + return print_graph_return(&field->ret, s, entry, iter); } case TRACE_PRINT: { struct print_entry *field; @@ -632,33 +791,64 @@ static void print_graph_headers(struct s { /* 1st line */ seq_printf(s, "# "); + if (tracer_flags.val & TRACE_GRAPH_PRINT_ABS_TIME) + seq_printf(s, " TIME "); if (tracer_flags.val & TRACE_GRAPH_PRINT_CPU) - seq_printf(s, "CPU "); + seq_printf(s, "CPU"); if (tracer_flags.val & TRACE_GRAPH_PRINT_PROC) - seq_printf(s, "TASK/PID "); - if (tracer_flags.val & TRACE_GRAPH_PRINT_OVERHEAD) - seq_printf(s, "OVERHEAD/"); - seq_printf(s, "DURATION FUNCTION CALLS\n"); + seq_printf(s, " TASK/PID "); + if (tracer_flags.val & TRACE_GRAPH_PRINT_DURATION) + seq_printf(s, " DURATION "); + seq_printf(s, " FUNCTION CALLS\n"); /* 2nd line */ seq_printf(s, "# "); + if (tracer_flags.val & TRACE_GRAPH_PRINT_ABS_TIME) + seq_printf(s, " | "); if (tracer_flags.val & TRACE_GRAPH_PRINT_CPU) - seq_printf(s, "| "); + seq_printf(s, "| "); if (tracer_flags.val & TRACE_GRAPH_PRINT_PROC) - seq_printf(s, "| | "); - if (tracer_flags.val & TRACE_GRAPH_PRINT_OVERHEAD) { - seq_printf(s, "| "); - seq_printf(s, "| | | | |\n"); - } else - seq_printf(s, " | | | | |\n"); + seq_printf(s, " | | "); + if (tracer_flags.val & TRACE_GRAPH_PRINT_DURATION) + seq_printf(s, " | | "); + seq_printf(s, " | | | |\n"); } + +static void graph_trace_open(struct trace_iterator *iter) +{ + /* pid on the last trace processed */ + pid_t *last_pid = alloc_percpu(pid_t); + int cpu; + + if (!last_pid) + pr_warning("function graph tracer: not enough memory\n"); + else + for_each_possible_cpu(cpu) { + pid_t *pid = per_cpu_ptr(last_pid, cpu); + *pid = -1; + } + + iter->private = last_pid; +} + +static void graph_trace_close(struct trace_iterator *iter) +{ + percpu_free(iter->private); +} + static struct tracer graph_trace __read_mostly = { .name = "function_graph", + .open = graph_trace_open, + .close = graph_trace_close, + .wait_pipe = poll_wait_pipe, .init = graph_trace_init, .reset = graph_trace_reset, .print_line = print_graph_function, .print_header = print_graph_headers, .flags = &tracer_flags, +#ifdef CONFIG_FTRACE_SELFTEST + .selftest = trace_selftest_startup_function_graph, +#endif }; static __init int init_graph_trace(void) Index: linux-2.6-tip/kernel/trace/trace_hw_branches.c =================================================================== --- linux-2.6-tip.orig/kernel/trace/trace_hw_branches.c +++ linux-2.6-tip/kernel/trace/trace_hw_branches.c @@ -1,7 +1,8 @@ /* * h/w branch tracer for x86 based on bts * - * Copyright (C) 2008 Markus Metzger + * Copyright (C) 2008-2009 Intel Corporation. + * Markus Metzger , 2008-2009 * */ @@ -10,21 +11,44 @@ #include #include #include +#include +#include +#include #include #include "trace.h" +#include "trace_output.h" #define SIZEOF_BTS (1 << 13) +/* The tracer mutex protects the below per-cpu tracer array. + It needs to be held to: + - start tracing on all cpus + - stop tracing on all cpus + - start tracing on a single hotplug cpu + - stop tracing on a single hotplug cpu + - read the trace from all cpus + - read the trace from a single cpu +*/ +static DEFINE_MUTEX(bts_tracer_mutex); static DEFINE_PER_CPU(struct bts_tracer *, tracer); static DEFINE_PER_CPU(unsigned char[SIZEOF_BTS], buffer); #define this_tracer per_cpu(tracer, smp_processor_id()) #define this_buffer per_cpu(buffer, smp_processor_id()) +static int __read_mostly trace_hw_branches_enabled; +static struct trace_array *hw_branch_trace __read_mostly; + +/* + * Start tracing on the current cpu. + * The argument is ignored. + * + * pre: bts_tracer_mutex must be locked. + */ static void bts_trace_start_cpu(void *arg) { if (this_tracer) @@ -42,14 +66,20 @@ static void bts_trace_start_cpu(void *ar static void bts_trace_start(struct trace_array *tr) { - int cpu; + mutex_lock(&bts_tracer_mutex); - tracing_reset_online_cpus(tr); + on_each_cpu(bts_trace_start_cpu, NULL, 1); + trace_hw_branches_enabled = 1; - for_each_cpu(cpu, cpu_possible_mask) - smp_call_function_single(cpu, bts_trace_start_cpu, NULL, 1); + mutex_unlock(&bts_tracer_mutex); } +/* + * Stop tracing on the current cpu. + * The argument is ignored. + * + * pre: bts_tracer_mutex must be locked. + */ static void bts_trace_stop_cpu(void *arg) { if (this_tracer) { @@ -60,26 +90,62 @@ static void bts_trace_stop_cpu(void *arg static void bts_trace_stop(struct trace_array *tr) { - int cpu; + mutex_lock(&bts_tracer_mutex); + + trace_hw_branches_enabled = 0; + on_each_cpu(bts_trace_stop_cpu, NULL, 1); + + mutex_unlock(&bts_tracer_mutex); +} + +static int __cpuinit bts_hotcpu_handler(struct notifier_block *nfb, + unsigned long action, void *hcpu) +{ + unsigned int cpu = (unsigned long)hcpu; + + mutex_lock(&bts_tracer_mutex); + + if (!trace_hw_branches_enabled) + goto out; - for_each_cpu(cpu, cpu_possible_mask) + switch (action) { + case CPU_ONLINE: + case CPU_DOWN_FAILED: + smp_call_function_single(cpu, bts_trace_start_cpu, NULL, 1); + break; + case CPU_DOWN_PREPARE: smp_call_function_single(cpu, bts_trace_stop_cpu, NULL, 1); + break; + } + + out: + mutex_unlock(&bts_tracer_mutex); + return NOTIFY_DONE; } -static int bts_trace_init(struct trace_array *tr) +static struct notifier_block bts_hotcpu_notifier __cpuinitdata = { + .notifier_call = bts_hotcpu_handler +}; + +static int __cpuinit bts_trace_init(struct trace_array *tr) { - tracing_reset_online_cpus(tr); + hw_branch_trace = tr; + + register_hotcpu_notifier(&bts_hotcpu_notifier); bts_trace_start(tr); return 0; } +static void __cpuinit bts_trace_reset(struct trace_array *tr) +{ + bts_trace_stop(tr); + unregister_hotcpu_notifier(&bts_hotcpu_notifier); +} + static void bts_trace_print_header(struct seq_file *m) { - seq_puts(m, - "# CPU# FROM TO FUNCTION\n"); - seq_puts(m, - "# | | | |\n"); + seq_puts(m, "# CPU# TO <- FROM\n"); } static enum print_line_t bts_trace_print_line(struct trace_iterator *iter) @@ -87,15 +153,15 @@ static enum print_line_t bts_trace_print struct trace_entry *entry = iter->ent; struct trace_seq *seq = &iter->seq; struct hw_branch_entry *it; + unsigned long symflags = TRACE_ITER_SYM_OFFSET; trace_assign_type(it, entry); if (entry->type == TRACE_HW_BRANCHES) { - if (trace_seq_printf(seq, "%4d ", entry->cpu) && - trace_seq_printf(seq, "0x%016llx -> 0x%016llx ", - it->from, it->to) && - (!it->from || - seq_print_ip_sym(seq, it->from, /* sym_flags = */ 0)) && + if (trace_seq_printf(seq, "%4d ", iter->cpu) && + seq_print_ip_sym(seq, it->to, symflags) && + trace_seq_printf(seq, "\t <- ") && + seq_print_ip_sym(seq, it->from, symflags) && trace_seq_printf(seq, "\n")) return TRACE_TYPE_HANDLED; return TRACE_TYPE_PARTIAL_LINE;; @@ -103,26 +169,42 @@ static enum print_line_t bts_trace_print return TRACE_TYPE_UNHANDLED; } -void trace_hw_branch(struct trace_array *tr, u64 from, u64 to) +void trace_hw_branch(u64 from, u64 to) { + struct trace_array *tr = hw_branch_trace; struct ring_buffer_event *event; struct hw_branch_entry *entry; - unsigned long irq; + unsigned long irq1; + int cpu; - event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), &irq); - if (!event) + if (unlikely(!tr)) return; + + if (unlikely(!trace_hw_branches_enabled)) + return; + + local_irq_save(irq1); + cpu = raw_smp_processor_id(); + if (atomic_inc_return(&tr->data[cpu]->disabled) != 1) + goto out; + + event = trace_buffer_lock_reserve(tr, TRACE_HW_BRANCHES, + sizeof(*entry), 0, 0); + if (!event) + goto out; entry = ring_buffer_event_data(event); tracing_generic_entry_update(&entry->ent, 0, from); entry->ent.type = TRACE_HW_BRANCHES; - entry->ent.cpu = smp_processor_id(); entry->from = from; entry->to = to; - ring_buffer_unlock_commit(tr->buffer, event, irq); + trace_buffer_unlock_commit(tr, event, 0, 0); + + out: + atomic_dec(&tr->data[cpu]->disabled); + local_irq_restore(irq1); } -static void trace_bts_at(struct trace_array *tr, - const struct bts_trace *trace, void *at) +static void trace_bts_at(const struct bts_trace *trace, void *at) { struct bts_struct bts; int err = 0; @@ -137,18 +219,29 @@ static void trace_bts_at(struct trace_ar switch (bts.qualifier) { case BTS_BRANCH: - trace_hw_branch(tr, bts.variant.lbr.from, bts.variant.lbr.to); + trace_hw_branch(bts.variant.lbr.from, bts.variant.lbr.to); break; } } +/* + * Collect the trace on the current cpu and write it into the ftrace buffer. + * + * pre: bts_tracer_mutex must be locked + */ static void trace_bts_cpu(void *arg) { struct trace_array *tr = (struct trace_array *) arg; const struct bts_trace *trace; unsigned char *at; - if (!this_tracer) + if (unlikely(!tr)) + return; + + if (unlikely(atomic_read(&tr->data[raw_smp_processor_id()]->disabled))) + return; + + if (unlikely(!this_tracer)) return; ds_suspend_bts(this_tracer); @@ -158,11 +251,11 @@ static void trace_bts_cpu(void *arg) for (at = trace->ds.top; (void *)at < trace->ds.end; at += trace->ds.size) - trace_bts_at(tr, trace, at); + trace_bts_at(trace, at); for (at = trace->ds.begin; (void *)at < trace->ds.top; at += trace->ds.size) - trace_bts_at(tr, trace, at); + trace_bts_at(trace, at); out: ds_resume_bts(this_tracer); @@ -170,22 +263,38 @@ out: static void trace_bts_prepare(struct trace_iterator *iter) { - int cpu; + mutex_lock(&bts_tracer_mutex); + + on_each_cpu(trace_bts_cpu, iter->tr, 1); + + mutex_unlock(&bts_tracer_mutex); +} + +static void trace_bts_close(struct trace_iterator *iter) +{ + tracing_reset_online_cpus(iter->tr); +} + +void trace_hw_branch_oops(void) +{ + mutex_lock(&bts_tracer_mutex); + + trace_bts_cpu(hw_branch_trace); - for_each_cpu(cpu, cpu_possible_mask) - smp_call_function_single(cpu, trace_bts_cpu, iter->tr, 1); + mutex_unlock(&bts_tracer_mutex); } struct tracer bts_tracer __read_mostly = { .name = "hw-branch-tracer", .init = bts_trace_init, - .reset = bts_trace_stop, + .reset = bts_trace_reset, .print_header = bts_trace_print_header, .print_line = bts_trace_print_line, .start = bts_trace_start, .stop = bts_trace_stop, - .open = trace_bts_prepare + .open = trace_bts_prepare, + .close = trace_bts_close }; __init static int init_bts_trace(void) Index: linux-2.6-tip/kernel/trace/trace_irqsoff.c =================================================================== --- linux-2.6-tip.orig/kernel/trace/trace_irqsoff.c +++ linux-2.6-tip/kernel/trace/trace_irqsoff.c @@ -1,5 +1,5 @@ /* - * trace irqs off criticall timings + * trace irqs off critical timings * * Copyright (C) 2007-2008 Steven Rostedt * Copyright (C) 2008 Ingo Molnar @@ -23,7 +23,7 @@ static int tracer_enabled __read_most static DEFINE_PER_CPU(int, tracing_cpu); -static DEFINE_SPINLOCK(max_trace_lock); +static DEFINE_RAW_SPINLOCK(max_trace_lock); enum { TRACER_IRQS_OFF = (1 << 1), @@ -95,7 +95,7 @@ irqsoff_tracer_call(unsigned long ip, un disabled = atomic_inc_return(&data->disabled); if (likely(disabled == 1)) - trace_function(tr, data, ip, parent_ip, flags, preempt_count()); + trace_function(tr, ip, parent_ip, flags, preempt_count()); atomic_dec(&data->disabled); } @@ -153,7 +153,7 @@ check_critical_timing(struct trace_array if (!report_latency(delta)) goto out_unlock; - trace_function(tr, data, CALLER_ADDR0, parent_ip, flags, pc); + trace_function(tr, CALLER_ADDR0, parent_ip, flags, pc); latency = nsecs_to_usecs(delta); @@ -177,7 +177,7 @@ out: data->critical_sequence = max_sequence; data->preempt_timestamp = ftrace_now(cpu); tracing_reset(tr, cpu); - trace_function(tr, data, CALLER_ADDR0, parent_ip, flags, pc); + trace_function(tr, CALLER_ADDR0, parent_ip, flags, pc); } static inline void @@ -210,7 +210,7 @@ start_critical_timing(unsigned long ip, local_save_flags(flags); - trace_function(tr, data, ip, parent_ip, flags, preempt_count()); + trace_function(tr, ip, parent_ip, flags, preempt_count()); per_cpu(tracing_cpu, cpu) = 1; @@ -244,7 +244,7 @@ stop_critical_timing(unsigned long ip, u atomic_inc(&data->disabled); local_save_flags(flags); - trace_function(tr, data, ip, parent_ip, flags, preempt_count()); + trace_function(tr, ip, parent_ip, flags, preempt_count()); check_critical_timing(tr, data, parent_ip ? : ip, cpu); data->critical_start = 0; atomic_dec(&data->disabled); @@ -353,28 +353,18 @@ void trace_preempt_off(unsigned long a0, } #endif /* CONFIG_PREEMPT_TRACER */ -/* - * save_tracer_enabled is used to save the state of the tracer_enabled - * variable when we disable it when we open a trace output file. - */ -static int save_tracer_enabled; - static void start_irqsoff_tracer(struct trace_array *tr) { register_ftrace_function(&trace_ops); - if (tracing_is_enabled()) { + if (tracing_is_enabled()) tracer_enabled = 1; - save_tracer_enabled = 1; - } else { + else tracer_enabled = 0; - save_tracer_enabled = 0; - } } static void stop_irqsoff_tracer(struct trace_array *tr) { tracer_enabled = 0; - save_tracer_enabled = 0; unregister_ftrace_function(&trace_ops); } @@ -395,25 +385,11 @@ static void irqsoff_tracer_reset(struct static void irqsoff_tracer_start(struct trace_array *tr) { tracer_enabled = 1; - save_tracer_enabled = 1; } static void irqsoff_tracer_stop(struct trace_array *tr) { tracer_enabled = 0; - save_tracer_enabled = 0; -} - -static void irqsoff_tracer_open(struct trace_iterator *iter) -{ - /* stop the trace while dumping */ - tracer_enabled = 0; -} - -static void irqsoff_tracer_close(struct trace_iterator *iter) -{ - /* restart tracing */ - tracer_enabled = save_tracer_enabled; } #ifdef CONFIG_IRQSOFF_TRACER @@ -431,8 +407,6 @@ static struct tracer irqsoff_tracer __re .reset = irqsoff_tracer_reset, .start = irqsoff_tracer_start, .stop = irqsoff_tracer_stop, - .open = irqsoff_tracer_open, - .close = irqsoff_tracer_close, .print_max = 1, #ifdef CONFIG_FTRACE_SELFTEST .selftest = trace_selftest_startup_irqsoff, @@ -459,8 +433,6 @@ static struct tracer preemptoff_tracer _ .reset = irqsoff_tracer_reset, .start = irqsoff_tracer_start, .stop = irqsoff_tracer_stop, - .open = irqsoff_tracer_open, - .close = irqsoff_tracer_close, .print_max = 1, #ifdef CONFIG_FTRACE_SELFTEST .selftest = trace_selftest_startup_preemptoff, @@ -489,8 +461,6 @@ static struct tracer preemptirqsoff_trac .reset = irqsoff_tracer_reset, .start = irqsoff_tracer_start, .stop = irqsoff_tracer_stop, - .open = irqsoff_tracer_open, - .close = irqsoff_tracer_close, .print_max = 1, #ifdef CONFIG_FTRACE_SELFTEST .selftest = trace_selftest_startup_preemptirqsoff, Index: linux-2.6-tip/kernel/trace/trace_mmiotrace.c =================================================================== --- linux-2.6-tip.orig/kernel/trace/trace_mmiotrace.c +++ linux-2.6-tip/kernel/trace/trace_mmiotrace.c @@ -12,6 +12,7 @@ #include #include "trace.h" +#include "trace_output.h" struct header_iter { struct pci_dev *dev; @@ -183,21 +184,22 @@ static enum print_line_t mmio_print_rw(s switch (rw->opcode) { case MMIO_READ: ret = trace_seq_printf(s, - "R %d %lu.%06lu %d 0x%llx 0x%lx 0x%lx %d\n", + "R %d %u.%06lu %d 0x%llx 0x%lx 0x%lx %d\n", rw->width, secs, usec_rem, rw->map_id, (unsigned long long)rw->phys, rw->value, rw->pc, 0); break; case MMIO_WRITE: ret = trace_seq_printf(s, - "W %d %lu.%06lu %d 0x%llx 0x%lx 0x%lx %d\n", + "W %d %u.%06lu %d 0x%llx 0x%lx 0x%lx %d\n", rw->width, secs, usec_rem, rw->map_id, (unsigned long long)rw->phys, rw->value, rw->pc, 0); break; case MMIO_UNKNOWN_OP: ret = trace_seq_printf(s, - "UNKNOWN %lu.%06lu %d 0x%llx %02x,%02x,%02x 0x%lx %d\n", + "UNKNOWN %u.%06lu %d 0x%llx %02lx,%02lx," + "%02lx 0x%lx %d\n", secs, usec_rem, rw->map_id, (unsigned long long)rw->phys, (rw->value >> 16) & 0xff, (rw->value >> 8) & 0xff, @@ -229,14 +231,14 @@ static enum print_line_t mmio_print_map( switch (m->opcode) { case MMIO_PROBE: ret = trace_seq_printf(s, - "MAP %lu.%06lu %d 0x%llx 0x%lx 0x%lx 0x%lx %d\n", + "MAP %u.%06lu %d 0x%llx 0x%lx 0x%lx 0x%lx %d\n", secs, usec_rem, m->map_id, (unsigned long long)m->phys, m->virt, m->len, 0UL, 0); break; case MMIO_UNPROBE: ret = trace_seq_printf(s, - "UNMAP %lu.%06lu %d 0x%lx %d\n", + "UNMAP %u.%06lu %d 0x%lx %d\n", secs, usec_rem, m->map_id, 0UL, 0); break; default: @@ -260,13 +262,10 @@ static enum print_line_t mmio_print_mark int ret; /* The trailing newline must be in the message. */ - ret = trace_seq_printf(s, "MARK %lu.%06lu %s", secs, usec_rem, msg); + ret = trace_seq_printf(s, "MARK %u.%06lu %s", secs, usec_rem, msg); if (!ret) return TRACE_TYPE_PARTIAL_LINE; - if (entry->flags & TRACE_FLAG_CONT) - trace_seq_print_cont(s, iter); - return TRACE_TYPE_HANDLED; } @@ -308,21 +307,17 @@ static void __trace_mmiotrace_rw(struct { struct ring_buffer_event *event; struct trace_mmiotrace_rw *entry; - unsigned long irq_flags; + int pc = preempt_count(); - event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), - &irq_flags); + event = trace_buffer_lock_reserve(tr, TRACE_MMIO_RW, + sizeof(*entry), 0, pc); if (!event) { atomic_inc(&dropped_count); return; } entry = ring_buffer_event_data(event); - tracing_generic_entry_update(&entry->ent, 0, preempt_count()); - entry->ent.type = TRACE_MMIO_RW; entry->rw = *rw; - ring_buffer_unlock_commit(tr->buffer, event, irq_flags); - - trace_wake_up(); + trace_buffer_unlock_commit(tr, event, 0, pc); } void mmio_trace_rw(struct mmiotrace_rw *rw) @@ -338,21 +333,17 @@ static void __trace_mmiotrace_map(struct { struct ring_buffer_event *event; struct trace_mmiotrace_map *entry; - unsigned long irq_flags; + int pc = preempt_count(); - event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), - &irq_flags); + event = trace_buffer_lock_reserve(tr, TRACE_MMIO_MAP, + sizeof(*entry), 0, pc); if (!event) { atomic_inc(&dropped_count); return; } entry = ring_buffer_event_data(event); - tracing_generic_entry_update(&entry->ent, 0, preempt_count()); - entry->ent.type = TRACE_MMIO_MAP; entry->map = *map; - ring_buffer_unlock_commit(tr->buffer, event, irq_flags); - - trace_wake_up(); + trace_buffer_unlock_commit(tr, event, 0, pc); } void mmio_trace_mapping(struct mmiotrace_map *map) Index: linux-2.6-tip/kernel/trace/trace_nop.c =================================================================== --- linux-2.6-tip.orig/kernel/trace/trace_nop.c +++ linux-2.6-tip/kernel/trace/trace_nop.c @@ -47,12 +47,7 @@ static void stop_nop_trace(struct trace_ static int nop_trace_init(struct trace_array *tr) { - int cpu; ctx_trace = tr; - - for_each_online_cpu(cpu) - tracing_reset(tr, cpu); - start_nop_trace(tr); return 0; } Index: linux-2.6-tip/kernel/trace/trace_output.c =================================================================== --- /dev/null +++ linux-2.6-tip/kernel/trace/trace_output.c @@ -0,0 +1,919 @@ +/* + * trace_output.c + * + * Copyright (C) 2008 Red Hat Inc, Steven Rostedt + * + */ + +#include +#include +#include + +#include "trace_output.h" + +/* must be a power of 2 */ +#define EVENT_HASHSIZE 128 + +static DEFINE_MUTEX(trace_event_mutex); +static struct hlist_head event_hash[EVENT_HASHSIZE] __read_mostly; + +static int next_event_type = __TRACE_LAST_TYPE + 1; + +/** + * trace_seq_printf - sequence printing of trace information + * @s: trace sequence descriptor + * @fmt: printf format string + * + * The tracer may use either sequence operations or its own + * copy to user routines. To simplify formating of a trace + * trace_seq_printf is used to store strings into a special + * buffer (@s). Then the output may be either used by + * the sequencer or pulled into another buffer. + */ +int +trace_seq_printf(struct trace_seq *s, const char *fmt, ...) +{ + int len = (PAGE_SIZE - 1) - s->len; + va_list ap; + int ret; + + if (!len) + return 0; + + va_start(ap, fmt); + ret = vsnprintf(s->buffer + s->len, len, fmt, ap); + va_end(ap); + + /* If we can't write it all, don't bother writing anything */ + if (ret >= len) + return 0; + + s->len += ret; + + return len; +} + +/** + * trace_seq_puts - trace sequence printing of simple string + * @s: trace sequence descriptor + * @str: simple string to record + * + * The tracer may use either the sequence operations or its own + * copy to user routines. This function records a simple string + * into a special buffer (@s) for later retrieval by a sequencer + * or other mechanism. + */ +int trace_seq_puts(struct trace_seq *s, const char *str) +{ + int len = strlen(str); + + if (len > ((PAGE_SIZE - 1) - s->len)) + return 0; + + memcpy(s->buffer + s->len, str, len); + s->len += len; + + return len; +} + +int trace_seq_putc(struct trace_seq *s, unsigned char c) +{ + if (s->len >= (PAGE_SIZE - 1)) + return 0; + + s->buffer[s->len++] = c; + + return 1; +} + +int trace_seq_putmem(struct trace_seq *s, void *mem, size_t len) +{ + if (len > ((PAGE_SIZE - 1) - s->len)) + return 0; + + memcpy(s->buffer + s->len, mem, len); + s->len += len; + + return len; +} + +int trace_seq_putmem_hex(struct trace_seq *s, void *mem, size_t len) +{ + unsigned char hex[HEX_CHARS]; + unsigned char *data = mem; + int i, j; + +#ifdef __BIG_ENDIAN + for (i = 0, j = 0; i < len; i++) { +#else + for (i = len-1, j = 0; i >= 0; i--) { +#endif + hex[j++] = hex_asc_hi(data[i]); + hex[j++] = hex_asc_lo(data[i]); + } + hex[j++] = ' '; + + return trace_seq_putmem(s, hex, j); +} + +int trace_seq_path(struct trace_seq *s, struct path *path) +{ + unsigned char *p; + + if (s->len >= (PAGE_SIZE - 1)) + return 0; + p = d_path(path, s->buffer + s->len, PAGE_SIZE - s->len); + if (!IS_ERR(p)) { + p = mangle_path(s->buffer + s->len, p, "\n"); + if (p) { + s->len = p - s->buffer; + return 1; + } + } else { + s->buffer[s->len++] = '?'; + return 1; + } + + return 0; +} + +#ifdef CONFIG_KRETPROBES +static inline const char *kretprobed(const char *name) +{ + static const char tramp_name[] = "kretprobe_trampoline"; + int size = sizeof(tramp_name); + + if (strncmp(tramp_name, name, size) == 0) + return "[unknown/kretprobe'd]"; + return name; +} +#else +static inline const char *kretprobed(const char *name) +{ + return name; +} +#endif /* CONFIG_KRETPROBES */ + +static int +seq_print_sym_short(struct trace_seq *s, const char *fmt, unsigned long address) +{ +#ifdef CONFIG_KALLSYMS + char str[KSYM_SYMBOL_LEN]; + const char *name; + + kallsyms_lookup(address, NULL, NULL, NULL, str); + + name = kretprobed(str); + + return trace_seq_printf(s, fmt, name); +#endif + return 1; +} + +static int +seq_print_sym_offset(struct trace_seq *s, const char *fmt, + unsigned long address) +{ +#ifdef CONFIG_KALLSYMS + char str[KSYM_SYMBOL_LEN]; + const char *name; + + sprint_symbol(str, address); + name = kretprobed(str); + + return trace_seq_printf(s, fmt, name); +#endif + return 1; +} + +#ifndef CONFIG_64BIT +# define IP_FMT "%08lx" +#else +# define IP_FMT "%016lx" +#endif + +int seq_print_user_ip(struct trace_seq *s, struct mm_struct *mm, + unsigned long ip, unsigned long sym_flags) +{ + struct file *file = NULL; + unsigned long vmstart = 0; + int ret = 1; + + if (mm) { + const struct vm_area_struct *vma; + + down_read(&mm->mmap_sem); + vma = find_vma(mm, ip); + if (vma) { + file = vma->vm_file; + vmstart = vma->vm_start; + } + if (file) { + ret = trace_seq_path(s, &file->f_path); + if (ret) + ret = trace_seq_printf(s, "[+0x%lx]", + ip - vmstart); + } + up_read(&mm->mmap_sem); + } + if (ret && ((sym_flags & TRACE_ITER_SYM_ADDR) || !file)) + ret = trace_seq_printf(s, " <" IP_FMT ">", ip); + return ret; +} + +int +seq_print_userip_objs(const struct userstack_entry *entry, struct trace_seq *s, + unsigned long sym_flags) +{ + struct mm_struct *mm = NULL; + int ret = 1; + unsigned int i; + + if (trace_flags & TRACE_ITER_SYM_USEROBJ) { + struct task_struct *task; + /* + * we do the lookup on the thread group leader, + * since individual threads might have already quit! + */ + rcu_read_lock(); + task = find_task_by_vpid(entry->ent.tgid); + if (task) + mm = get_task_mm(task); + rcu_read_unlock(); + } + + for (i = 0; i < FTRACE_STACK_ENTRIES; i++) { + unsigned long ip = entry->caller[i]; + + if (ip == ULONG_MAX || !ret) + break; + if (i && ret) + ret = trace_seq_puts(s, " <- "); + if (!ip) { + if (ret) + ret = trace_seq_puts(s, "??"); + continue; + } + if (!ret) + break; + if (ret) + ret = seq_print_user_ip(s, mm, ip, sym_flags); + } + + if (mm) + mmput(mm); + return ret; +} + +int +seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags) +{ + int ret; + + if (!ip) + return trace_seq_printf(s, "0"); + + if (sym_flags & TRACE_ITER_SYM_OFFSET) + ret = seq_print_sym_offset(s, "%s", ip); + else + ret = seq_print_sym_short(s, "%s", ip); + + if (!ret) + return 0; + + if (sym_flags & TRACE_ITER_SYM_ADDR) + ret = trace_seq_printf(s, " <" IP_FMT ">", ip); + return ret; +} + +static int +lat_print_generic(struct trace_seq *s, struct trace_entry *entry, int cpu) +{ + int hardirq, softirq; + char *comm; + + comm = trace_find_cmdline(entry->pid); + hardirq = entry->flags & TRACE_FLAG_HARDIRQ; + softirq = entry->flags & TRACE_FLAG_SOFTIRQ; + + if (!trace_seq_printf(s, "%8.8s-%-5d %3d%c%c%c", + comm, entry->pid, cpu, + (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' : + (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? + 'X' : '.', + (entry->flags & TRACE_FLAG_NEED_RESCHED) ? + 'N' : '.', + (hardirq && softirq) ? 'H' : + hardirq ? 'h' : softirq ? 's' : '.')) + return 0; + + if (entry->preempt_count) + return trace_seq_printf(s, "%x", entry->preempt_count); + return trace_seq_puts(s, "."); +} + +static unsigned long preempt_mark_thresh = 100; + +static int +lat_print_timestamp(struct trace_seq *s, u64 abs_usecs, + unsigned long rel_usecs) +{ + return trace_seq_printf(s, " %4lldus%c: ", abs_usecs, + rel_usecs > preempt_mark_thresh ? '!' : + rel_usecs > 1 ? '+' : ' '); +} + +int trace_print_context(struct trace_iterator *iter) +{ + struct trace_seq *s = &iter->seq; + struct trace_entry *entry = iter->ent; + char *comm = trace_find_cmdline(entry->pid); + unsigned long long t = ns2usecs(iter->ts); + unsigned long usec_rem = do_div(t, USEC_PER_SEC); + unsigned long secs = (unsigned long)t; + + return trace_seq_printf(s, "%16s-%-5d [%03d] %5lu.%06lu: ", + comm, entry->pid, iter->cpu, secs, usec_rem); +} + +int trace_print_lat_context(struct trace_iterator *iter) +{ + u64 next_ts; + int ret; + struct trace_seq *s = &iter->seq; + struct trace_entry *entry = iter->ent, + *next_entry = trace_find_next_entry(iter, NULL, + &next_ts); + unsigned long verbose = (trace_flags & TRACE_ITER_VERBOSE); + unsigned long abs_usecs = ns2usecs(iter->ts - iter->tr->time_start); + unsigned long rel_usecs; + + if (!next_entry) + next_ts = iter->ts; + rel_usecs = ns2usecs(next_ts - iter->ts); + + if (verbose) { + char *comm = trace_find_cmdline(entry->pid); + ret = trace_seq_printf(s, "%16s %5d %3d %d %08x %08lx [%08lx]" + " %ld.%03ldms (+%ld.%03ldms): ", comm, + entry->pid, iter->cpu, entry->flags, + entry->preempt_count, iter->idx, + ns2usecs(iter->ts), + abs_usecs / USEC_PER_MSEC, + abs_usecs % USEC_PER_MSEC, + rel_usecs / USEC_PER_MSEC, + rel_usecs % USEC_PER_MSEC); + } else { + ret = lat_print_generic(s, entry, iter->cpu); + if (ret) + ret = lat_print_timestamp(s, abs_usecs, rel_usecs); + } + + return ret; +} + +static const char state_to_char[] = TASK_STATE_TO_CHAR_STR; + +static int task_state_char(unsigned long state) +{ + int bit = state ? __ffs(state) + 1 : 0; + + return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?'; +} + +/** + * ftrace_find_event - find a registered event + * @type: the type of event to look for + * + * Returns an event of type @type otherwise NULL + */ +struct trace_event *ftrace_find_event(int type) +{ + struct trace_event *event; + struct hlist_node *n; + unsigned key; + + key = type & (EVENT_HASHSIZE - 1); + + hlist_for_each_entry_rcu(event, n, &event_hash[key], node) { + if (event->type == type) + return event; + } + + return NULL; +} + +/** + * register_ftrace_event - register output for an event type + * @event: the event type to register + * + * Event types are stored in a hash and this hash is used to + * find a way to print an event. If the @event->type is set + * then it will use that type, otherwise it will assign a + * type to use. + * + * If you assign your own type, please make sure it is added + * to the trace_type enum in trace.h, to avoid collisions + * with the dynamic types. + * + * Returns the event type number or zero on error. + */ +int register_ftrace_event(struct trace_event *event) +{ + unsigned key; + int ret = 0; + + mutex_lock(&trace_event_mutex); + + if (!event->type) + event->type = next_event_type++; + else if (event->type > __TRACE_LAST_TYPE) { + printk(KERN_WARNING "Need to add type to trace.h\n"); + WARN_ON(1); + } + + if (ftrace_find_event(event->type)) + goto out; + + if (event->trace == NULL) + event->trace = trace_nop_print; + if (event->latency_trace == NULL) + event->latency_trace = trace_nop_print; + if (event->raw == NULL) + event->raw = trace_nop_print; + if (event->hex == NULL) + event->hex = trace_nop_print; + if (event->binary == NULL) + event->binary = trace_nop_print; + + key = event->type & (EVENT_HASHSIZE - 1); + + hlist_add_head_rcu(&event->node, &event_hash[key]); + + ret = event->type; + out: + mutex_unlock(&trace_event_mutex); + + return ret; +} + +/** + * unregister_ftrace_event - remove a no longer used event + * @event: the event to remove + */ +int unregister_ftrace_event(struct trace_event *event) +{ + mutex_lock(&trace_event_mutex); + hlist_del(&event->node); + mutex_unlock(&trace_event_mutex); + + return 0; +} + +/* + * Standard events + */ + +enum print_line_t trace_nop_print(struct trace_iterator *iter, int flags) +{ + return TRACE_TYPE_HANDLED; +} + +/* TRACE_FN */ +static enum print_line_t trace_fn_latency(struct trace_iterator *iter, + int flags) +{ + struct ftrace_entry *field; + struct trace_seq *s = &iter->seq; + + trace_assign_type(field, iter->ent); + + if (!seq_print_ip_sym(s, field->ip, flags)) + goto partial; + if (!trace_seq_puts(s, " (")) + goto partial; + if (!seq_print_ip_sym(s, field->parent_ip, flags)) + goto partial; + if (!trace_seq_puts(s, ")\n")) + goto partial; + + return TRACE_TYPE_HANDLED; + + partial: + return TRACE_TYPE_PARTIAL_LINE; +} + +static enum print_line_t trace_fn_trace(struct trace_iterator *iter, int flags) +{ + struct ftrace_entry *field; + struct trace_seq *s = &iter->seq; + + trace_assign_type(field, iter->ent); + + if (!seq_print_ip_sym(s, field->ip, flags)) + goto partial; + + if ((flags & TRACE_ITER_PRINT_PARENT) && field->parent_ip) { + if (!trace_seq_printf(s, " <-")) + goto partial; + if (!seq_print_ip_sym(s, + field->parent_ip, + flags)) + goto partial; + } + if (!trace_seq_printf(s, "\n")) + goto partial; + + return TRACE_TYPE_HANDLED; + + partial: + return TRACE_TYPE_PARTIAL_LINE; +} + +static enum print_line_t trace_fn_raw(struct trace_iterator *iter, int flags) +{ + struct ftrace_entry *field; + + trace_assign_type(field, iter->ent); + + if (!trace_seq_printf(&iter->seq, "%lx %lx\n", + field->ip, + field->parent_ip)) + return TRACE_TYPE_PARTIAL_LINE; + + return TRACE_TYPE_HANDLED; +} + +static enum print_line_t trace_fn_hex(struct trace_iterator *iter, int flags) +{ + struct ftrace_entry *field; + struct trace_seq *s = &iter->seq; + + trace_assign_type(field, iter->ent); + + SEQ_PUT_HEX_FIELD_RET(s, field->ip); + SEQ_PUT_HEX_FIELD_RET(s, field->parent_ip); + + return TRACE_TYPE_HANDLED; +} + +static enum print_line_t trace_fn_bin(struct trace_iterator *iter, int flags) +{ + struct ftrace_entry *field; + struct trace_seq *s = &iter->seq; + + trace_assign_type(field, iter->ent); + + SEQ_PUT_FIELD_RET(s, field->ip); + SEQ_PUT_FIELD_RET(s, field->parent_ip); + + return TRACE_TYPE_HANDLED; +} + +static struct trace_event trace_fn_event = { + .type = TRACE_FN, + .trace = trace_fn_trace, + .latency_trace = trace_fn_latency, + .raw = trace_fn_raw, + .hex = trace_fn_hex, + .binary = trace_fn_bin, +}; + +/* TRACE_CTX an TRACE_WAKE */ +static enum print_line_t trace_ctxwake_print(struct trace_iterator *iter, + char *delim) +{ + struct ctx_switch_entry *field; + char *comm; + int S, T; + + trace_assign_type(field, iter->ent); + + T = task_state_char(field->next_state); + S = task_state_char(field->prev_state); + comm = trace_find_cmdline(field->next_pid); + if (!trace_seq_printf(&iter->seq, + " %5d:%3d:%c %s [%03d] %5d:%3d:%c %s\n", + field->prev_pid, + field->prev_prio, + S, delim, + field->next_cpu, + field->next_pid, + field->next_prio, + T, comm)) + return TRACE_TYPE_PARTIAL_LINE; + + return TRACE_TYPE_HANDLED; +} + +static enum print_line_t trace_ctx_print(struct trace_iterator *iter, int flags) +{ + return trace_ctxwake_print(iter, "==>"); +} + +static enum print_line_t trace_wake_print(struct trace_iterator *iter, + int flags) +{ + return trace_ctxwake_print(iter, " +"); +} + +static int trace_ctxwake_raw(struct trace_iterator *iter, char S) +{ + struct ctx_switch_entry *field; + int T; + + trace_assign_type(field, iter->ent); + + if (!S) + task_state_char(field->prev_state); + T = task_state_char(field->next_state); + if (!trace_seq_printf(&iter->seq, "%d %d %c %d %d %d %c\n", + field->prev_pid, + field->prev_prio, + S, + field->next_cpu, + field->next_pid, + field->next_prio, + T)) + return TRACE_TYPE_PARTIAL_LINE; + + return TRACE_TYPE_HANDLED; +} + +static enum print_line_t trace_ctx_raw(struct trace_iterator *iter, int flags) +{ + return trace_ctxwake_raw(iter, 0); +} + +static enum print_line_t trace_wake_raw(struct trace_iterator *iter, int flags) +{ + return trace_ctxwake_raw(iter, '+'); +} + + +static int trace_ctxwake_hex(struct trace_iterator *iter, char S) +{ + struct ctx_switch_entry *field; + struct trace_seq *s = &iter->seq; + int T; + + trace_assign_type(field, iter->ent); + + if (!S) + task_state_char(field->prev_state); + T = task_state_char(field->next_state); + + SEQ_PUT_HEX_FIELD_RET(s, field->prev_pid); + SEQ_PUT_HEX_FIELD_RET(s, field->prev_prio); + SEQ_PUT_HEX_FIELD_RET(s, S); + SEQ_PUT_HEX_FIELD_RET(s, field->next_cpu); + SEQ_PUT_HEX_FIELD_RET(s, field->next_pid); + SEQ_PUT_HEX_FIELD_RET(s, field->next_prio); + SEQ_PUT_HEX_FIELD_RET(s, T); + + return TRACE_TYPE_HANDLED; +} + +static enum print_line_t trace_ctx_hex(struct trace_iterator *iter, int flags) +{ + return trace_ctxwake_hex(iter, 0); +} + +static enum print_line_t trace_wake_hex(struct trace_iterator *iter, int flags) +{ + return trace_ctxwake_hex(iter, '+'); +} + +static enum print_line_t trace_ctxwake_bin(struct trace_iterator *iter, + int flags) +{ + struct ctx_switch_entry *field; + struct trace_seq *s = &iter->seq; + + trace_assign_type(field, iter->ent); + + SEQ_PUT_FIELD_RET(s, field->prev_pid); + SEQ_PUT_FIELD_RET(s, field->prev_prio); + SEQ_PUT_FIELD_RET(s, field->prev_state); + SEQ_PUT_FIELD_RET(s, field->next_pid); + SEQ_PUT_FIELD_RET(s, field->next_prio); + SEQ_PUT_FIELD_RET(s, field->next_state); + + return TRACE_TYPE_HANDLED; +} + +static struct trace_event trace_ctx_event = { + .type = TRACE_CTX, + .trace = trace_ctx_print, + .latency_trace = trace_ctx_print, + .raw = trace_ctx_raw, + .hex = trace_ctx_hex, + .binary = trace_ctxwake_bin, +}; + +static struct trace_event trace_wake_event = { + .type = TRACE_WAKE, + .trace = trace_wake_print, + .latency_trace = trace_wake_print, + .raw = trace_wake_raw, + .hex = trace_wake_hex, + .binary = trace_ctxwake_bin, +}; + +/* TRACE_SPECIAL */ +static enum print_line_t trace_special_print(struct trace_iterator *iter, + int flags) +{ + struct special_entry *field; + + trace_assign_type(field, iter->ent); + + if (!trace_seq_printf(&iter->seq, "# %ld %ld %ld\n", + field->arg1, + field->arg2, + field->arg3)) + return TRACE_TYPE_PARTIAL_LINE; + + return TRACE_TYPE_HANDLED; +} + +static enum print_line_t trace_special_hex(struct trace_iterator *iter, + int flags) +{ + struct special_entry *field; + struct trace_seq *s = &iter->seq; + + trace_assign_type(field, iter->ent); + + SEQ_PUT_HEX_FIELD_RET(s, field->arg1); + SEQ_PUT_HEX_FIELD_RET(s, field->arg2); + SEQ_PUT_HEX_FIELD_RET(s, field->arg3); + + return TRACE_TYPE_HANDLED; +} + +static enum print_line_t trace_special_bin(struct trace_iterator *iter, + int flags) +{ + struct special_entry *field; + struct trace_seq *s = &iter->seq; + + trace_assign_type(field, iter->ent); + + SEQ_PUT_FIELD_RET(s, field->arg1); + SEQ_PUT_FIELD_RET(s, field->arg2); + SEQ_PUT_FIELD_RET(s, field->arg3); + + return TRACE_TYPE_HANDLED; +} + +static struct trace_event trace_special_event = { + .type = TRACE_SPECIAL, + .trace = trace_special_print, + .latency_trace = trace_special_print, + .raw = trace_special_print, + .hex = trace_special_hex, + .binary = trace_special_bin, +}; + +/* TRACE_STACK */ + +static enum print_line_t trace_stack_print(struct trace_iterator *iter, + int flags) +{ + struct stack_entry *field; + struct trace_seq *s = &iter->seq; + int i; + + trace_assign_type(field, iter->ent); + + for (i = 0; i < FTRACE_STACK_ENTRIES; i++) { + if (i) { + if (!trace_seq_puts(s, " <= ")) + goto partial; + + if (!seq_print_ip_sym(s, field->caller[i], flags)) + goto partial; + } + if (!trace_seq_puts(s, "\n")) + goto partial; + } + + return TRACE_TYPE_HANDLED; + + partial: + return TRACE_TYPE_PARTIAL_LINE; +} + +static struct trace_event trace_stack_event = { + .type = TRACE_STACK, + .trace = trace_stack_print, + .latency_trace = trace_stack_print, + .raw = trace_special_print, + .hex = trace_special_hex, + .binary = trace_special_bin, +}; + +/* TRACE_USER_STACK */ +static enum print_line_t trace_user_stack_print(struct trace_iterator *iter, + int flags) +{ + struct userstack_entry *field; + struct trace_seq *s = &iter->seq; + + trace_assign_type(field, iter->ent); + + if (!seq_print_userip_objs(field, s, flags)) + goto partial; + + if (!trace_seq_putc(s, '\n')) + goto partial; + + return TRACE_TYPE_HANDLED; + + partial: + return TRACE_TYPE_PARTIAL_LINE; +} + +static struct trace_event trace_user_stack_event = { + .type = TRACE_USER_STACK, + .trace = trace_user_stack_print, + .latency_trace = trace_user_stack_print, + .raw = trace_special_print, + .hex = trace_special_hex, + .binary = trace_special_bin, +}; + +/* TRACE_PRINT */ +static enum print_line_t trace_print_print(struct trace_iterator *iter, + int flags) +{ + struct print_entry *field; + struct trace_seq *s = &iter->seq; + + trace_assign_type(field, iter->ent); + + if (!seq_print_ip_sym(s, field->ip, flags)) + goto partial; + + if (!trace_seq_printf(s, ": %s", field->buf)) + goto partial; + + return TRACE_TYPE_HANDLED; + + partial: + return TRACE_TYPE_PARTIAL_LINE; +} + +static enum print_line_t trace_print_raw(struct trace_iterator *iter, int flags) +{ + struct print_entry *field; + + trace_assign_type(field, iter->ent); + + if (!trace_seq_printf(&iter->seq, "# %lx %s", field->ip, field->buf)) + goto partial; + + return TRACE_TYPE_HANDLED; + + partial: + return TRACE_TYPE_PARTIAL_LINE; +} + +static struct trace_event trace_print_event = { + .type = TRACE_PRINT, + .trace = trace_print_print, + .latency_trace = trace_print_print, + .raw = trace_print_raw, +}; + +static struct trace_event *events[] __initdata = { + &trace_fn_event, + &trace_ctx_event, + &trace_wake_event, + &trace_special_event, + &trace_stack_event, + &trace_user_stack_event, + &trace_print_event, + NULL +}; + +__init static int init_events(void) +{ + struct trace_event *event; + int i, ret; + + for (i = 0; events[i]; i++) { + event = events[i]; + + ret = register_ftrace_event(event); + if (!ret) { + printk(KERN_WARNING "event %d failed to register\n", + event->type); + WARN_ON_ONCE(1); + } + } + + return 0; +} +device_initcall(init_events); Index: linux-2.6-tip/kernel/trace/trace_output.h =================================================================== --- /dev/null +++ linux-2.6-tip/kernel/trace/trace_output.h @@ -0,0 +1,62 @@ +#ifndef __TRACE_EVENTS_H +#define __TRACE_EVENTS_H + +#include "trace.h" + +typedef enum print_line_t (*trace_print_func)(struct trace_iterator *iter, + int flags); + +struct trace_event { + struct hlist_node node; + int type; + trace_print_func trace; + trace_print_func latency_trace; + trace_print_func raw; + trace_print_func hex; + trace_print_func binary; +}; + +extern int trace_seq_printf(struct trace_seq *s, const char *fmt, ...) + __attribute__ ((format (printf, 2, 3))); +extern int +seq_print_ip_sym(struct trace_seq *s, unsigned long ip, + unsigned long sym_flags); +extern ssize_t trace_seq_to_user(struct trace_seq *s, char __user *ubuf, + size_t cnt); +int trace_seq_puts(struct trace_seq *s, const char *str); +int trace_seq_putc(struct trace_seq *s, unsigned char c); +int trace_seq_putmem(struct trace_seq *s, void *mem, size_t len); +int trace_seq_putmem_hex(struct trace_seq *s, void *mem, size_t len); +int trace_seq_path(struct trace_seq *s, struct path *path); +int seq_print_userip_objs(const struct userstack_entry *entry, + struct trace_seq *s, unsigned long sym_flags); +int seq_print_user_ip(struct trace_seq *s, struct mm_struct *mm, + unsigned long ip, unsigned long sym_flags); + +int trace_print_context(struct trace_iterator *iter); +int trace_print_lat_context(struct trace_iterator *iter); + +struct trace_event *ftrace_find_event(int type); +int register_ftrace_event(struct trace_event *event); +int unregister_ftrace_event(struct trace_event *event); + +enum print_line_t trace_nop_print(struct trace_iterator *iter, int flags); + +#define MAX_MEMHEX_BYTES 8 +#define HEX_CHARS (MAX_MEMHEX_BYTES*2 + 1) + +#define SEQ_PUT_FIELD_RET(s, x) \ +do { \ + if (!trace_seq_putmem(s, &(x), sizeof(x))) \ + return TRACE_TYPE_PARTIAL_LINE; \ +} while (0) + +#define SEQ_PUT_HEX_FIELD_RET(s, x) \ +do { \ + BUILD_BUG_ON(sizeof(x) > MAX_MEMHEX_BYTES); \ + if (!trace_seq_putmem_hex(s, &(x), sizeof(x))) \ + return TRACE_TYPE_PARTIAL_LINE; \ +} while (0) + +#endif + Index: linux-2.6-tip/kernel/trace/trace_power.c =================================================================== --- linux-2.6-tip.orig/kernel/trace/trace_power.c +++ linux-2.6-tip/kernel/trace/trace_power.c @@ -11,24 +11,126 @@ #include #include -#include +#include #include #include #include "trace.h" +#include "trace_output.h" static struct trace_array *power_trace; static int __read_mostly trace_power_enabled; +static void probe_power_start(struct power_trace *it, unsigned int type, + unsigned int level) +{ + if (!trace_power_enabled) + return; + + memset(it, 0, sizeof(struct power_trace)); + it->state = level; + it->type = type; + it->stamp = ktime_get(); +} + + +static void probe_power_end(struct power_trace *it) +{ + struct ring_buffer_event *event; + struct trace_power *entry; + struct trace_array_cpu *data; + struct trace_array *tr = power_trace; + + if (!trace_power_enabled) + return; + + preempt_disable(); + it->end = ktime_get(); + data = tr->data[smp_processor_id()]; + + event = trace_buffer_lock_reserve(tr, TRACE_POWER, + sizeof(*entry), 0, 0); + if (!event) + goto out; + entry = ring_buffer_event_data(event); + entry->state_data = *it; + trace_buffer_unlock_commit(tr, event, 0, 0); + out: + preempt_enable(); +} + +static void probe_power_mark(struct power_trace *it, unsigned int type, + unsigned int level) +{ + struct ring_buffer_event *event; + struct trace_power *entry; + struct trace_array_cpu *data; + struct trace_array *tr = power_trace; + + if (!trace_power_enabled) + return; + + memset(it, 0, sizeof(struct power_trace)); + it->state = level; + it->type = type; + it->stamp = ktime_get(); + preempt_disable(); + it->end = it->stamp; + data = tr->data[smp_processor_id()]; + + event = trace_buffer_lock_reserve(tr, TRACE_POWER, + sizeof(*entry), 0, 0); + if (!event) + goto out; + entry = ring_buffer_event_data(event); + entry->state_data = *it; + trace_buffer_unlock_commit(tr, event, 0, 0); + out: + preempt_enable(); +} + +static int tracing_power_register(void) +{ + int ret; + + ret = register_trace_power_start(probe_power_start); + if (ret) { + pr_info("power trace: Couldn't activate tracepoint" + " probe to trace_power_start\n"); + return ret; + } + ret = register_trace_power_end(probe_power_end); + if (ret) { + pr_info("power trace: Couldn't activate tracepoint" + " probe to trace_power_end\n"); + goto fail_start; + } + ret = register_trace_power_mark(probe_power_mark); + if (ret) { + pr_info("power trace: Couldn't activate tracepoint" + " probe to trace_power_mark\n"); + goto fail_end; + } + return ret; +fail_end: + unregister_trace_power_end(probe_power_end); +fail_start: + unregister_trace_power_start(probe_power_start); + return ret; +} static void start_power_trace(struct trace_array *tr) { trace_power_enabled = 1; + tracing_power_register(); } static void stop_power_trace(struct trace_array *tr) { trace_power_enabled = 0; + unregister_trace_power_start(probe_power_start); + unregister_trace_power_end(probe_power_end); + unregister_trace_power_mark(probe_power_mark); } @@ -38,6 +140,7 @@ static int power_trace_init(struct trace power_trace = tr; trace_power_enabled = 1; + tracing_power_register(); for_each_cpu(cpu, cpu_possible_mask) tracing_reset(tr, cpu); @@ -94,86 +197,3 @@ static int init_power_trace(void) return register_tracer(&power_tracer); } device_initcall(init_power_trace); - -void trace_power_start(struct power_trace *it, unsigned int type, - unsigned int level) -{ - if (!trace_power_enabled) - return; - - memset(it, 0, sizeof(struct power_trace)); - it->state = level; - it->type = type; - it->stamp = ktime_get(); -} -EXPORT_SYMBOL_GPL(trace_power_start); - - -void trace_power_end(struct power_trace *it) -{ - struct ring_buffer_event *event; - struct trace_power *entry; - struct trace_array_cpu *data; - unsigned long irq_flags; - struct trace_array *tr = power_trace; - - if (!trace_power_enabled) - return; - - preempt_disable(); - it->end = ktime_get(); - data = tr->data[smp_processor_id()]; - - event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), - &irq_flags); - if (!event) - goto out; - entry = ring_buffer_event_data(event); - tracing_generic_entry_update(&entry->ent, 0, 0); - entry->ent.type = TRACE_POWER; - entry->state_data = *it; - ring_buffer_unlock_commit(tr->buffer, event, irq_flags); - - trace_wake_up(); - - out: - preempt_enable(); -} -EXPORT_SYMBOL_GPL(trace_power_end); - -void trace_power_mark(struct power_trace *it, unsigned int type, - unsigned int level) -{ - struct ring_buffer_event *event; - struct trace_power *entry; - struct trace_array_cpu *data; - unsigned long irq_flags; - struct trace_array *tr = power_trace; - - if (!trace_power_enabled) - return; - - memset(it, 0, sizeof(struct power_trace)); - it->state = level; - it->type = type; - it->stamp = ktime_get(); - preempt_disable(); - it->end = it->stamp; - data = tr->data[smp_processor_id()]; - - event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), - &irq_flags); - if (!event) - goto out; - entry = ring_buffer_event_data(event); - tracing_generic_entry_update(&entry->ent, 0, 0); - entry->ent.type = TRACE_POWER; - entry->state_data = *it; - ring_buffer_unlock_commit(tr->buffer, event, irq_flags); - - trace_wake_up(); - - out: - preempt_enable(); -} -EXPORT_SYMBOL_GPL(trace_power_mark); Index: linux-2.6-tip/kernel/trace/trace_sched_switch.c =================================================================== --- linux-2.6-tip.orig/kernel/trace/trace_sched_switch.c +++ linux-2.6-tip/kernel/trace/trace_sched_switch.c @@ -43,7 +43,7 @@ probe_sched_switch(struct rq *__rq, stru data = ctx_trace->data[cpu]; if (likely(!atomic_read(&data->disabled))) - tracing_sched_switch_trace(ctx_trace, data, prev, next, flags, pc); + tracing_sched_switch_trace(ctx_trace, prev, next, flags, pc); local_irq_restore(flags); } @@ -66,7 +66,7 @@ probe_sched_wakeup(struct rq *__rq, stru data = ctx_trace->data[cpu]; if (likely(!atomic_read(&data->disabled))) - tracing_sched_wakeup_trace(ctx_trace, data, wakee, current, + tracing_sched_wakeup_trace(ctx_trace, wakee, current, flags, pc); local_irq_restore(flags); @@ -93,7 +93,7 @@ static int tracing_sched_register(void) ret = register_trace_sched_switch(probe_sched_switch); if (ret) { pr_info("sched trace: Couldn't activate tracepoint" - " probe to kernel_sched_schedule\n"); + " probe to kernel_sched_switch\n"); goto fail_deprobe_wake_new; } @@ -185,12 +185,6 @@ void tracing_sched_switch_assign_trace(s ctx_trace = tr; } -static void start_sched_trace(struct trace_array *tr) -{ - tracing_reset_online_cpus(tr); - tracing_start_sched_switch_record(); -} - static void stop_sched_trace(struct trace_array *tr) { tracing_stop_sched_switch_record(); @@ -199,7 +193,7 @@ static void stop_sched_trace(struct trac static int sched_switch_trace_init(struct trace_array *tr) { ctx_trace = tr; - start_sched_trace(tr); + tracing_start_sched_switch_record(); return 0; } @@ -227,6 +221,7 @@ static struct tracer sched_switch_trace .reset = sched_switch_trace_reset, .start = sched_switch_trace_start, .stop = sched_switch_trace_stop, + .wait_pipe = poll_wait_pipe, #ifdef CONFIG_FTRACE_SELFTEST .selftest = trace_selftest_startup_sched_switch, #endif Index: linux-2.6-tip/kernel/trace/trace_sched_wakeup.c =================================================================== --- linux-2.6-tip.orig/kernel/trace/trace_sched_wakeup.c +++ linux-2.6-tip/kernel/trace/trace_sched_wakeup.c @@ -25,9 +25,9 @@ static int __read_mostly tracer_enabled; static struct task_struct *wakeup_task; static int wakeup_cpu; static unsigned wakeup_prio = -1; +static int wakeup_rt; -static raw_spinlock_t wakeup_lock = - (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; +static __raw_spinlock_t wakeup_lock = __RAW_SPIN_LOCK_UNLOCKED; static void __wakeup_reset(struct trace_array *tr); @@ -71,7 +71,7 @@ wakeup_tracer_call(unsigned long ip, uns if (task_cpu(wakeup_task) != cpu) goto unlock; - trace_function(tr, data, ip, parent_ip, flags, pc); + trace_function(tr, ip, parent_ip, flags, pc); unlock: __raw_spin_unlock(&wakeup_lock); @@ -151,7 +151,8 @@ probe_wakeup_sched_switch(struct rq *rq, if (unlikely(!tracer_enabled || next != wakeup_task)) goto out_unlock; - trace_function(wakeup_trace, data, CALLER_ADDR1, CALLER_ADDR2, flags, pc); + trace_function(wakeup_trace, CALLER_ADDR1, CALLER_ADDR2, flags, pc); + tracing_sched_switch_trace(wakeup_trace, prev, next, flags, pc); /* * usecs conversion is slow so we try to delay the conversion @@ -182,13 +183,10 @@ out: static void __wakeup_reset(struct trace_array *tr) { - struct trace_array_cpu *data; int cpu; - for_each_possible_cpu(cpu) { - data = tr->data[cpu]; + for_each_possible_cpu(cpu) tracing_reset(tr, cpu); - } wakeup_cpu = -1; wakeup_prio = -1; @@ -213,6 +211,7 @@ static void wakeup_reset(struct trace_ar static void probe_wakeup(struct rq *rq, struct task_struct *p, int success) { + struct trace_array_cpu *data; int cpu = smp_processor_id(); unsigned long flags; long disabled; @@ -224,7 +223,7 @@ probe_wakeup(struct rq *rq, struct task_ tracing_record_cmdline(p); tracing_record_cmdline(current); - if (likely(!rt_task(p)) || + if ((wakeup_rt && !rt_task(p)) || p->prio >= wakeup_prio || p->prio >= current->prio) return; @@ -252,9 +251,10 @@ probe_wakeup(struct rq *rq, struct task_ local_save_flags(flags); - wakeup_trace->data[wakeup_cpu]->preempt_timestamp = ftrace_now(cpu); - trace_function(wakeup_trace, wakeup_trace->data[wakeup_cpu], - CALLER_ADDR1, CALLER_ADDR2, flags, pc); + data = wakeup_trace->data[wakeup_cpu]; + data->preempt_timestamp = ftrace_now(cpu); + tracing_sched_wakeup_trace(wakeup_trace, p, current, flags, pc); + trace_function(wakeup_trace, CALLER_ADDR1, CALLER_ADDR2, flags, pc); out_locked: __raw_spin_unlock(&wakeup_lock); @@ -262,12 +262,6 @@ out: atomic_dec(&wakeup_trace->data[cpu]->disabled); } -/* - * save_tracer_enabled is used to save the state of the tracer_enabled - * variable when we disable it when we open a trace output file. - */ -static int save_tracer_enabled; - static void start_wakeup_tracer(struct trace_array *tr) { int ret; @@ -289,7 +283,7 @@ static void start_wakeup_tracer(struct t ret = register_trace_sched_switch(probe_wakeup_sched_switch); if (ret) { pr_info("sched trace: Couldn't activate tracepoint" - " probe to kernel_sched_schedule\n"); + " probe to kernel_sched_switch\n"); goto fail_deprobe_wake_new; } @@ -306,13 +300,10 @@ static void start_wakeup_tracer(struct t register_ftrace_function(&trace_ops); - if (tracing_is_enabled()) { + if (tracing_is_enabled()) tracer_enabled = 1; - save_tracer_enabled = 1; - } else { + else tracer_enabled = 0; - save_tracer_enabled = 0; - } return; fail_deprobe_wake_new: @@ -324,14 +315,13 @@ fail_deprobe: static void stop_wakeup_tracer(struct trace_array *tr) { tracer_enabled = 0; - save_tracer_enabled = 0; unregister_ftrace_function(&trace_ops); unregister_trace_sched_switch(probe_wakeup_sched_switch); unregister_trace_sched_wakeup_new(probe_wakeup); unregister_trace_sched_wakeup(probe_wakeup); } -static int wakeup_tracer_init(struct trace_array *tr) +static int __wakeup_tracer_init(struct trace_array *tr) { tracing_max_latency = 0; wakeup_trace = tr; @@ -339,6 +329,18 @@ static int wakeup_tracer_init(struct tra return 0; } +static int wakeup_tracer_init(struct trace_array *tr) +{ + wakeup_rt = 0; + return __wakeup_tracer_init(tr); +} + +static int wakeup_rt_tracer_init(struct trace_array *tr) +{ + wakeup_rt = 1; + return __wakeup_tracer_init(tr); +} + static void wakeup_tracer_reset(struct trace_array *tr) { stop_wakeup_tracer(tr); @@ -350,28 +352,11 @@ static void wakeup_tracer_start(struct t { wakeup_reset(tr); tracer_enabled = 1; - save_tracer_enabled = 1; } static void wakeup_tracer_stop(struct trace_array *tr) { tracer_enabled = 0; - save_tracer_enabled = 0; -} - -static void wakeup_tracer_open(struct trace_iterator *iter) -{ - /* stop the trace while dumping */ - tracer_enabled = 0; -} - -static void wakeup_tracer_close(struct trace_iterator *iter) -{ - /* forget about any processes we were recording */ - if (save_tracer_enabled) { - wakeup_reset(iter->tr); - tracer_enabled = 1; - } } static struct tracer wakeup_tracer __read_mostly = @@ -381,8 +366,20 @@ static struct tracer wakeup_tracer __rea .reset = wakeup_tracer_reset, .start = wakeup_tracer_start, .stop = wakeup_tracer_stop, - .open = wakeup_tracer_open, - .close = wakeup_tracer_close, + .print_max = 1, +#ifdef CONFIG_FTRACE_SELFTEST + .selftest = trace_selftest_startup_wakeup, +#endif +}; + +static struct tracer wakeup_rt_tracer __read_mostly = +{ + .name = "wakeup_rt", + .init = wakeup_rt_tracer_init, + .reset = wakeup_tracer_reset, + .start = wakeup_tracer_start, + .stop = wakeup_tracer_stop, + .wait_pipe = poll_wait_pipe, .print_max = 1, #ifdef CONFIG_FTRACE_SELFTEST .selftest = trace_selftest_startup_wakeup, @@ -397,6 +394,10 @@ __init static int init_wakeup_tracer(voi if (ret) return ret; + ret = register_tracer(&wakeup_rt_tracer); + if (ret) + return ret; + return 0; } device_initcall(init_wakeup_tracer); Index: linux-2.6-tip/kernel/trace/trace_selftest.c =================================================================== --- linux-2.6-tip.orig/kernel/trace/trace_selftest.c +++ linux-2.6-tip/kernel/trace/trace_selftest.c @@ -9,11 +9,12 @@ static inline int trace_valid_entry(stru case TRACE_FN: case TRACE_CTX: case TRACE_WAKE: - case TRACE_CONT: case TRACE_STACK: case TRACE_PRINT: case TRACE_SPECIAL: case TRACE_BRANCH: + case TRACE_GRAPH_ENT: + case TRACE_GRAPH_RET: return 1; } return 0; @@ -125,9 +126,9 @@ int trace_selftest_startup_dynamic_traci func(); /* - * Some archs *cough*PowerPC*cough* add charachters to the + * Some archs *cough*PowerPC*cough* add characters to the * start of the function names. We simply put a '*' to - * accomodate them. + * accommodate them. */ func_name = "*" STR(DYN_FTRACE_TEST_NAME); @@ -135,7 +136,7 @@ int trace_selftest_startup_dynamic_traci ftrace_set_filter(func_name, strlen(func_name), 1); /* enable tracing */ - ret = trace->init(tr); + ret = tracer_init(trace, tr); if (ret) { warn_failed_init_tracer(trace, ret); goto out; @@ -209,7 +210,7 @@ trace_selftest_startup_function(struct t ftrace_enabled = 1; tracer_enabled = 1; - ret = trace->init(tr); + ret = tracer_init(trace, tr); if (ret) { warn_failed_init_tracer(trace, ret); goto out; @@ -247,6 +248,54 @@ trace_selftest_startup_function(struct t } #endif /* CONFIG_FUNCTION_TRACER */ + +#ifdef CONFIG_FUNCTION_GRAPH_TRACER +/* + * Pretty much the same than for the function tracer from which the selftest + * has been borrowed. + */ +int +trace_selftest_startup_function_graph(struct tracer *trace, + struct trace_array *tr) +{ + int ret; + unsigned long count; + + ret = tracer_init(trace, tr); + if (ret) { + warn_failed_init_tracer(trace, ret); + goto out; + } + + /* Sleep for a 1/10 of a second */ + msleep(100); + + tracing_stop(); + + /* check the trace buffer */ + ret = trace_test_buffer(tr, &count); + + trace->reset(tr); + tracing_start(); + + if (!ret && !count) { + printk(KERN_CONT ".. no entries found .."); + ret = -1; + goto out; + } + + /* Don't test dynamic tracing, the function tracer already did */ + +out: + /* Stop it if we failed */ + if (ret) + ftrace_graph_stop(); + + return ret; +} +#endif /* CONFIG_FUNCTION_GRAPH_TRACER */ + + #ifdef CONFIG_IRQSOFF_TRACER int trace_selftest_startup_irqsoff(struct tracer *trace, struct trace_array *tr) @@ -256,7 +305,7 @@ trace_selftest_startup_irqsoff(struct tr int ret; /* start the tracing */ - ret = trace->init(tr); + ret = tracer_init(trace, tr); if (ret) { warn_failed_init_tracer(trace, ret); return ret; @@ -310,7 +359,7 @@ trace_selftest_startup_preemptoff(struct } /* start the tracing */ - ret = trace->init(tr); + ret = tracer_init(trace, tr); if (ret) { warn_failed_init_tracer(trace, ret); return ret; @@ -364,7 +413,7 @@ trace_selftest_startup_preemptirqsoff(st } /* start the tracing */ - ret = trace->init(tr); + ret = tracer_init(trace, tr); if (ret) { warn_failed_init_tracer(trace, ret); goto out; @@ -496,7 +545,7 @@ trace_selftest_startup_wakeup(struct tra wait_for_completion(&isrt); /* start the tracing */ - ret = trace->init(tr); + ret = tracer_init(trace, tr); if (ret) { warn_failed_init_tracer(trace, ret); return ret; @@ -557,7 +606,7 @@ trace_selftest_startup_sched_switch(stru int ret; /* start the tracing */ - ret = trace->init(tr); + ret = tracer_init(trace, tr); if (ret) { warn_failed_init_tracer(trace, ret); return ret; @@ -589,10 +638,10 @@ trace_selftest_startup_sysprof(struct tr int ret; /* start the tracing */ - ret = trace->init(tr); + ret = tracer_init(trace, tr); if (ret) { warn_failed_init_tracer(trace, ret); - return 0; + return ret; } /* Sleep for a 1/10 of a second */ @@ -604,6 +653,11 @@ trace_selftest_startup_sysprof(struct tr trace->reset(tr); tracing_start(); + if (!ret && !count) { + printk(KERN_CONT ".. no entries found .."); + ret = -1; + } + return ret; } #endif /* CONFIG_SYSPROF_TRACER */ @@ -616,7 +670,7 @@ trace_selftest_startup_branch(struct tra int ret; /* start the tracing */ - ret = trace->init(tr); + ret = tracer_init(trace, tr); if (ret) { warn_failed_init_tracer(trace, ret); return ret; @@ -631,6 +685,11 @@ trace_selftest_startup_branch(struct tra trace->reset(tr); tracing_start(); + if (!ret && !count) { + printk(KERN_CONT ".. no entries found .."); + ret = -1; + } + return ret; } #endif /* CONFIG_BRANCH_TRACER */ Index: linux-2.6-tip/kernel/trace/trace_stat.c =================================================================== --- /dev/null +++ linux-2.6-tip/kernel/trace/trace_stat.c @@ -0,0 +1,319 @@ +/* + * Infrastructure for statistic tracing (histogram output). + * + * Copyright (C) 2008 Frederic Weisbecker + * + * Based on the code from trace_branch.c which is + * Copyright (C) 2008 Steven Rostedt + * + */ + + +#include +#include +#include "trace_stat.h" +#include "trace.h" + + +/* List of stat entries from a tracer */ +struct trace_stat_list { + struct list_head list; + void *stat; +}; + +/* A stat session is the stats output in one file */ +struct tracer_stat_session { + struct list_head session_list; + struct tracer_stat *ts; + struct list_head stat_list; + struct mutex stat_mutex; + struct dentry *file; +}; + +/* All of the sessions currently in use. Each stat file embed one session */ +static LIST_HEAD(all_stat_sessions); +static DEFINE_MUTEX(all_stat_sessions_mutex); + +/* The root directory for all stat files */ +static struct dentry *stat_dir; + + +static void reset_stat_session(struct tracer_stat_session *session) +{ + struct trace_stat_list *node, *next; + + list_for_each_entry_safe(node, next, &session->stat_list, list) + kfree(node); + + INIT_LIST_HEAD(&session->stat_list); +} + +static void destroy_session(struct tracer_stat_session *session) +{ + debugfs_remove(session->file); + reset_stat_session(session); + mutex_destroy(&session->stat_mutex); + kfree(session); +} + +/* + * For tracers that don't provide a stat_cmp callback. + * This one will force an immediate insertion on tail of + * the list. + */ +static int dummy_cmp(void *p1, void *p2) +{ + return 1; +} + +/* + * Initialize the stat list at each trace_stat file opening. + * All of these copies and sorting are required on all opening + * since the stats could have changed between two file sessions. + */ +static int stat_seq_init(struct tracer_stat_session *session) +{ + struct trace_stat_list *iter_entry, *new_entry; + struct tracer_stat *ts = session->ts; + void *prev_stat; + int ret = 0; + int i; + + mutex_lock(&session->stat_mutex); + reset_stat_session(session); + + if (!ts->stat_cmp) + ts->stat_cmp = dummy_cmp; + + /* + * The first entry. Actually this is the second, but the first + * one (the stat_list head) is pointless. + */ + new_entry = kmalloc(sizeof(struct trace_stat_list), GFP_KERNEL); + if (!new_entry) { + ret = -ENOMEM; + goto exit; + } + + INIT_LIST_HEAD(&new_entry->list); + + list_add(&new_entry->list, &session->stat_list); + + new_entry->stat = ts->stat_start(); + prev_stat = new_entry->stat; + + /* + * Iterate over the tracer stat entries and store them in a sorted + * list. + */ + for (i = 1; ; i++) { + new_entry = kmalloc(sizeof(struct trace_stat_list), GFP_KERNEL); + if (!new_entry) { + ret = -ENOMEM; + goto exit_free_list; + } + + INIT_LIST_HEAD(&new_entry->list); + new_entry->stat = ts->stat_next(prev_stat, i); + + /* End of insertion */ + if (!new_entry->stat) + break; + + list_for_each_entry(iter_entry, &session->stat_list, list) { + + /* Insertion with a descendent sorting */ + if (ts->stat_cmp(new_entry->stat, + iter_entry->stat) > 0) { + + list_add_tail(&new_entry->list, + &iter_entry->list); + break; + + /* The current smaller value */ + } else if (list_is_last(&iter_entry->list, + &session->stat_list)) { + list_add(&new_entry->list, &iter_entry->list); + break; + } + } + + prev_stat = new_entry->stat; + } +exit: + mutex_unlock(&session->stat_mutex); + return ret; + +exit_free_list: + reset_stat_session(session); + mutex_unlock(&session->stat_mutex); + return ret; +} + + +static void *stat_seq_start(struct seq_file *s, loff_t *pos) +{ + struct tracer_stat_session *session = s->private; + + /* Prevent from tracer switch or stat_list modification */ + mutex_lock(&session->stat_mutex); + + /* If we are in the beginning of the file, print the headers */ + if (!*pos && session->ts->stat_headers) + session->ts->stat_headers(s); + + return seq_list_start(&session->stat_list, *pos); +} + +static void *stat_seq_next(struct seq_file *s, void *p, loff_t *pos) +{ + struct tracer_stat_session *session = s->private; + + return seq_list_next(p, &session->stat_list, pos); +} + +static void stat_seq_stop(struct seq_file *s, void *p) +{ + struct tracer_stat_session *session = s->private; + mutex_unlock(&session->stat_mutex); +} + +static int stat_seq_show(struct seq_file *s, void *v) +{ + struct tracer_stat_session *session = s->private; + struct trace_stat_list *l = list_entry(v, struct trace_stat_list, list); + + return session->ts->stat_show(s, l->stat); +} + +static const struct seq_operations trace_stat_seq_ops = { + .start = stat_seq_start, + .next = stat_seq_next, + .stop = stat_seq_stop, + .show = stat_seq_show +}; + +/* The session stat is refilled and resorted at each stat file opening */ +static int tracing_stat_open(struct inode *inode, struct file *file) +{ + int ret; + + struct tracer_stat_session *session = inode->i_private; + + ret = seq_open(file, &trace_stat_seq_ops); + if (!ret) { + struct seq_file *m = file->private_data; + m->private = session; + ret = stat_seq_init(session); + } + + return ret; +} + +/* + * Avoid consuming memory with our now useless list. + */ +static int tracing_stat_release(struct inode *i, struct file *f) +{ + struct tracer_stat_session *session = i->i_private; + + mutex_lock(&session->stat_mutex); + reset_stat_session(session); + mutex_unlock(&session->stat_mutex); + + return 0; +} + +static const struct file_operations tracing_stat_fops = { + .open = tracing_stat_open, + .read = seq_read, + .llseek = seq_lseek, + .release = tracing_stat_release +}; + +static int tracing_stat_init(void) +{ + struct dentry *d_tracing; + + d_tracing = tracing_init_dentry(); + + stat_dir = debugfs_create_dir("trace_stat", d_tracing); + if (!stat_dir) + pr_warning("Could not create debugfs " + "'trace_stat' entry\n"); + return 0; +} + +static int init_stat_file(struct tracer_stat_session *session) +{ + if (!stat_dir && tracing_stat_init()) + return -ENODEV; + + session->file = debugfs_create_file(session->ts->name, 0644, + stat_dir, + session, &tracing_stat_fops); + if (!session->file) + return -ENOMEM; + return 0; +} + +int register_stat_tracer(struct tracer_stat *trace) +{ + struct tracer_stat_session *session, *node, *tmp; + int ret; + + if (!trace) + return -EINVAL; + + if (!trace->stat_start || !trace->stat_next || !trace->stat_show) + return -EINVAL; + + /* Already registered? */ + mutex_lock(&all_stat_sessions_mutex); + list_for_each_entry_safe(node, tmp, &all_stat_sessions, session_list) { + if (node->ts == trace) { + mutex_unlock(&all_stat_sessions_mutex); + return -EINVAL; + } + } + mutex_unlock(&all_stat_sessions_mutex); + + /* Init the session */ + session = kmalloc(sizeof(struct tracer_stat_session), GFP_KERNEL); + if (!session) + return -ENOMEM; + + session->ts = trace; + INIT_LIST_HEAD(&session->session_list); + INIT_LIST_HEAD(&session->stat_list); + mutex_init(&session->stat_mutex); + session->file = NULL; + + ret = init_stat_file(session); + if (ret) { + destroy_session(session); + return ret; + } + + /* Register */ + mutex_lock(&all_stat_sessions_mutex); + list_add_tail(&session->session_list, &all_stat_sessions); + mutex_unlock(&all_stat_sessions_mutex); + + return 0; +} + +void unregister_stat_tracer(struct tracer_stat *trace) +{ + struct tracer_stat_session *node, *tmp; + + mutex_lock(&all_stat_sessions_mutex); + list_for_each_entry_safe(node, tmp, &all_stat_sessions, session_list) { + if (node->ts == trace) { + list_del(&node->session_list); + destroy_session(node); + break; + } + } + mutex_unlock(&all_stat_sessions_mutex); +} Index: linux-2.6-tip/kernel/trace/trace_stat.h =================================================================== --- /dev/null +++ linux-2.6-tip/kernel/trace/trace_stat.h @@ -0,0 +1,31 @@ +#ifndef __TRACE_STAT_H +#define __TRACE_STAT_H + +#include + +/* + * If you want to provide a stat file (one-shot statistics), fill + * an iterator with stat_start/stat_next and a stat_show callbacks. + * The others callbacks are optional. + */ +struct tracer_stat { + /* The name of your stat file */ + const char *name; + /* Iteration over statistic entries */ + void *(*stat_start)(void); + void *(*stat_next)(void *prev, int idx); + /* Compare two entries for stats sorting */ + int (*stat_cmp)(void *p1, void *p2); + /* Print a stat entry */ + int (*stat_show)(struct seq_file *s, void *p); + /* Print the headers of your stat entries */ + int (*stat_headers)(struct seq_file *s); +}; + +/* + * Destroy or create a stat file + */ +extern int register_stat_tracer(struct tracer_stat *trace); +extern void unregister_stat_tracer(struct tracer_stat *trace); + +#endif /* __TRACE_STAT_H */ Index: linux-2.6-tip/kernel/trace/trace_sysprof.c =================================================================== --- linux-2.6-tip.orig/kernel/trace/trace_sysprof.c +++ linux-2.6-tip/kernel/trace/trace_sysprof.c @@ -88,7 +88,7 @@ static void backtrace_address(void *data } } -const static struct stacktrace_ops backtrace_ops = { +static const struct stacktrace_ops backtrace_ops = { .warning = backtrace_warning, .warning_symbol = backtrace_warning_symbol, .stack = backtrace_stack, @@ -226,15 +226,6 @@ static void stop_stack_timers(void) stop_stack_timer(cpu); } -static void start_stack_trace(struct trace_array *tr) -{ - mutex_lock(&sample_timer_lock); - tracing_reset_online_cpus(tr); - start_stack_timers(); - tracer_enabled = 1; - mutex_unlock(&sample_timer_lock); -} - static void stop_stack_trace(struct trace_array *tr) { mutex_lock(&sample_timer_lock); @@ -247,12 +238,18 @@ static int stack_trace_init(struct trace { sysprof_trace = tr; - start_stack_trace(tr); + tracing_start_cmdline_record(); + + mutex_lock(&sample_timer_lock); + start_stack_timers(); + tracer_enabled = 1; + mutex_unlock(&sample_timer_lock); return 0; } static void stack_trace_reset(struct trace_array *tr) { + tracing_stop_cmdline_record(); stop_stack_trace(tr); } @@ -330,5 +327,5 @@ void init_tracer_sysprof_debugfs(struct d_tracer, NULL, &sysprof_sample_fops); if (entry) return; - pr_warning("Could not create debugfs 'dyn_ftrace_total_info' entry\n"); + pr_warning("Could not create debugfs 'sysprof_sample_period' entry\n"); } Index: linux-2.6-tip/kernel/trace/trace_workqueue.c =================================================================== --- /dev/null +++ linux-2.6-tip/kernel/trace/trace_workqueue.c @@ -0,0 +1,281 @@ +/* + * Workqueue statistical tracer. + * + * Copyright (C) 2008 Frederic Weisbecker + * + */ + + +#include +#include +#include +#include "trace_stat.h" +#include "trace.h" + + +/* A cpu workqueue thread */ +struct cpu_workqueue_stats { + struct list_head list; +/* Useful to know if we print the cpu headers */ + bool first_entry; + int cpu; + pid_t pid; +/* Can be inserted from interrupt or user context, need to be atomic */ + atomic_t inserted; +/* + * Don't need to be atomic, works are serialized in a single workqueue thread + * on a single CPU. + */ + unsigned int executed; +}; + +/* List of workqueue threads on one cpu */ +struct workqueue_global_stats { + struct list_head list; + spinlock_t lock; +}; + +/* Don't need a global lock because allocated before the workqueues, and + * never freed. + */ +static DEFINE_PER_CPU(struct workqueue_global_stats, all_workqueue_stat); +#define workqueue_cpu_stat(cpu) (&per_cpu(all_workqueue_stat, cpu)) + +/* Insertion of a work */ +static void +probe_workqueue_insertion(struct task_struct *wq_thread, + struct work_struct *work) +{ + int cpu = cpumask_first(&wq_thread->cpus_allowed); + struct cpu_workqueue_stats *node, *next; + unsigned long flags; + + spin_lock_irqsave(&workqueue_cpu_stat(cpu)->lock, flags); + list_for_each_entry_safe(node, next, &workqueue_cpu_stat(cpu)->list, + list) { + if (node->pid == wq_thread->pid) { + atomic_inc(&node->inserted); + goto found; + } + } + pr_debug("trace_workqueue: entry not found\n"); +found: + spin_unlock_irqrestore(&workqueue_cpu_stat(cpu)->lock, flags); +} + +/* Execution of a work */ +static void +probe_workqueue_execution(struct task_struct *wq_thread, + struct work_struct *work) +{ + int cpu = cpumask_first(&wq_thread->cpus_allowed); + struct cpu_workqueue_stats *node, *next; + unsigned long flags; + + spin_lock_irqsave(&workqueue_cpu_stat(cpu)->lock, flags); + list_for_each_entry_safe(node, next, &workqueue_cpu_stat(cpu)->list, + list) { + if (node->pid == wq_thread->pid) { + node->executed++; + goto found; + } + } + pr_debug("trace_workqueue: entry not found\n"); +found: + spin_unlock_irqrestore(&workqueue_cpu_stat(cpu)->lock, flags); +} + +/* Creation of a cpu workqueue thread */ +static void probe_workqueue_creation(struct task_struct *wq_thread, int cpu) +{ + struct cpu_workqueue_stats *cws; + unsigned long flags; + + WARN_ON(cpu < 0 || cpu >= num_possible_cpus()); + + /* Workqueues are sometimes created in atomic context */ + cws = kzalloc(sizeof(struct cpu_workqueue_stats), GFP_ATOMIC); + if (!cws) { + pr_warning("trace_workqueue: not enough memory\n"); + return; + } + tracing_record_cmdline(wq_thread); + + INIT_LIST_HEAD(&cws->list); + cws->cpu = cpu; + + cws->pid = wq_thread->pid; + + spin_lock_irqsave(&workqueue_cpu_stat(cpu)->lock, flags); + if (list_empty(&workqueue_cpu_stat(cpu)->list)) + cws->first_entry = true; + list_add_tail(&cws->list, &workqueue_cpu_stat(cpu)->list); + spin_unlock_irqrestore(&workqueue_cpu_stat(cpu)->lock, flags); +} + +/* Destruction of a cpu workqueue thread */ +static void probe_workqueue_destruction(struct task_struct *wq_thread) +{ + /* Workqueue only execute on one cpu */ + int cpu = cpumask_first(&wq_thread->cpus_allowed); + struct cpu_workqueue_stats *node, *next; + unsigned long flags; + + spin_lock_irqsave(&workqueue_cpu_stat(cpu)->lock, flags); + list_for_each_entry_safe(node, next, &workqueue_cpu_stat(cpu)->list, + list) { + if (node->pid == wq_thread->pid) { + list_del(&node->list); + kfree(node); + goto found; + } + } + + pr_debug("trace_workqueue: don't find workqueue to destroy\n"); +found: + spin_unlock_irqrestore(&workqueue_cpu_stat(cpu)->lock, flags); + +} + +static struct cpu_workqueue_stats *workqueue_stat_start_cpu(int cpu) +{ + unsigned long flags; + struct cpu_workqueue_stats *ret = NULL; + + + spin_lock_irqsave(&workqueue_cpu_stat(cpu)->lock, flags); + + if (!list_empty(&workqueue_cpu_stat(cpu)->list)) + ret = list_entry(workqueue_cpu_stat(cpu)->list.next, + struct cpu_workqueue_stats, list); + + spin_unlock_irqrestore(&workqueue_cpu_stat(cpu)->lock, flags); + + return ret; +} + +static void *workqueue_stat_start(void) +{ + int cpu; + void *ret = NULL; + + for_each_possible_cpu(cpu) { + ret = workqueue_stat_start_cpu(cpu); + if (ret) + return ret; + } + return NULL; +} + +static void *workqueue_stat_next(void *prev, int idx) +{ + struct cpu_workqueue_stats *prev_cws = prev; + int cpu = prev_cws->cpu; + unsigned long flags; + void *ret = NULL; + + spin_lock_irqsave(&workqueue_cpu_stat(cpu)->lock, flags); + if (list_is_last(&prev_cws->list, &workqueue_cpu_stat(cpu)->list)) { + spin_unlock_irqrestore(&workqueue_cpu_stat(cpu)->lock, flags); + for (++cpu ; cpu < num_possible_cpus(); cpu++) { + ret = workqueue_stat_start_cpu(cpu); + if (ret) + return ret; + } + return NULL; + } + spin_unlock_irqrestore(&workqueue_cpu_stat(cpu)->lock, flags); + + return list_entry(prev_cws->list.next, struct cpu_workqueue_stats, + list); +} + +static int workqueue_stat_show(struct seq_file *s, void *p) +{ + struct cpu_workqueue_stats *cws = p; + unsigned long flags; + int cpu = cws->cpu; + + seq_printf(s, "%3d %6d %6u %s\n", cws->cpu, + atomic_read(&cws->inserted), + cws->executed, + trace_find_cmdline(cws->pid)); + + spin_lock_irqsave(&workqueue_cpu_stat(cpu)->lock, flags); + if (&cws->list == workqueue_cpu_stat(cpu)->list.next) + seq_printf(s, "\n"); + spin_unlock_irqrestore(&workqueue_cpu_stat(cpu)->lock, flags); + + return 0; +} + +static int workqueue_stat_headers(struct seq_file *s) +{ + seq_printf(s, "# CPU INSERTED EXECUTED NAME\n"); + seq_printf(s, "# | | | |\n\n"); + return 0; +} + +struct tracer_stat workqueue_stats __read_mostly = { + .name = "workqueues", + .stat_start = workqueue_stat_start, + .stat_next = workqueue_stat_next, + .stat_show = workqueue_stat_show, + .stat_headers = workqueue_stat_headers +}; + + +int __init stat_workqueue_init(void) +{ + if (register_stat_tracer(&workqueue_stats)) { + pr_warning("Unable to register workqueue stat tracer\n"); + return 1; + } + + return 0; +} +fs_initcall(stat_workqueue_init); + +/* + * Workqueues are created very early, just after pre-smp initcalls. + * So we must register our tracepoints at this stage. + */ +int __init trace_workqueue_early_init(void) +{ + int ret, cpu; + + ret = register_trace_workqueue_insertion(probe_workqueue_insertion); + if (ret) + goto out; + + ret = register_trace_workqueue_execution(probe_workqueue_execution); + if (ret) + goto no_insertion; + + ret = register_trace_workqueue_creation(probe_workqueue_creation); + if (ret) + goto no_execution; + + ret = register_trace_workqueue_destruction(probe_workqueue_destruction); + if (ret) + goto no_creation; + + for_each_possible_cpu(cpu) { + spin_lock_init(&workqueue_cpu_stat(cpu)->lock); + INIT_LIST_HEAD(&workqueue_cpu_stat(cpu)->list); + } + + return 0; + +no_creation: + unregister_trace_workqueue_creation(probe_workqueue_creation); +no_execution: + unregister_trace_workqueue_execution(probe_workqueue_execution); +no_insertion: + unregister_trace_workqueue_insertion(probe_workqueue_insertion); +out: + pr_warning("trace_workqueue: unable to trace workqueues\n"); + + return 1; +} +early_initcall(trace_workqueue_early_init); Index: linux-2.6-tip/kernel/user_namespace.c =================================================================== --- linux-2.6-tip.orig/kernel/user_namespace.c +++ linux-2.6-tip/kernel/user_namespace.c @@ -60,12 +60,25 @@ int create_user_ns(struct cred *new) return 0; } -void free_user_ns(struct kref *kref) +/* + * Deferred destructor for a user namespace. This is required because + * free_user_ns() may be called with uidhash_lock held, but we need to call + * back to free_uid() which will want to take the lock again. + */ +static void free_user_ns_work(struct work_struct *work) { - struct user_namespace *ns; - - ns = container_of(kref, struct user_namespace, kref); + struct user_namespace *ns = + container_of(work, struct user_namespace, destroyer); free_uid(ns->creator); kfree(ns); } + +void free_user_ns(struct kref *kref) +{ + struct user_namespace *ns = + container_of(kref, struct user_namespace, kref); + + INIT_WORK(&ns->destroyer, free_user_ns_work); + schedule_work(&ns->destroyer); +} EXPORT_SYMBOL(free_user_ns); Index: linux-2.6-tip/kernel/workqueue.c =================================================================== --- linux-2.6-tip.orig/kernel/workqueue.c +++ linux-2.6-tip/kernel/workqueue.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #include #include @@ -33,6 +34,9 @@ #include #include #include +#include + +#include /* * The per-CPU workqueue (if single thread, we always use the first @@ -125,9 +129,13 @@ struct cpu_workqueue_struct *get_wq_data return (void *) (atomic_long_read(&work->data) & WORK_STRUCT_WQ_DATA_MASK); } +DEFINE_TRACE(workqueue_insertion); + static void insert_work(struct cpu_workqueue_struct *cwq, struct work_struct *work, struct list_head *head) { + trace_workqueue_insertion(cwq->thread, work); + set_wq_data(work, cwq); /* * Ensure that we get the right work->data if we see the @@ -157,13 +165,14 @@ static void __queue_work(struct cpu_work * * We queue the work to the CPU on which it was submitted, but if the CPU dies * it can be processed by another CPU. + * + * Especially no such guarantee on PREEMPT_RT. */ int queue_work(struct workqueue_struct *wq, struct work_struct *work) { - int ret; + int ret = 0, cpu = raw_smp_processor_id(); - ret = queue_work_on(get_cpu(), wq, work); - put_cpu(); + ret = queue_work_on(cpu, wq, work); return ret; } @@ -200,7 +209,7 @@ static void delayed_work_timer_fn(unsign struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work); struct workqueue_struct *wq = cwq->wq; - __queue_work(wq_per_cpu(wq, smp_processor_id()), &dwork->work); + __queue_work(wq_per_cpu(wq, raw_smp_processor_id()), &dwork->work); } /** @@ -259,6 +268,8 @@ int queue_delayed_work_on(int cpu, struc } EXPORT_SYMBOL_GPL(queue_delayed_work_on); +DEFINE_TRACE(workqueue_execution); + static void run_workqueue(struct cpu_workqueue_struct *cwq) { spin_lock_irq(&cwq->lock); @@ -284,7 +295,7 @@ static void run_workqueue(struct cpu_wor */ struct lockdep_map lockdep_map = work->lockdep_map; #endif - + trace_workqueue_execution(cwq->thread, work); cwq->current_work = work; list_del_init(cwq->worklist.next); spin_unlock_irq(&cwq->lock); @@ -765,6 +776,8 @@ init_cpu_workqueue(struct workqueue_stru return cwq; } +DEFINE_TRACE(workqueue_creation); + static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu) { struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 }; @@ -787,6 +800,8 @@ static int create_workqueue_thread(struc sched_setscheduler_nocheck(p, SCHED_FIFO, ¶m); cwq->thread = p; + trace_workqueue_creation(cwq->thread, cpu); + return 0; } @@ -868,6 +883,8 @@ struct workqueue_struct *__create_workqu } EXPORT_SYMBOL_GPL(__create_workqueue_key); +DEFINE_TRACE(workqueue_destruction); + static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq) { /* @@ -891,10 +908,54 @@ static void cleanup_workqueue_thread(str * checks list_empty(), and a "normal" queue_work() can't use * a dead CPU. */ + trace_workqueue_destruction(cwq->thread); kthread_stop(cwq->thread); cwq->thread = NULL; } +void set_workqueue_thread_prio(struct workqueue_struct *wq, int cpu, + int policy, int rt_priority, int nice) +{ + struct sched_param param = { .sched_priority = rt_priority }; + struct cpu_workqueue_struct *cwq; + mm_segment_t oldfs = get_fs(); + struct task_struct *p; + unsigned long flags; + int ret; + + cwq = per_cpu_ptr(wq->cpu_wq, cpu); + spin_lock_irqsave(&cwq->lock, flags); + p = cwq->thread; + spin_unlock_irqrestore(&cwq->lock, flags); + + set_user_nice(p, nice); + + set_fs(KERNEL_DS); + ret = sys_sched_setscheduler(p->pid, policy, ¶m); + set_fs(oldfs); + + WARN_ON(ret); +} + +void set_workqueue_prio(struct workqueue_struct *wq, int policy, + int rt_priority, int nice) +{ + int cpu; + + /* We don't need the distraction of CPUs appearing and vanishing. */ + get_online_cpus(); + spin_lock(&workqueue_lock); + if (is_wq_single_threaded(wq)) + set_workqueue_thread_prio(wq, 0, policy, rt_priority, nice); + else { + for_each_online_cpu(cpu) + set_workqueue_thread_prio(wq, cpu, policy, + rt_priority, nice); + } + spin_unlock(&workqueue_lock); + put_online_cpus(); +} + /** * destroy_workqueue - safely terminate a workqueue * @wq: target workqueue @@ -1021,6 +1082,7 @@ void __init init_workqueues(void) hotcpu_notifier(workqueue_cpu_callback, 0); keventd_wq = create_workqueue("events"); BUG_ON(!keventd_wq); + set_workqueue_prio(keventd_wq, SCHED_FIFO, 1, -20); #ifdef CONFIG_SMP work_on_cpu_wq = create_workqueue("work_on_cpu"); BUG_ON(!work_on_cpu_wq); Index: linux-2.6-tip/lib/Kconfig =================================================================== --- linux-2.6-tip.orig/lib/Kconfig +++ linux-2.6-tip/lib/Kconfig @@ -136,12 +136,6 @@ config TEXTSEARCH_BM config TEXTSEARCH_FSM tristate -# -# plist support is select#ed if needed -# -config PLIST - boolean - config HAS_IOMEM boolean depends on !NO_IOMEM Index: linux-2.6-tip/lib/Kconfig.debug =================================================================== --- linux-2.6-tip.orig/lib/Kconfig.debug +++ linux-2.6-tip/lib/Kconfig.debug @@ -9,8 +9,20 @@ config PRINTK_TIME operations. This is useful for identifying long delays in kernel startup. +config ALLOW_WARNINGS + bool "Continue building despite compiler warnings" + default y + help + By disabling this option you will enable -Werror on building C + files. This causes all warnings to abort the compilation, just as + errors do. (It is generally not recommended to disable this option as + the overwhelming majority of warnings is harmless and also gcc puts + out false-positive warnings. It is useful for automated testing + though.) + config ENABLE_WARN_DEPRECATED bool "Enable __deprecated logic" + depends on ALLOW_WARNINGS default y help Enable the __deprecated logic in the kernel build. @@ -19,12 +31,13 @@ config ENABLE_WARN_DEPRECATED config ENABLE_MUST_CHECK bool "Enable __must_check logic" - default y + depends on ALLOW_WARNINGS help Enable the __must_check logic in the kernel build. Disable this to suppress the "warning: ignoring return value of 'foo', declared with attribute warn_unused_result" messages. + config FRAME_WARN int "Warn for stack frames larger than (needs gcc 4.4)" range 0 8192 @@ -95,7 +108,6 @@ config HEADERS_CHECK config DEBUG_SECTION_MISMATCH bool "Enable full Section mismatch analysis" - depends on UNDEFINED # This option is on purpose disabled for now. # It will be enabled when we are down to a resonable number # of section mismatch warnings (< 10 for an allyesconfig build) @@ -186,6 +198,44 @@ config BOOTPARAM_SOFTLOCKUP_PANIC_VALUE default 0 if !BOOTPARAM_SOFTLOCKUP_PANIC default 1 if BOOTPARAM_SOFTLOCKUP_PANIC +config DETECT_HUNG_TASK + bool "Detect Hung Tasks" + depends on DEBUG_KERNEL + default y + help + Say Y here to enable the kernel to detect "hung tasks", + which are bugs that cause the task to be stuck in + uninterruptible "D" state indefinitiley. + + When a hung task is detected, the kernel will print the + current stack trace (which you should report), but the + task will stay in uninterruptible state. If lockdep is + enabled then all held locks will also be reported. This + feature has negligible overhead. + +config BOOTPARAM_HUNG_TASK_PANIC + bool "Panic (Reboot) On Hung Tasks" + depends on DETECT_HUNG_TASK + help + Say Y here to enable the kernel to panic on "hung tasks", + which are bugs that cause the kernel to leave a task stuck + in uninterruptible "D" state. + + The panic can be used in combination with panic_timeout, + to cause the system to reboot automatically after a + hung task has been detected. This feature is useful for + high-availability systems that have uptime guarantees and + where a hung tasks must be resolved ASAP. + + Say N if unsure. + +config BOOTPARAM_HUNG_TASK_PANIC_VALUE + int + depends on DETECT_HUNG_TASK + range 0 1 + default 0 if !BOOTPARAM_HUNG_TASK_PANIC + default 1 if BOOTPARAM_HUNG_TASK_PANIC + config SCHED_DEBUG bool "Collect scheduler debugging info" depends on DEBUG_KERNEL && PROC_FS @@ -314,6 +364,8 @@ config DEBUG_RT_MUTEXES help This allows rt mutex semantics violations and rt mutex related deadlocks (lockups) to be detected and reported automatically. + When realtime preemption is enabled this includes spinlocks, + rwlocks, mutexes and (rw)semaphores config DEBUG_PI_LIST bool @@ -337,7 +389,7 @@ config DEBUG_SPINLOCK config DEBUG_MUTEXES bool "Mutex debugging: basic checks" - depends on DEBUG_KERNEL + depends on DEBUG_KERNEL && !PREEMPT_RT help This feature allows mutex semantics violations to be detected and reported. Index: linux-2.6-tip/lib/Makefile =================================================================== --- linux-2.6-tip.orig/lib/Makefile +++ linux-2.6-tip/lib/Makefile @@ -11,7 +11,8 @@ lib-y := ctype.o string.o vsprintf.o cmd rbtree.o radix-tree.o dump_stack.o \ idr.o int_sqrt.o extable.o prio_tree.o \ sha1.o irq_regs.o reciprocal_div.o argv_split.o \ - proportions.o prio_heap.o ratelimit.o show_mem.o is_single_threaded.o + proportions.o prio_heap.o ratelimit.o show_mem.o \ + is_single_threaded.o plist.o lib-$(CONFIG_MMU) += ioremap.o lib-$(CONFIG_SMP) += cpumask.o @@ -33,14 +34,14 @@ obj-$(CONFIG_HAS_IOMEM) += iomap_copy.o obj-$(CONFIG_CHECK_SIGNATURE) += check_signature.o obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o -lib-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o +obj-$(CONFIG_PREEMPT_RT) += plist.o +obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o lib-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o -obj-$(CONFIG_PLIST) += plist.o obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o obj-$(CONFIG_DEBUG_LIST) += list_debug.o obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o Index: linux-2.6-tip/lib/swiotlb.c =================================================================== --- linux-2.6-tip.orig/lib/swiotlb.c +++ linux-2.6-tip/lib/swiotlb.c @@ -145,7 +145,7 @@ static void *swiotlb_bus_to_virt(dma_add return phys_to_virt(swiotlb_bus_to_phys(address)); } -int __weak swiotlb_arch_range_needs_mapping(void *ptr, size_t size) +int __weak swiotlb_arch_range_needs_mapping(phys_addr_t paddr, size_t size) { return 0; } @@ -315,9 +315,9 @@ address_needs_mapping(struct device *hwd return !is_buffer_dma_capable(dma_get_mask(hwdev), addr, size); } -static inline int range_needs_mapping(void *ptr, size_t size) +static inline int range_needs_mapping(phys_addr_t paddr, size_t size) { - return swiotlb_force || swiotlb_arch_range_needs_mapping(ptr, size); + return swiotlb_force || swiotlb_arch_range_needs_mapping(paddr, size); } static int is_swiotlb_buffer(char *addr) @@ -636,11 +636,14 @@ swiotlb_full(struct device *dev, size_t * Once the device is given the dma address, the device owns this memory until * either swiotlb_unmap_single or swiotlb_dma_sync_single is performed. */ -dma_addr_t -swiotlb_map_single_attrs(struct device *hwdev, void *ptr, size_t size, - int dir, struct dma_attrs *attrs) -{ - dma_addr_t dev_addr = swiotlb_virt_to_bus(hwdev, ptr); +dma_addr_t swiotlb_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction dir, + struct dma_attrs *attrs) +{ + phys_addr_t phys = page_to_phys(page) + offset; + void *ptr = page_address(page) + offset; + dma_addr_t dev_addr = swiotlb_phys_to_bus(dev, phys); void *map; BUG_ON(dir == DMA_NONE); @@ -649,37 +652,30 @@ swiotlb_map_single_attrs(struct device * * we can safely return the device addr and not worry about bounce * buffering it. */ - if (!address_needs_mapping(hwdev, dev_addr, size) && - !range_needs_mapping(ptr, size)) + if (!address_needs_mapping(dev, dev_addr, size) && + !range_needs_mapping(virt_to_phys(ptr), size)) return dev_addr; /* * Oh well, have to allocate and map a bounce buffer. */ - map = map_single(hwdev, virt_to_phys(ptr), size, dir); + map = map_single(dev, phys, size, dir); if (!map) { - swiotlb_full(hwdev, size, dir, 1); + swiotlb_full(dev, size, dir, 1); map = io_tlb_overflow_buffer; } - dev_addr = swiotlb_virt_to_bus(hwdev, map); + dev_addr = swiotlb_virt_to_bus(dev, map); /* * Ensure that the address returned is DMA'ble */ - if (address_needs_mapping(hwdev, dev_addr, size)) + if (address_needs_mapping(dev, dev_addr, size)) panic("map_single: bounce buffer is not DMA'ble"); return dev_addr; } -EXPORT_SYMBOL(swiotlb_map_single_attrs); - -dma_addr_t -swiotlb_map_single(struct device *hwdev, void *ptr, size_t size, int dir) -{ - return swiotlb_map_single_attrs(hwdev, ptr, size, dir, NULL); -} -EXPORT_SYMBOL(swiotlb_map_single); +EXPORT_SYMBOL_GPL(swiotlb_map_page); /* * Unmap a single streaming mode DMA translation. The dma_addr and size must @@ -689,9 +685,9 @@ EXPORT_SYMBOL(swiotlb_map_single); * After this call, reads by the cpu to the buffer are guaranteed to see * whatever the device wrote there. */ -void -swiotlb_unmap_single_attrs(struct device *hwdev, dma_addr_t dev_addr, - size_t size, int dir, struct dma_attrs *attrs) +void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr, + size_t size, enum dma_data_direction dir, + struct dma_attrs *attrs) { char *dma_addr = swiotlb_bus_to_virt(dev_addr); @@ -701,15 +697,7 @@ swiotlb_unmap_single_attrs(struct device else if (dir == DMA_FROM_DEVICE) dma_mark_clean(dma_addr, size); } -EXPORT_SYMBOL(swiotlb_unmap_single_attrs); - -void -swiotlb_unmap_single(struct device *hwdev, dma_addr_t dev_addr, size_t size, - int dir) -{ - return swiotlb_unmap_single_attrs(hwdev, dev_addr, size, dir, NULL); -} -EXPORT_SYMBOL(swiotlb_unmap_single); +EXPORT_SYMBOL_GPL(swiotlb_unmap_page); /* * Make physical memory consistent for a single streaming mode DMA translation @@ -736,7 +724,7 @@ swiotlb_sync_single(struct device *hwdev void swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr, - size_t size, int dir) + size_t size, enum dma_data_direction dir) { swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_CPU); } @@ -744,7 +732,7 @@ EXPORT_SYMBOL(swiotlb_sync_single_for_cp void swiotlb_sync_single_for_device(struct device *hwdev, dma_addr_t dev_addr, - size_t size, int dir) + size_t size, enum dma_data_direction dir) { swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_DEVICE); } @@ -769,7 +757,8 @@ swiotlb_sync_single_range(struct device void swiotlb_sync_single_range_for_cpu(struct device *hwdev, dma_addr_t dev_addr, - unsigned long offset, size_t size, int dir) + unsigned long offset, size_t size, + enum dma_data_direction dir) { swiotlb_sync_single_range(hwdev, dev_addr, offset, size, dir, SYNC_FOR_CPU); @@ -778,7 +767,8 @@ EXPORT_SYMBOL_GPL(swiotlb_sync_single_ra void swiotlb_sync_single_range_for_device(struct device *hwdev, dma_addr_t dev_addr, - unsigned long offset, size_t size, int dir) + unsigned long offset, size_t size, + enum dma_data_direction dir) { swiotlb_sync_single_range(hwdev, dev_addr, offset, size, dir, SYNC_FOR_DEVICE); @@ -803,7 +793,7 @@ EXPORT_SYMBOL_GPL(swiotlb_sync_single_ra */ int swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl, int nelems, - int dir, struct dma_attrs *attrs) + enum dma_data_direction dir, struct dma_attrs *attrs) { struct scatterlist *sg; int i; @@ -811,10 +801,10 @@ swiotlb_map_sg_attrs(struct device *hwde BUG_ON(dir == DMA_NONE); for_each_sg(sgl, sg, nelems, i) { - void *addr = sg_virt(sg); - dma_addr_t dev_addr = swiotlb_virt_to_bus(hwdev, addr); + phys_addr_t paddr = sg_phys(sg); + dma_addr_t dev_addr = swiotlb_phys_to_bus(hwdev, paddr); - if (range_needs_mapping(addr, sg->length) || + if (range_needs_mapping(paddr, sg->length) || address_needs_mapping(hwdev, dev_addr, sg->length)) { void *map = map_single(hwdev, sg_phys(sg), sg->length, dir); @@ -850,7 +840,7 @@ EXPORT_SYMBOL(swiotlb_map_sg); */ void swiotlb_unmap_sg_attrs(struct device *hwdev, struct scatterlist *sgl, - int nelems, int dir, struct dma_attrs *attrs) + int nelems, enum dma_data_direction dir, struct dma_attrs *attrs) { struct scatterlist *sg; int i; @@ -858,11 +848,11 @@ swiotlb_unmap_sg_attrs(struct device *hw BUG_ON(dir == DMA_NONE); for_each_sg(sgl, sg, nelems, i) { - if (sg->dma_address != swiotlb_virt_to_bus(hwdev, sg_virt(sg))) + if (sg->dma_address != swiotlb_phys_to_bus(hwdev, sg_phys(sg))) unmap_single(hwdev, swiotlb_bus_to_virt(sg->dma_address), sg->dma_length, dir); else if (dir == DMA_FROM_DEVICE) - dma_mark_clean(sg_virt(sg), sg->dma_length); + dma_mark_clean(swiotlb_bus_to_virt(sg->dma_address), sg->dma_length); } } EXPORT_SYMBOL(swiotlb_unmap_sg_attrs); @@ -892,17 +882,17 @@ swiotlb_sync_sg(struct device *hwdev, st BUG_ON(dir == DMA_NONE); for_each_sg(sgl, sg, nelems, i) { - if (sg->dma_address != swiotlb_virt_to_bus(hwdev, sg_virt(sg))) + if (sg->dma_address != swiotlb_phys_to_bus(hwdev, sg_phys(sg))) sync_single(hwdev, swiotlb_bus_to_virt(sg->dma_address), sg->dma_length, dir, target); else if (dir == DMA_FROM_DEVICE) - dma_mark_clean(sg_virt(sg), sg->dma_length); + dma_mark_clean(swiotlb_bus_to_virt(sg->dma_address), sg->dma_length); } } void swiotlb_sync_sg_for_cpu(struct device *hwdev, struct scatterlist *sg, - int nelems, int dir) + int nelems, enum dma_data_direction dir) { swiotlb_sync_sg(hwdev, sg, nelems, dir, SYNC_FOR_CPU); } @@ -910,7 +900,7 @@ EXPORT_SYMBOL(swiotlb_sync_sg_for_cpu); void swiotlb_sync_sg_for_device(struct device *hwdev, struct scatterlist *sg, - int nelems, int dir) + int nelems, enum dma_data_direction dir) { swiotlb_sync_sg(hwdev, sg, nelems, dir, SYNC_FOR_DEVICE); } Index: linux-2.6-tip/localversion-tip =================================================================== --- /dev/null +++ linux-2.6-tip/localversion-tip @@ -0,0 +1 @@ +-tip Index: linux-2.6-tip/mm/Makefile =================================================================== --- linux-2.6-tip.orig/mm/Makefile +++ linux-2.6-tip/mm/Makefile @@ -26,6 +26,7 @@ obj-$(CONFIG_SLOB) += slob.o obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o obj-$(CONFIG_SLAB) += slab.o obj-$(CONFIG_SLUB) += slub.o +obj-$(CONFIG_KMEMCHECK) += kmemcheck.o obj-$(CONFIG_FAILSLAB) += failslab.o obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o obj-$(CONFIG_FS_XIP) += filemap_xip.o Index: linux-2.6-tip/mm/bootmem.c =================================================================== --- linux-2.6-tip.orig/mm/bootmem.c +++ linux-2.6-tip/mm/bootmem.c @@ -318,6 +318,8 @@ static int __init mark_bootmem(unsigned pos = bdata->node_low_pfn; } BUG(); + + return 0; } /** Index: linux-2.6-tip/mm/kmemcheck.c =================================================================== --- /dev/null +++ linux-2.6-tip/mm/kmemcheck.c @@ -0,0 +1,122 @@ +#include +#include +#include +#include +#include + +void kmemcheck_alloc_shadow(struct page *page, int order, gfp_t flags, int node) +{ + struct page *shadow; + int pages; + int i; + + pages = 1 << order; + + /* + * With kmemcheck enabled, we need to allocate a memory area for the + * shadow bits as well. + */ + shadow = alloc_pages_node(node, flags | __GFP_NOTRACK, order); + if (!shadow) { + if (printk_ratelimit()) + printk(KERN_ERR "kmemcheck: failed to allocate " + "shadow bitmap\n"); + return; + } + + for(i = 0; i < pages; ++i) + page[i].shadow = page_address(&shadow[i]); + + /* + * Mark it as non-present for the MMU so that our accesses to + * this memory will trigger a page fault and let us analyze + * the memory accesses. + */ + kmemcheck_hide_pages(page, pages); +} + +void kmemcheck_free_shadow(struct page *page, int order) +{ + struct page *shadow; + int pages; + int i; + + if (!kmemcheck_page_is_tracked(page)) + return; + + pages = 1 << order; + + kmemcheck_show_pages(page, pages); + + shadow = virt_to_page(page[0].shadow); + + for(i = 0; i < pages; ++i) + page[i].shadow = NULL; + + __free_pages(shadow, order); +} + +void kmemcheck_slab_alloc(struct kmem_cache *s, gfp_t gfpflags, void *object, + size_t size) +{ + /* + * Has already been memset(), which initializes the shadow for us + * as well. + */ + if (gfpflags & __GFP_ZERO) + return; + + /* No need to initialize the shadow of a non-tracked slab. */ + if (s->flags & SLAB_NOTRACK) + return; + + if (!kmemcheck_enabled || gfpflags & __GFP_NOTRACK) { + /* + * Allow notracked objects to be allocated from + * tracked caches. Note however that these objects + * will still get page faults on access, they just + * won't ever be flagged as uninitialized. If page + * faults are not acceptable, the slab cache itself + * should be marked NOTRACK. + */ + kmemcheck_mark_initialized(object, size); + } else if (!s->ctor) { + /* + * New objects should be marked uninitialized before + * they're returned to the called. + */ + kmemcheck_mark_uninitialized(object, size); + } +} + +void kmemcheck_slab_free(struct kmem_cache *s, void *object, size_t size) +{ + /* TODO: RCU freeing is unsupported for now; hide false positives. */ + if (!s->ctor && !(s->flags & SLAB_DESTROY_BY_RCU)) + kmemcheck_mark_freed(object, size); +} + +void kmemcheck_pagealloc_alloc(struct page *page, unsigned int order, + gfp_t gfpflags) +{ + int pages; + + if (gfpflags & (__GFP_HIGHMEM | __GFP_NOTRACK)) + return; + + pages = 1 << order; + + /* + * NOTE: We choose to track GFP_ZERO pages too; in fact, they + * can become uninitialized by copying uninitialized memory + * into them. + */ + + /* XXX: Can use zone->node for node? */ + kmemcheck_alloc_shadow(page, order, gfpflags, -1); + + if (gfpflags & __GFP_ZERO) + kmemcheck_mark_initialized_pages(page, pages); + else + kmemcheck_mark_uninitialized_pages(page, pages); +} Index: linux-2.6-tip/mm/mempolicy.c =================================================================== --- linux-2.6-tip.orig/mm/mempolicy.c +++ linux-2.6-tip/mm/mempolicy.c @@ -1421,7 +1421,7 @@ unsigned slab_node(struct mempolicy *pol } default: - BUG(); + panic("slab_node: bad policy mode!"); } } Index: linux-2.6-tip/mm/page_alloc.c =================================================================== --- linux-2.6-tip.orig/mm/page_alloc.c +++ linux-2.6-tip/mm/page_alloc.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -162,6 +163,53 @@ static unsigned long __meminitdata dma_r EXPORT_SYMBOL(movable_zone); #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */ +#ifdef CONFIG_PREEMPT_RT +static DEFINE_PER_CPU_LOCKED(int, pcp_locks); +#endif + +static inline void __lock_cpu_pcp(unsigned long *flags, int cpu) +{ +#ifdef CONFIG_PREEMPT_RT + spin_lock(&__get_cpu_lock(pcp_locks, cpu)); + flags = 0; +#else + local_irq_save(*flags); +#endif +} + +static inline void lock_cpu_pcp(unsigned long *flags, int *this_cpu) +{ +#ifdef CONFIG_PREEMPT_RT + (void)get_cpu_var_locked(pcp_locks, this_cpu); + flags = 0; +#else + local_irq_save(*flags); + *this_cpu = smp_processor_id(); +#endif +} + +static inline void unlock_cpu_pcp(unsigned long flags, int this_cpu) +{ +#ifdef CONFIG_PREEMPT_RT + put_cpu_var_locked(pcp_locks, this_cpu); +#else + local_irq_restore(flags); +#endif +} + +static struct per_cpu_pageset * +get_zone_pcp(struct zone *zone, unsigned long *flags, int *this_cpu) +{ + lock_cpu_pcp(flags, this_cpu); + return zone_pcp(zone, *this_cpu); +} + +static void +put_zone_pcp(struct zone *zone, unsigned long flags, int this_cpu) +{ + unlock_cpu_pcp(flags, this_cpu); +} + #if MAX_NUMNODES > 1 int nr_node_ids __read_mostly = MAX_NUMNODES; EXPORT_SYMBOL(nr_node_ids); @@ -546,8 +594,9 @@ static void free_one_page(struct zone *z static void __free_pages_ok(struct page *page, unsigned int order) { unsigned long flags; - int i; - int bad = 0; + int i, this_cpu, bad = 0; + + kmemcheck_free_shadow(page, order); for (i = 0 ; i < (1 << order) ; ++i) bad += free_pages_check(page + i); @@ -562,10 +611,10 @@ static void __free_pages_ok(struct page arch_free_page(page, order); kernel_map_pages(page, 1 << order, 0); - local_irq_save(flags); - __count_vm_events(PGFREE, 1 << order); + lock_cpu_pcp(&flags, &this_cpu); + count_vm_events(PGFREE, 1 << order); free_one_page(page_zone(page), page, order); - local_irq_restore(flags); + unlock_cpu_pcp(flags, this_cpu); } /* @@ -898,15 +947,16 @@ void drain_zone_pages(struct zone *zone, { unsigned long flags; int to_drain; + int this_cpu; - local_irq_save(flags); + lock_cpu_pcp(&flags, &this_cpu); if (pcp->count >= pcp->batch) to_drain = pcp->batch; else to_drain = pcp->count; free_pages_bulk(zone, to_drain, &pcp->list, 0); pcp->count -= to_drain; - local_irq_restore(flags); + unlock_cpu_pcp(flags, this_cpu); } #endif @@ -930,12 +980,15 @@ static void drain_pages(unsigned int cpu continue; pset = zone_pcp(zone, cpu); - + if (!pset) { + WARN_ON(1); + continue; + } pcp = &pset->pcp; - local_irq_save(flags); + lock_cpu_pcp(&flags, &cpu); free_pages_bulk(zone, pcp->count, &pcp->list, 0); pcp->count = 0; - local_irq_restore(flags); + unlock_cpu_pcp(flags, cpu); } } @@ -947,12 +1000,52 @@ void drain_local_pages(void *arg) drain_pages(smp_processor_id()); } +#ifdef CONFIG_PREEMPT_RT +static void drain_local_pages_work(struct work_struct *wrk) +{ + drain_pages(smp_processor_id()); +} +#endif + /* * Spill all the per-cpu pages from all CPUs back into the buddy allocator */ void drain_all_pages(void) { +#ifdef CONFIG_PREEMPT_RT + /* + * HACK!!!!! + * For RT we can't use IPIs to run drain_local_pages, since + * that code will call spin_locks that will now sleep. + * But, schedule_on_each_cpu will call kzalloc, which will + * call page_alloc which was what calls this. + * + * Luckily, there's a condition to get here, and that is if + * the order passed in to alloc_pages is greater than 0 + * (alloced more than a page size). The slabs only allocate + * what is needed, and the allocation made by schedule_on_each_cpu + * does an alloc of "sizeof(void *)*nr_cpu_ids". + * + * So we can safely call schedule_on_each_cpu if that number + * is less than a page. Otherwise don't bother. At least warn of + * this issue. + * + * And yes, this is one big hack. Please fix ;-) + */ + if (sizeof(void *)*nr_cpu_ids < PAGE_SIZE) + schedule_on_each_cpu(drain_local_pages_work); + else { + static int once; + if (!once) { + printk(KERN_ERR "Can't drain all CPUS due to possible recursion\n"); + once = 1; + } + drain_local_pages(NULL); + } + +#else on_each_cpu(drain_local_pages, NULL, 1); +#endif } #ifdef CONFIG_HIBERNATION @@ -997,8 +1090,12 @@ void mark_free_pages(struct zone *zone) static void free_hot_cold_page(struct page *page, int cold) { struct zone *zone = page_zone(page); + struct per_cpu_pageset *pset; struct per_cpu_pages *pcp; unsigned long flags; + int this_cpu; + + kmemcheck_free_shadow(page, 0); if (PageAnon(page)) page->mapping = NULL; @@ -1012,9 +1109,11 @@ static void free_hot_cold_page(struct pa arch_free_page(page, 0); kernel_map_pages(page, 1, 0); - pcp = &zone_pcp(zone, get_cpu())->pcp; - local_irq_save(flags); - __count_vm_event(PGFREE); + pset = get_zone_pcp(zone, &flags, &this_cpu); + pcp = &pset->pcp; + + count_vm_event(PGFREE); + if (cold) list_add_tail(&page->lru, &pcp->list); else @@ -1025,8 +1124,7 @@ static void free_hot_cold_page(struct pa free_pages_bulk(zone, pcp->batch, &pcp->list, 0); pcp->count -= pcp->batch; } - local_irq_restore(flags); - put_cpu(); + put_zone_pcp(zone, flags, this_cpu); } void free_hot_page(struct page *page) @@ -1068,16 +1166,15 @@ static struct page *buffered_rmqueue(str unsigned long flags; struct page *page; int cold = !!(gfp_flags & __GFP_COLD); - int cpu; + struct per_cpu_pageset *pset; int migratetype = allocflags_to_migratetype(gfp_flags); + int this_cpu; again: - cpu = get_cpu(); + pset = get_zone_pcp(zone, &flags, &this_cpu); if (likely(order == 0)) { - struct per_cpu_pages *pcp; + struct per_cpu_pages *pcp = &pset->pcp; - pcp = &zone_pcp(zone, cpu)->pcp; - local_irq_save(flags); if (!pcp->count) { pcp->count = rmqueue_bulk(zone, 0, pcp->batch, &pcp->list, migratetype); @@ -1106,7 +1203,7 @@ again: list_del(&page->lru); pcp->count--; } else { - spin_lock_irqsave(&zone->lock, flags); + spin_lock(&zone->lock); page = __rmqueue(zone, order, migratetype); spin_unlock(&zone->lock); if (!page) @@ -1115,8 +1212,7 @@ again: __count_zone_vm_events(PGALLOC, zone, 1 << order); zone_statistics(preferred_zone, zone); - local_irq_restore(flags); - put_cpu(); + put_zone_pcp(zone, flags, this_cpu); VM_BUG_ON(bad_range(zone, page)); if (prep_new_page(page, order, gfp_flags)) @@ -1124,8 +1220,7 @@ again: return page; failed: - local_irq_restore(flags); - put_cpu(); + put_zone_pcp(zone, flags, this_cpu); return NULL; } @@ -1479,6 +1574,8 @@ __alloc_pages_internal(gfp_t gfp_mask, u unsigned long did_some_progress; unsigned long pages_reclaimed = 0; + lockdep_trace_alloc(gfp_mask); + might_sleep_if(wait); if (should_fail_alloc_page(gfp_mask, order)) @@ -1578,12 +1675,15 @@ nofail_alloc: */ cpuset_update_task_memory_state(); p->flags |= PF_MEMALLOC; + + lockdep_set_current_reclaim_state(gfp_mask); reclaim_state.reclaimed_slab = 0; p->reclaim_state = &reclaim_state; did_some_progress = try_to_free_pages(zonelist, order, gfp_mask); p->reclaim_state = NULL; + lockdep_clear_current_reclaim_state(); p->flags &= ~PF_MEMALLOC; cond_resched(); @@ -1667,7 +1767,10 @@ nopage: dump_stack(); show_mem(); } + return page; got_pg: + if (kmemcheck_enabled) + kmemcheck_pagealloc_alloc(page, order, gfp_mask); return page; } EXPORT_SYMBOL(__alloc_pages_internal); Index: linux-2.6-tip/mm/slab.c =================================================================== --- linux-2.6-tip.orig/mm/slab.c +++ linux-2.6-tip/mm/slab.c @@ -102,6 +102,7 @@ #include #include #include +#include #include #include #include @@ -112,12 +113,70 @@ #include #include #include +#include #include #include #include /* + * On !PREEMPT_RT, raw irq flags are used as a per-CPU locking + * mechanism. + * + * On PREEMPT_RT, we use per-CPU locks for this. That's why the + * calling convention is changed slightly: a new 'flags' argument + * is passed to 'irq disable/enable' - the PREEMPT_RT code stores + * the CPU number of the lock there. + */ +#ifndef CONFIG_PREEMPT_RT +# define slab_irq_disable(cpu) \ + do { local_irq_disable(); (cpu) = smp_processor_id(); } while (0) +# define slab_irq_enable(cpu) local_irq_enable() +# define slab_irq_save(flags, cpu) \ + do { local_irq_save(flags); (cpu) = smp_processor_id(); } while (0) +# define slab_irq_restore(flags, cpu) local_irq_restore(flags) +/* + * In the __GFP_WAIT case we enable/disable interrupts on !PREEMPT_RT, + * which has no per-CPU locking effect since we are holding the cache + * lock in that case already. + * + * (On PREEMPT_RT, these are NOPs, but we have to drop/get the irq locks.) + */ +# define slab_irq_disable_nort(cpu) slab_irq_disable(cpu) +# define slab_irq_enable_nort(cpu) slab_irq_enable(cpu) +# define slab_irq_disable_rt(flags) do { (void)(flags); } while (0) +# define slab_irq_enable_rt(flags) do { (void)(flags); } while (0) +# define slab_spin_lock_irq(lock, cpu) \ + do { spin_lock_irq(lock); (cpu) = smp_processor_id(); } while (0) +# define slab_spin_unlock_irq(lock, cpu) \ + spin_unlock_irq(lock) +# define slab_spin_lock_irqsave(lock, flags, cpu) \ + do { spin_lock_irqsave(lock, flags); (cpu) = smp_processor_id(); } while (0) +# define slab_spin_unlock_irqrestore(lock, flags, cpu) \ + do { spin_unlock_irqrestore(lock, flags); } while (0) +#else +DEFINE_PER_CPU_LOCKED(int, slab_irq_locks) = { 0, }; +# define slab_irq_disable(cpu) (void)get_cpu_var_locked(slab_irq_locks, &(cpu)) +# define slab_irq_enable(cpu) put_cpu_var_locked(slab_irq_locks, cpu) +# define slab_irq_save(flags, cpu) \ + do { slab_irq_disable(cpu); (void) (flags); } while (0) +# define slab_irq_restore(flags, cpu) \ + do { slab_irq_enable(cpu); (void) (flags); } while (0) +# define slab_irq_disable_rt(cpu) slab_irq_disable(cpu) +# define slab_irq_enable_rt(cpu) slab_irq_enable(cpu) +# define slab_irq_disable_nort(cpu) do { } while (0) +# define slab_irq_enable_nort(cpu) do { } while (0) +# define slab_spin_lock_irq(lock, cpu) \ + do { slab_irq_disable(cpu); spin_lock(lock); } while (0) +# define slab_spin_unlock_irq(lock, cpu) \ + do { spin_unlock(lock); slab_irq_enable(cpu); } while (0) +# define slab_spin_lock_irqsave(lock, flags, cpu) \ + do { slab_irq_disable(cpu); spin_lock_irqsave(lock, flags); } while (0) +# define slab_spin_unlock_irqrestore(lock, flags, cpu) \ + do { spin_unlock_irqrestore(lock, flags); slab_irq_enable(cpu); } while (0) +#endif + +/* * DEBUG - 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON. * 0 for faster, smaller code (especially in the critical paths). * @@ -177,13 +236,13 @@ SLAB_STORE_USER | \ SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \ SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD | \ - SLAB_DEBUG_OBJECTS) + SLAB_DEBUG_OBJECTS | SLAB_NOTRACK) #else # define CREATE_MASK (SLAB_HWCACHE_ALIGN | \ SLAB_CACHE_DMA | \ SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \ SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD | \ - SLAB_DEBUG_OBJECTS) + SLAB_DEBUG_OBJECTS | SLAB_NOTRACK) #endif /* @@ -313,7 +372,7 @@ struct kmem_list3 __initdata initkmem_li static int drain_freelist(struct kmem_cache *cache, struct kmem_list3 *l3, int tofree); static void free_block(struct kmem_cache *cachep, void **objpp, int len, - int node); + int node, int *this_cpu); static int enable_cpucache(struct kmem_cache *cachep); static void cache_reap(struct work_struct *unused); @@ -372,87 +431,6 @@ static void kmem_list3_init(struct kmem_ MAKE_LIST((cachep), (&(ptr)->slabs_free), slabs_free, nodeid); \ } while (0) -/* - * struct kmem_cache - * - * manages a cache. - */ - -struct kmem_cache { -/* 1) per-cpu data, touched during every alloc/free */ - struct array_cache *array[NR_CPUS]; -/* 2) Cache tunables. Protected by cache_chain_mutex */ - unsigned int batchcount; - unsigned int limit; - unsigned int shared; - - unsigned int buffer_size; - u32 reciprocal_buffer_size; -/* 3) touched by every alloc & free from the backend */ - - unsigned int flags; /* constant flags */ - unsigned int num; /* # of objs per slab */ - -/* 4) cache_grow/shrink */ - /* order of pgs per slab (2^n) */ - unsigned int gfporder; - - /* force GFP flags, e.g. GFP_DMA */ - gfp_t gfpflags; - - size_t colour; /* cache colouring range */ - unsigned int colour_off; /* colour offset */ - struct kmem_cache *slabp_cache; - unsigned int slab_size; - unsigned int dflags; /* dynamic flags */ - - /* constructor func */ - void (*ctor)(void *obj); - -/* 5) cache creation/removal */ - const char *name; - struct list_head next; - -/* 6) statistics */ -#if STATS - unsigned long num_active; - unsigned long num_allocations; - unsigned long high_mark; - unsigned long grown; - unsigned long reaped; - unsigned long errors; - unsigned long max_freeable; - unsigned long node_allocs; - unsigned long node_frees; - unsigned long node_overflow; - atomic_t allochit; - atomic_t allocmiss; - atomic_t freehit; - atomic_t freemiss; -#endif -#if DEBUG - /* - * If debugging is enabled, then the allocator can add additional - * fields and/or padding to every object. buffer_size contains the total - * object size including these internal fields, the following two - * variables contain the offset to the user object and its size. - */ - int obj_offset; - int obj_size; -#endif - /* - * We put nodelists[] at the end of kmem_cache, because we want to size - * this array to nr_node_ids slots instead of MAX_NUMNODES - * (see kmem_cache_init()) - * We still use [MAX_NUMNODES] and not [1] or [0] because cache_cache - * is statically defined, so we reserve the max number of nodes. - */ - struct kmem_list3 *nodelists[MAX_NUMNODES]; - /* - * Do not add fields after nodelists[] - */ -}; - #define CFLGS_OFF_SLAB (0x80000000UL) #define OFF_SLAB(x) ((x)->flags & CFLGS_OFF_SLAB) @@ -568,6 +546,14 @@ static void **dbg_userword(struct kmem_c #endif +#ifdef CONFIG_KMEMTRACE +size_t slab_buffer_size(struct kmem_cache *cachep) +{ + return cachep->buffer_size; +} +EXPORT_SYMBOL(slab_buffer_size); +#endif + /* * Do not go above this order unless 0 objects fit into the slab. */ @@ -756,9 +742,10 @@ int slab_is_available(void) static DEFINE_PER_CPU(struct delayed_work, reap_work); -static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep) +static inline struct array_cache * +cpu_cache_get(struct kmem_cache *cachep, int this_cpu) { - return cachep->array[smp_processor_id()]; + return cachep->array[this_cpu]; } static inline struct kmem_cache *__find_general_cachep(size_t size, @@ -992,7 +979,7 @@ static int transfer_objects(struct array #ifndef CONFIG_NUMA #define drain_alien_cache(cachep, alien) do { } while (0) -#define reap_alien(cachep, l3) do { } while (0) +#define reap_alien(cachep, l3, this_cpu) 0 static inline struct array_cache **alloc_alien_cache(int node, int limit) { @@ -1003,27 +990,29 @@ static inline void free_alien_cache(stru { } -static inline int cache_free_alien(struct kmem_cache *cachep, void *objp) +static inline int +cache_free_alien(struct kmem_cache *cachep, void *objp, int *this_cpu) { return 0; } static inline void *alternate_node_alloc(struct kmem_cache *cachep, - gfp_t flags) + gfp_t flags, int *this_cpu) { return NULL; } static inline void *____cache_alloc_node(struct kmem_cache *cachep, - gfp_t flags, int nodeid) + gfp_t flags, int nodeid, int *this_cpu) { return NULL; } #else /* CONFIG_NUMA */ -static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int); -static void *alternate_node_alloc(struct kmem_cache *, gfp_t); +static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, + int nodeid, int *this_cpu); +static void *alternate_node_alloc(struct kmem_cache *, gfp_t, int *); static struct array_cache **alloc_alien_cache(int node, int limit) { @@ -1064,7 +1053,8 @@ static void free_alien_cache(struct arra } static void __drain_alien_cache(struct kmem_cache *cachep, - struct array_cache *ac, int node) + struct array_cache *ac, int node, + int *this_cpu) { struct kmem_list3 *rl3 = cachep->nodelists[node]; @@ -1078,7 +1068,7 @@ static void __drain_alien_cache(struct k if (rl3->shared) transfer_objects(rl3->shared, ac, ac->limit); - free_block(cachep, ac->entry, ac->avail, node); + free_block(cachep, ac->entry, ac->avail, node, this_cpu); ac->avail = 0; spin_unlock(&rl3->list_lock); } @@ -1087,38 +1077,42 @@ static void __drain_alien_cache(struct k /* * Called from cache_reap() to regularly drain alien caches round robin. */ -static void reap_alien(struct kmem_cache *cachep, struct kmem_list3 *l3) +static int +reap_alien(struct kmem_cache *cachep, struct kmem_list3 *l3, int *this_cpu) { - int node = __get_cpu_var(reap_node); + int node = per_cpu(reap_node, *this_cpu); if (l3->alien) { struct array_cache *ac = l3->alien[node]; if (ac && ac->avail && spin_trylock_irq(&ac->lock)) { - __drain_alien_cache(cachep, ac, node); + __drain_alien_cache(cachep, ac, node, this_cpu); spin_unlock_irq(&ac->lock); + return 1; } } + return 0; } static void drain_alien_cache(struct kmem_cache *cachep, struct array_cache **alien) { - int i = 0; + int i = 0, this_cpu; struct array_cache *ac; unsigned long flags; for_each_online_node(i) { ac = alien[i]; if (ac) { - spin_lock_irqsave(&ac->lock, flags); - __drain_alien_cache(cachep, ac, i); - spin_unlock_irqrestore(&ac->lock, flags); + slab_spin_lock_irqsave(&ac->lock, flags, this_cpu); + __drain_alien_cache(cachep, ac, i, &this_cpu); + slab_spin_unlock_irqrestore(&ac->lock, flags, this_cpu); } } } -static inline int cache_free_alien(struct kmem_cache *cachep, void *objp) +static inline int +cache_free_alien(struct kmem_cache *cachep, void *objp, int *this_cpu) { struct slab *slabp = virt_to_slab(objp); int nodeid = slabp->nodeid; @@ -1126,7 +1120,7 @@ static inline int cache_free_alien(struc struct array_cache *alien = NULL; int node; - node = numa_node_id(); + node = cpu_to_node(*this_cpu); /* * Make sure we are not freeing a object from another node to the array @@ -1142,13 +1136,13 @@ static inline int cache_free_alien(struc spin_lock(&alien->lock); if (unlikely(alien->avail == alien->limit)) { STATS_INC_ACOVERFLOW(cachep); - __drain_alien_cache(cachep, alien, nodeid); + __drain_alien_cache(cachep, alien, nodeid, this_cpu); } alien->entry[alien->avail++] = objp; spin_unlock(&alien->lock); } else { spin_lock(&(cachep->nodelists[nodeid])->list_lock); - free_block(cachep, &objp, 1, nodeid); + free_block(cachep, &objp, 1, nodeid, this_cpu); spin_unlock(&(cachep->nodelists[nodeid])->list_lock); } return 1; @@ -1166,6 +1160,7 @@ static void __cpuinit cpuup_canceled(lon struct array_cache *nc; struct array_cache *shared; struct array_cache **alien; + int this_cpu; /* cpu is dead; no one can alloc from it. */ nc = cachep->array[cpu]; @@ -1175,29 +1170,31 @@ static void __cpuinit cpuup_canceled(lon if (!l3) goto free_array_cache; - spin_lock_irq(&l3->list_lock); + slab_spin_lock_irq(&l3->list_lock, this_cpu); /* Free limit for this kmem_list3 */ l3->free_limit -= cachep->batchcount; if (nc) - free_block(cachep, nc->entry, nc->avail, node); + free_block(cachep, nc->entry, nc->avail, node, + &this_cpu); if (!cpus_empty(*mask)) { - spin_unlock_irq(&l3->list_lock); + slab_spin_unlock_irq(&l3->list_lock, + this_cpu); goto free_array_cache; } shared = l3->shared; if (shared) { free_block(cachep, shared->entry, - shared->avail, node); + shared->avail, node, &this_cpu); l3->shared = NULL; } alien = l3->alien; l3->alien = NULL; - spin_unlock_irq(&l3->list_lock); + slab_spin_unlock_irq(&l3->list_lock, this_cpu); kfree(shared); if (alien) { @@ -1226,6 +1223,7 @@ static int __cpuinit cpuup_prepare(long struct kmem_list3 *l3 = NULL; int node = cpu_to_node(cpu); const int memsize = sizeof(struct kmem_list3); + int this_cpu; /* * We need to do this right in the beginning since @@ -1256,11 +1254,11 @@ static int __cpuinit cpuup_prepare(long cachep->nodelists[node] = l3; } - spin_lock_irq(&cachep->nodelists[node]->list_lock); + slab_spin_lock_irq(&cachep->nodelists[node]->list_lock, this_cpu); cachep->nodelists[node]->free_limit = (1 + nr_cpus_node(node)) * cachep->batchcount + cachep->num; - spin_unlock_irq(&cachep->nodelists[node]->list_lock); + slab_spin_unlock_irq(&cachep->nodelists[node]->list_lock, this_cpu); } /* @@ -1297,7 +1295,7 @@ static int __cpuinit cpuup_prepare(long l3 = cachep->nodelists[node]; BUG_ON(!l3); - spin_lock_irq(&l3->list_lock); + slab_spin_lock_irq(&l3->list_lock, this_cpu); if (!l3->shared) { /* * We are serialised from CPU_DEAD or @@ -1312,7 +1310,7 @@ static int __cpuinit cpuup_prepare(long alien = NULL; } #endif - spin_unlock_irq(&l3->list_lock); + slab_spin_unlock_irq(&l3->list_lock, this_cpu); kfree(shared); free_alien_cache(alien); } @@ -1389,11 +1387,13 @@ static void init_list(struct kmem_cache int nodeid) { struct kmem_list3 *ptr; + int this_cpu; ptr = kmalloc_node(sizeof(struct kmem_list3), GFP_KERNEL, nodeid); BUG_ON(!ptr); - local_irq_disable(); + WARN_ON(spin_is_locked(&list->list_lock)); + slab_irq_disable(this_cpu); memcpy(ptr, list, sizeof(struct kmem_list3)); /* * Do not assume that spinlocks can be initialized via memcpy: @@ -1402,7 +1402,7 @@ static void init_list(struct kmem_cache MAKE_ALL_LISTS(cachep, ptr, nodeid); cachep->nodelists[nodeid] = ptr; - local_irq_enable(); + slab_irq_enable(this_cpu); } /* @@ -1565,36 +1565,34 @@ void __init kmem_cache_init(void) /* 4) Replace the bootstrap head arrays */ { struct array_cache *ptr; + int this_cpu; ptr = kmalloc(sizeof(struct arraycache_init), GFP_KERNEL); - local_irq_disable(); - BUG_ON(cpu_cache_get(&cache_cache) != &initarray_cache.cache); - memcpy(ptr, cpu_cache_get(&cache_cache), - sizeof(struct arraycache_init)); + slab_irq_disable(this_cpu); + BUG_ON(cpu_cache_get(&cache_cache, this_cpu) != &initarray_cache.cache); + memcpy(ptr, cpu_cache_get(&cache_cache, this_cpu), + sizeof(struct arraycache_init)); /* * Do not assume that spinlocks can be initialized via memcpy: */ spin_lock_init(&ptr->lock); - - cache_cache.array[smp_processor_id()] = ptr; - local_irq_enable(); + cache_cache.array[this_cpu] = ptr; + slab_irq_enable(this_cpu); ptr = kmalloc(sizeof(struct arraycache_init), GFP_KERNEL); - local_irq_disable(); - BUG_ON(cpu_cache_get(malloc_sizes[INDEX_AC].cs_cachep) - != &initarray_generic.cache); - memcpy(ptr, cpu_cache_get(malloc_sizes[INDEX_AC].cs_cachep), - sizeof(struct arraycache_init)); + slab_irq_disable(this_cpu); + BUG_ON(cpu_cache_get(malloc_sizes[INDEX_AC].cs_cachep, this_cpu) + != &initarray_generic.cache); + memcpy(ptr, cpu_cache_get(malloc_sizes[INDEX_AC].cs_cachep, this_cpu), + sizeof(struct arraycache_init)); /* * Do not assume that spinlocks can be initialized via memcpy: */ spin_lock_init(&ptr->lock); - - malloc_sizes[INDEX_AC].cs_cachep->array[smp_processor_id()] = - ptr; - local_irq_enable(); + malloc_sizes[INDEX_AC].cs_cachep->array[this_cpu] = ptr; + slab_irq_enable(this_cpu); } /* 5) Replace the bootstrap kmem_list3's */ { @@ -1680,7 +1678,7 @@ static void *kmem_getpages(struct kmem_c if (cachep->flags & SLAB_RECLAIM_ACCOUNT) flags |= __GFP_RECLAIMABLE; - page = alloc_pages_node(nodeid, flags, cachep->gfporder); + page = alloc_pages_node(nodeid, flags & ~__GFP_NOTRACK, cachep->gfporder); if (!page) return NULL; @@ -1693,6 +1691,16 @@ static void *kmem_getpages(struct kmem_c NR_SLAB_UNRECLAIMABLE, nr_pages); for (i = 0; i < nr_pages; i++) __SetPageSlab(page + i); + + if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) { + kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid); + + if (cachep->ctor) + kmemcheck_mark_uninitialized_pages(page, nr_pages); + else + kmemcheck_mark_unallocated_pages(page, nr_pages); + } + return page_address(page); } @@ -1705,6 +1713,8 @@ static void kmem_freepages(struct kmem_c struct page *page = virt_to_page(addr); const unsigned long nr_freed = i; + kmemcheck_free_shadow(page, cachep->gfporder); + if (cachep->flags & SLAB_RECLAIM_ACCOUNT) sub_zone_page_state(page_zone(page), NR_SLAB_RECLAIMABLE, nr_freed); @@ -1746,7 +1756,7 @@ static void store_stackinfo(struct kmem_ *addr++ = 0x12345678; *addr++ = caller; - *addr++ = smp_processor_id(); + *addr++ = raw_smp_processor_id(); size -= 3 * sizeof(unsigned long); { unsigned long *sptr = &caller; @@ -1936,6 +1946,10 @@ static void slab_destroy_debugcheck(stru } #endif +static void +__cache_free(struct kmem_cache *cachep, void *objp, int *this_cpu); + + /** * slab_destroy - destroy and release all objects in a slab * @cachep: cache pointer being destroyed @@ -1945,7 +1959,8 @@ static void slab_destroy_debugcheck(stru * Before calling the slab must have been unlinked from the cache. The * cache-lock is not held/needed. */ -static void slab_destroy(struct kmem_cache *cachep, struct slab *slabp) +static void +slab_destroy(struct kmem_cache *cachep, struct slab *slabp, int *this_cpu) { void *addr = slabp->s_mem - slabp->colouroff; @@ -1959,8 +1974,12 @@ static void slab_destroy(struct kmem_cac call_rcu(&slab_rcu->head, kmem_rcu_free); } else { kmem_freepages(cachep, addr); - if (OFF_SLAB(cachep)) - kmem_cache_free(cachep->slabp_cache, slabp); + if (OFF_SLAB(cachep)) { + if (this_cpu) + __cache_free(cachep->slabp_cache, slabp, this_cpu); + else + kmem_cache_free(cachep->slabp_cache, slabp); + } } } @@ -2057,6 +2076,8 @@ static size_t calculate_slab_order(struc static int __init_refok setup_cpu_cache(struct kmem_cache *cachep) { + int this_cpu; + if (g_cpucache_up == FULL) return enable_cpucache(cachep); @@ -2100,10 +2121,12 @@ static int __init_refok setup_cpu_cache( jiffies + REAPTIMEOUT_LIST3 + ((unsigned long)cachep) % REAPTIMEOUT_LIST3; - cpu_cache_get(cachep)->avail = 0; - cpu_cache_get(cachep)->limit = BOOT_CPUCACHE_ENTRIES; - cpu_cache_get(cachep)->batchcount = 1; - cpu_cache_get(cachep)->touched = 0; + this_cpu = raw_smp_processor_id(); + + cpu_cache_get(cachep, this_cpu)->avail = 0; + cpu_cache_get(cachep, this_cpu)->limit = BOOT_CPUCACHE_ENTRIES; + cpu_cache_get(cachep, this_cpu)->batchcount = 1; + cpu_cache_get(cachep, this_cpu)->touched = 0; cachep->batchcount = 1; cachep->limit = BOOT_CPUCACHE_ENTRIES; return 0; @@ -2394,19 +2417,19 @@ EXPORT_SYMBOL(kmem_cache_create); #if DEBUG static void check_irq_off(void) { +/* + * On PREEMPT_RT we use locks to protect the per-CPU lists, + * and keep interrupts enabled. + */ +#ifndef CONFIG_PREEMPT_RT BUG_ON(!irqs_disabled()); +#endif } static void check_irq_on(void) { +#ifndef CONFIG_PREEMPT_RT BUG_ON(irqs_disabled()); -} - -static void check_spinlock_acquired(struct kmem_cache *cachep) -{ -#ifdef CONFIG_SMP - check_irq_off(); - assert_spin_locked(&cachep->nodelists[numa_node_id()]->list_lock); #endif } @@ -2421,34 +2444,67 @@ static void check_spinlock_acquired_node #else #define check_irq_off() do { } while(0) #define check_irq_on() do { } while(0) -#define check_spinlock_acquired(x) do { } while(0) #define check_spinlock_acquired_node(x, y) do { } while(0) #endif -static void drain_array(struct kmem_cache *cachep, struct kmem_list3 *l3, +static int drain_array(struct kmem_cache *cachep, struct kmem_list3 *l3, struct array_cache *ac, int force, int node); -static void do_drain(void *arg) +static void __do_drain(void *arg, int this_cpu) { struct kmem_cache *cachep = arg; + int node = cpu_to_node(this_cpu); struct array_cache *ac; - int node = numa_node_id(); check_irq_off(); - ac = cpu_cache_get(cachep); + ac = cpu_cache_get(cachep, this_cpu); spin_lock(&cachep->nodelists[node]->list_lock); - free_block(cachep, ac->entry, ac->avail, node); + free_block(cachep, ac->entry, ac->avail, node, &this_cpu); spin_unlock(&cachep->nodelists[node]->list_lock); ac->avail = 0; } +#ifdef CONFIG_PREEMPT_RT +static void do_drain(void *arg, int this_cpu) +{ + __do_drain(arg, this_cpu); +} +#else +static void do_drain(void *arg) +{ + __do_drain(arg, smp_processor_id()); +} +#endif + +#ifdef CONFIG_PREEMPT_RT +/* + * execute func() for all CPUs. On PREEMPT_RT we dont actually have + * to run on the remote CPUs - we only have to take their CPU-locks. + * (This is a rare operation, so cacheline bouncing is not an issue.) + */ +static void +slab_on_each_cpu(void (*func)(void *arg, int this_cpu), void *arg) +{ + unsigned int i; + + check_irq_on(); + for_each_online_cpu(i) { + spin_lock(&__get_cpu_lock(slab_irq_locks, i)); + func(arg, i); + spin_unlock(&__get_cpu_lock(slab_irq_locks, i)); + } +} +#else +# define slab_on_each_cpu(func, cachep) on_each_cpu(func, cachep, 1) +#endif + static void drain_cpu_caches(struct kmem_cache *cachep) { struct kmem_list3 *l3; int node; - on_each_cpu(do_drain, cachep, 1); + slab_on_each_cpu(do_drain, cachep); check_irq_on(); for_each_online_node(node) { l3 = cachep->nodelists[node]; @@ -2473,16 +2529,16 @@ static int drain_freelist(struct kmem_ca struct kmem_list3 *l3, int tofree) { struct list_head *p; - int nr_freed; + int nr_freed, this_cpu; struct slab *slabp; nr_freed = 0; while (nr_freed < tofree && !list_empty(&l3->slabs_free)) { - spin_lock_irq(&l3->list_lock); + slab_spin_lock_irq(&l3->list_lock, this_cpu); p = l3->slabs_free.prev; if (p == &l3->slabs_free) { - spin_unlock_irq(&l3->list_lock); + slab_spin_unlock_irq(&l3->list_lock, this_cpu); goto out; } @@ -2491,13 +2547,9 @@ static int drain_freelist(struct kmem_ca BUG_ON(slabp->inuse); #endif list_del(&slabp->list); - /* - * Safe to drop the lock. The slab is no longer linked - * to the cache. - */ l3->free_objects -= cache->num; - spin_unlock_irq(&l3->list_lock); - slab_destroy(cache, slabp); + slab_destroy(cache, slabp, &this_cpu); + slab_spin_unlock_irq(&l3->list_lock, this_cpu); nr_freed++; } out: @@ -2753,8 +2805,8 @@ static void slab_map_pages(struct kmem_c * Grow (by 1) the number of slabs within a cache. This is called by * kmem_cache_alloc() when there are no active objs left in a cache. */ -static int cache_grow(struct kmem_cache *cachep, - gfp_t flags, int nodeid, void *objp) +static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid, + void *objp, int *this_cpu) { struct slab *slabp; size_t offset; @@ -2783,7 +2835,8 @@ static int cache_grow(struct kmem_cache offset *= cachep->colour_off; if (local_flags & __GFP_WAIT) - local_irq_enable(); + slab_irq_enable_nort(*this_cpu); + slab_irq_enable_rt(*this_cpu); /* * The test for missing atomic flag is performed here, rather than @@ -2812,8 +2865,10 @@ static int cache_grow(struct kmem_cache cache_init_objs(cachep, slabp); + slab_irq_disable_rt(*this_cpu); if (local_flags & __GFP_WAIT) - local_irq_disable(); + slab_irq_disable_nort(*this_cpu); + check_irq_off(); spin_lock(&l3->list_lock); @@ -2826,8 +2881,9 @@ static int cache_grow(struct kmem_cache opps1: kmem_freepages(cachep, objp); failed: + slab_irq_disable_rt(*this_cpu); if (local_flags & __GFP_WAIT) - local_irq_disable(); + slab_irq_disable_nort(*this_cpu); return 0; } @@ -2949,7 +3005,8 @@ bad: #define check_slabp(x,y) do { } while(0) #endif -static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) +static void * +cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags, int *this_cpu) { int batchcount; struct kmem_list3 *l3; @@ -2959,7 +3016,7 @@ static void *cache_alloc_refill(struct k retry: check_irq_off(); node = numa_node_id(); - ac = cpu_cache_get(cachep); + ac = cpu_cache_get(cachep, *this_cpu); batchcount = ac->batchcount; if (!ac->touched && batchcount > BATCHREFILL_LIMIT) { /* @@ -2969,7 +3026,7 @@ retry: */ batchcount = BATCHREFILL_LIMIT; } - l3 = cachep->nodelists[node]; + l3 = cachep->nodelists[cpu_to_node(*this_cpu)]; BUG_ON(ac->avail > 0 || !l3); spin_lock(&l3->list_lock); @@ -2992,7 +3049,7 @@ retry: slabp = list_entry(entry, struct slab, list); check_slabp(cachep, slabp); - check_spinlock_acquired(cachep); + check_spinlock_acquired_node(cachep, cpu_to_node(*this_cpu)); /* * The slab was either on partial or free list so @@ -3006,8 +3063,9 @@ retry: STATS_INC_ACTIVE(cachep); STATS_SET_HIGH(cachep); - ac->entry[ac->avail++] = slab_get_obj(cachep, slabp, - node); + ac->entry[ac->avail++] = + slab_get_obj(cachep, slabp, + cpu_to_node(*this_cpu)); } check_slabp(cachep, slabp); @@ -3026,10 +3084,10 @@ alloc_done: if (unlikely(!ac->avail)) { int x; - x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL); + x = cache_grow(cachep, flags | GFP_THISNODE, cpu_to_node(*this_cpu), NULL, this_cpu); /* cache_grow can reenable interrupts, then ac could change. */ - ac = cpu_cache_get(cachep); + ac = cpu_cache_get(cachep, *this_cpu); if (!x && ac->avail == 0) /* no objects in sight? abort */ return NULL; @@ -3116,21 +3174,22 @@ static bool slab_should_failslab(struct return should_failslab(obj_size(cachep), flags); } -static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags) +static inline void * +____cache_alloc(struct kmem_cache *cachep, gfp_t flags, int *this_cpu) { void *objp; struct array_cache *ac; check_irq_off(); - ac = cpu_cache_get(cachep); + ac = cpu_cache_get(cachep, *this_cpu); if (likely(ac->avail)) { STATS_INC_ALLOCHIT(cachep); ac->touched = 1; objp = ac->entry[--ac->avail]; } else { STATS_INC_ALLOCMISS(cachep); - objp = cache_alloc_refill(cachep, flags); + objp = cache_alloc_refill(cachep, flags, this_cpu); } return objp; } @@ -3142,7 +3201,8 @@ static inline void *____cache_alloc(stru * If we are in_interrupt, then process context, including cpusets and * mempolicy, may not apply and should not be used for allocation policy. */ -static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags) +static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags, + int *this_cpu) { int nid_alloc, nid_here; @@ -3154,7 +3214,7 @@ static void *alternate_node_alloc(struct else if (current->mempolicy) nid_alloc = slab_node(current->mempolicy); if (nid_alloc != nid_here) - return ____cache_alloc_node(cachep, flags, nid_alloc); + return ____cache_alloc_node(cachep, flags, nid_alloc, this_cpu); return NULL; } @@ -3166,7 +3226,7 @@ static void *alternate_node_alloc(struct * allocator to do its reclaim / fallback magic. We then insert the * slab into the proper nodelist and then allocate from it. */ -static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) +static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags, int *this_cpu) { struct zonelist *zonelist; gfp_t local_flags; @@ -3194,7 +3254,8 @@ retry: cache->nodelists[nid] && cache->nodelists[nid]->free_objects) { obj = ____cache_alloc_node(cache, - flags | GFP_THISNODE, nid); + flags | GFP_THISNODE, nid, + this_cpu); if (obj) break; } @@ -3208,19 +3269,24 @@ retry: * set and go into memory reserves if necessary. */ if (local_flags & __GFP_WAIT) - local_irq_enable(); + slab_irq_enable_nort(*this_cpu); + slab_irq_enable_rt(*this_cpu); + kmem_flagcheck(cache, flags); obj = kmem_getpages(cache, local_flags, -1); + + slab_irq_disable_rt(*this_cpu); if (local_flags & __GFP_WAIT) - local_irq_disable(); + slab_irq_disable_nort(*this_cpu); + if (obj) { /* * Insert into the appropriate per node queues */ nid = page_to_nid(virt_to_page(obj)); - if (cache_grow(cache, flags, nid, obj)) { + if (cache_grow(cache, flags, nid, obj, this_cpu)) { obj = ____cache_alloc_node(cache, - flags | GFP_THISNODE, nid); + flags | GFP_THISNODE, nid, this_cpu); if (!obj) /* * Another processor may allocate the @@ -3241,7 +3307,7 @@ retry: * A interface to enable slab creation on nodeid */ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, - int nodeid) + int nodeid, int *this_cpu) { struct list_head *entry; struct slab *slabp; @@ -3289,11 +3355,11 @@ retry: must_grow: spin_unlock(&l3->list_lock); - x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL); + x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL, this_cpu); if (x) goto retry; - return fallback_alloc(cachep, flags); + return fallback_alloc(cachep, flags, this_cpu); done: return obj; @@ -3316,40 +3382,47 @@ __cache_alloc_node(struct kmem_cache *ca void *caller) { unsigned long save_flags; + int this_cpu; void *ptr; + lockdep_trace_alloc(flags); + if (slab_should_failslab(cachep, flags)) return NULL; cache_alloc_debugcheck_before(cachep, flags); - local_irq_save(save_flags); + + slab_irq_save(save_flags, this_cpu); if (unlikely(nodeid == -1)) - nodeid = numa_node_id(); + nodeid = cpu_to_node(this_cpu); if (unlikely(!cachep->nodelists[nodeid])) { /* Node not bootstrapped yet */ - ptr = fallback_alloc(cachep, flags); + ptr = fallback_alloc(cachep, flags, &this_cpu); goto out; } - if (nodeid == numa_node_id()) { + if (nodeid == cpu_to_node(this_cpu)) { /* * Use the locally cached objects if possible. * However ____cache_alloc does not allow fallback * to other nodes. It may fail while we still have * objects on other nodes available. */ - ptr = ____cache_alloc(cachep, flags); + ptr = ____cache_alloc(cachep, flags, &this_cpu); if (ptr) goto out; } /* ___cache_alloc_node can fall back to other nodes */ - ptr = ____cache_alloc_node(cachep, flags, nodeid); + ptr = ____cache_alloc_node(cachep, flags, nodeid, &this_cpu); out: - local_irq_restore(save_flags); + slab_irq_restore(save_flags, this_cpu); ptr = cache_alloc_debugcheck_after(cachep, flags, ptr, caller); + if (likely(ptr)) + kmemcheck_slab_alloc(cachep, flags, ptr, obj_size(cachep)); + if (unlikely((flags & __GFP_ZERO) && ptr)) memset(ptr, 0, obj_size(cachep)); @@ -3357,33 +3430,33 @@ __cache_alloc_node(struct kmem_cache *ca } static __always_inline void * -__do_cache_alloc(struct kmem_cache *cache, gfp_t flags) +__do_cache_alloc(struct kmem_cache *cache, gfp_t flags, int *this_cpu) { void *objp; if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY))) { - objp = alternate_node_alloc(cache, flags); + objp = alternate_node_alloc(cache, flags, this_cpu); if (objp) goto out; } - objp = ____cache_alloc(cache, flags); + objp = ____cache_alloc(cache, flags, this_cpu); /* * We may just have run out of memory on the local node. * ____cache_alloc_node() knows how to locate memory on other nodes */ - if (!objp) - objp = ____cache_alloc_node(cache, flags, numa_node_id()); - + if (!objp) + objp = ____cache_alloc_node(cache, flags, + cpu_to_node(*this_cpu), this_cpu); out: return objp; } #else static __always_inline void * -__do_cache_alloc(struct kmem_cache *cachep, gfp_t flags) +__do_cache_alloc(struct kmem_cache *cachep, gfp_t flags, int *this_cpu) { - return ____cache_alloc(cachep, flags); + return ____cache_alloc(cachep, flags, this_cpu); } #endif /* CONFIG_NUMA */ @@ -3392,18 +3465,24 @@ static __always_inline void * __cache_alloc(struct kmem_cache *cachep, gfp_t flags, void *caller) { unsigned long save_flags; + int this_cpu; void *objp; + lockdep_trace_alloc(flags); + if (slab_should_failslab(cachep, flags)) return NULL; cache_alloc_debugcheck_before(cachep, flags); - local_irq_save(save_flags); - objp = __do_cache_alloc(cachep, flags); - local_irq_restore(save_flags); + slab_irq_save(save_flags, this_cpu); + objp = __do_cache_alloc(cachep, flags, &this_cpu); + slab_irq_restore(save_flags, this_cpu); objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller); prefetchw(objp); + if (likely(objp)) + kmemcheck_slab_alloc(cachep, flags, objp, obj_size(cachep)); + if (unlikely((flags & __GFP_ZERO) && objp)) memset(objp, 0, obj_size(cachep)); @@ -3414,7 +3493,7 @@ __cache_alloc(struct kmem_cache *cachep, * Caller needs to acquire correct kmem_list's list_lock */ static void free_block(struct kmem_cache *cachep, void **objpp, int nr_objects, - int node) + int node, int *this_cpu) { int i; struct kmem_list3 *l3; @@ -3443,7 +3522,7 @@ static void free_block(struct kmem_cache * a different cache, refer to comments before * alloc_slabmgmt. */ - slab_destroy(cachep, slabp); + slab_destroy(cachep, slabp, this_cpu); } else { list_add(&slabp->list, &l3->slabs_free); } @@ -3457,11 +3536,12 @@ static void free_block(struct kmem_cache } } -static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac) +static void +cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac, int *this_cpu) { int batchcount; struct kmem_list3 *l3; - int node = numa_node_id(); + int node = cpu_to_node(*this_cpu); batchcount = ac->batchcount; #if DEBUG @@ -3483,7 +3563,7 @@ static void cache_flusharray(struct kmem } } - free_block(cachep, ac->entry, batchcount, node); + free_block(cachep, ac->entry, batchcount, node, this_cpu); free_done: #if STATS { @@ -3512,13 +3592,15 @@ free_done: * Release an obj back to its cache. If the obj has a constructed state, it must * be in this state _before_ it is released. Called with disabled ints. */ -static inline void __cache_free(struct kmem_cache *cachep, void *objp) +static void __cache_free(struct kmem_cache *cachep, void *objp, int *this_cpu) { - struct array_cache *ac = cpu_cache_get(cachep); + struct array_cache *ac = cpu_cache_get(cachep, *this_cpu); check_irq_off(); objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0)); + kmemcheck_slab_free(cachep, objp, obj_size(cachep)); + /* * Skip calling cache_free_alien() when the platform is not numa. * This will avoid cache misses that happen while accessing slabp (which @@ -3526,7 +3608,7 @@ static inline void __cache_free(struct k * variable to skip the call, which is mostly likely to be present in * the cache. */ - if (numa_platform && cache_free_alien(cachep, objp)) + if (numa_platform && cache_free_alien(cachep, objp, this_cpu)) return; if (likely(ac->avail < ac->limit)) { @@ -3535,7 +3617,7 @@ static inline void __cache_free(struct k return; } else { STATS_INC_FREEMISS(cachep); - cache_flusharray(cachep, ac); + cache_flusharray(cachep, ac, this_cpu); ac->entry[ac->avail++] = objp; } } @@ -3550,10 +3632,23 @@ static inline void __cache_free(struct k */ void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) { - return __cache_alloc(cachep, flags, __builtin_return_address(0)); + void *ret = __cache_alloc(cachep, flags, __builtin_return_address(0)); + + kmemtrace_mark_alloc(KMEMTRACE_TYPE_CACHE, _RET_IP_, ret, + obj_size(cachep), cachep->buffer_size, flags); + + return ret; } EXPORT_SYMBOL(kmem_cache_alloc); +#ifdef CONFIG_KMEMTRACE +void *kmem_cache_alloc_notrace(struct kmem_cache *cachep, gfp_t flags) +{ + return __cache_alloc(cachep, flags, __builtin_return_address(0)); +} +EXPORT_SYMBOL(kmem_cache_alloc_notrace); +#endif + /** * kmem_ptr_validate - check if an untrusted pointer might be a slab entry. * @cachep: the cache we're checking against @@ -3598,23 +3693,47 @@ out: #ifdef CONFIG_NUMA void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid) { - return __cache_alloc_node(cachep, flags, nodeid, - __builtin_return_address(0)); + void *ret = __cache_alloc_node(cachep, flags, nodeid, + __builtin_return_address(0)); + + kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_CACHE, _RET_IP_, ret, + obj_size(cachep), cachep->buffer_size, + flags, nodeid); + + return ret; } EXPORT_SYMBOL(kmem_cache_alloc_node); +#ifdef CONFIG_KMEMTRACE +void *kmem_cache_alloc_node_notrace(struct kmem_cache *cachep, + gfp_t flags, + int nodeid) +{ + return __cache_alloc_node(cachep, flags, nodeid, + __builtin_return_address(0)); +} +EXPORT_SYMBOL(kmem_cache_alloc_node_notrace); +#endif + static __always_inline void * __do_kmalloc_node(size_t size, gfp_t flags, int node, void *caller) { struct kmem_cache *cachep; + void *ret; cachep = kmem_find_general_cachep(size, flags); if (unlikely(ZERO_OR_NULL_PTR(cachep))) return cachep; - return kmem_cache_alloc_node(cachep, flags, node); + ret = kmem_cache_alloc_node_notrace(cachep, flags, node); + + kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_KMALLOC, + (unsigned long) caller, ret, + size, cachep->buffer_size, flags, node); + + return ret; } -#ifdef CONFIG_DEBUG_SLAB +#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_KMEMTRACE) void *__kmalloc_node(size_t size, gfp_t flags, int node) { return __do_kmalloc_node(size, flags, node, @@ -3647,6 +3766,7 @@ static __always_inline void *__do_kmallo void *caller) { struct kmem_cache *cachep; + void *ret; /* If you want to save a few bytes .text space: replace * __ with kmem_. @@ -3656,11 +3776,17 @@ static __always_inline void *__do_kmallo cachep = __find_general_cachep(size, flags); if (unlikely(ZERO_OR_NULL_PTR(cachep))) return cachep; - return __cache_alloc(cachep, flags, caller); + ret = __cache_alloc(cachep, flags, caller); + + kmemtrace_mark_alloc(KMEMTRACE_TYPE_KMALLOC, + (unsigned long) caller, ret, + size, cachep->buffer_size, flags); + + return ret; } -#ifdef CONFIG_DEBUG_SLAB +#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_KMEMTRACE) void *__kmalloc(size_t size, gfp_t flags) { return __do_kmalloc(size, flags, __builtin_return_address(0)); @@ -3692,13 +3818,16 @@ EXPORT_SYMBOL(__kmalloc); void kmem_cache_free(struct kmem_cache *cachep, void *objp) { unsigned long flags; + int this_cpu; - local_irq_save(flags); + slab_irq_save(flags, this_cpu); debug_check_no_locks_freed(objp, obj_size(cachep)); if (!(cachep->flags & SLAB_DEBUG_OBJECTS)) debug_check_no_obj_freed(objp, obj_size(cachep)); - __cache_free(cachep, objp); - local_irq_restore(flags); + __cache_free(cachep, objp, &this_cpu); + slab_irq_restore(flags, this_cpu); + + kmemtrace_mark_free(KMEMTRACE_TYPE_CACHE, _RET_IP_, objp); } EXPORT_SYMBOL(kmem_cache_free); @@ -3715,16 +3844,19 @@ void kfree(const void *objp) { struct kmem_cache *c; unsigned long flags; + int this_cpu; if (unlikely(ZERO_OR_NULL_PTR(objp))) return; - local_irq_save(flags); + slab_irq_save(flags, this_cpu); kfree_debugcheck(objp); c = virt_to_cache(objp); debug_check_no_locks_freed(objp, obj_size(c)); debug_check_no_obj_freed(objp, obj_size(c)); - __cache_free(c, (void *)objp); - local_irq_restore(flags); + __cache_free(c, (void *)objp, &this_cpu); + slab_irq_restore(flags, this_cpu); + + kmemtrace_mark_free(KMEMTRACE_TYPE_KMALLOC, _RET_IP_, objp); } EXPORT_SYMBOL(kfree); @@ -3745,7 +3877,7 @@ EXPORT_SYMBOL_GPL(kmem_cache_name); */ static int alloc_kmemlist(struct kmem_cache *cachep) { - int node; + int node, this_cpu; struct kmem_list3 *l3; struct array_cache *new_shared; struct array_cache **new_alien = NULL; @@ -3773,11 +3905,11 @@ static int alloc_kmemlist(struct kmem_ca if (l3) { struct array_cache *shared = l3->shared; - spin_lock_irq(&l3->list_lock); + slab_spin_lock_irq(&l3->list_lock, this_cpu); if (shared) free_block(cachep, shared->entry, - shared->avail, node); + shared->avail, node, &this_cpu); l3->shared = new_shared; if (!l3->alien) { @@ -3786,7 +3918,7 @@ static int alloc_kmemlist(struct kmem_ca } l3->free_limit = (1 + nr_cpus_node(node)) * cachep->batchcount + cachep->num; - spin_unlock_irq(&l3->list_lock); + slab_spin_unlock_irq(&l3->list_lock, this_cpu); kfree(shared); free_alien_cache(new_alien); continue; @@ -3833,42 +3965,50 @@ struct ccupdate_struct { struct array_cache *new[NR_CPUS]; }; -static void do_ccupdate_local(void *info) +static void __do_ccupdate_local(void *info, int this_cpu) { struct ccupdate_struct *new = info; struct array_cache *old; check_irq_off(); - old = cpu_cache_get(new->cachep); + old = cpu_cache_get(new->cachep, this_cpu); - new->cachep->array[smp_processor_id()] = new->new[smp_processor_id()]; - new->new[smp_processor_id()] = old; + new->cachep->array[this_cpu] = new->new[this_cpu]; + new->new[this_cpu] = old; +} + +#ifdef CONFIG_PREEMPT_RT +static void do_ccupdate_local(void *arg, int this_cpu) +{ + __do_ccupdate_local(arg, this_cpu); } +#else +static void do_ccupdate_local(void *arg) +{ + __do_ccupdate_local(arg, smp_processor_id()); +} +#endif /* Always called with the cache_chain_mutex held */ static int do_tune_cpucache(struct kmem_cache *cachep, int limit, int batchcount, int shared) { - struct ccupdate_struct *new; - int i; - - new = kzalloc(sizeof(*new), GFP_KERNEL); - if (!new) - return -ENOMEM; + struct ccupdate_struct new; + int i, this_cpu; + memset(&new.new, 0, sizeof(new.new)); for_each_online_cpu(i) { - new->new[i] = alloc_arraycache(cpu_to_node(i), limit, + new.new[i] = alloc_arraycache(cpu_to_node(i), limit, batchcount); - if (!new->new[i]) { + if (!new.new[i]) { for (i--; i >= 0; i--) - kfree(new->new[i]); - kfree(new); + kfree(new.new[i]); return -ENOMEM; } } - new->cachep = cachep; + new.cachep = cachep; - on_each_cpu(do_ccupdate_local, (void *)new, 1); + slab_on_each_cpu(do_ccupdate_local, (void *)&new); check_irq_on(); cachep->batchcount = batchcount; @@ -3876,15 +4016,15 @@ static int do_tune_cpucache(struct kmem_ cachep->shared = shared; for_each_online_cpu(i) { - struct array_cache *ccold = new->new[i]; + struct array_cache *ccold = new.new[i]; if (!ccold) continue; - spin_lock_irq(&cachep->nodelists[cpu_to_node(i)]->list_lock); - free_block(cachep, ccold->entry, ccold->avail, cpu_to_node(i)); - spin_unlock_irq(&cachep->nodelists[cpu_to_node(i)]->list_lock); + slab_spin_lock_irq(&cachep->nodelists[cpu_to_node(i)]->list_lock, this_cpu); + free_block(cachep, ccold->entry, ccold->avail, cpu_to_node(i), &this_cpu); + slab_spin_unlock_irq(&cachep->nodelists[cpu_to_node(i)]->list_lock, this_cpu); kfree(ccold); } - kfree(new); + return alloc_kmemlist(cachep); } @@ -3946,29 +4086,31 @@ static int enable_cpucache(struct kmem_c * Drain an array if it contains any elements taking the l3 lock only if * necessary. Note that the l3 listlock also protects the array_cache * if drain_array() is used on the shared array. + * returns non-zero if some work is done */ -void drain_array(struct kmem_cache *cachep, struct kmem_list3 *l3, - struct array_cache *ac, int force, int node) +int drain_array(struct kmem_cache *cachep, struct kmem_list3 *l3, + struct array_cache *ac, int force, int node) { - int tofree; + int tofree, this_cpu; if (!ac || !ac->avail) - return; + return 0; if (ac->touched && !force) { ac->touched = 0; } else { - spin_lock_irq(&l3->list_lock); + slab_spin_lock_irq(&l3->list_lock, this_cpu); if (ac->avail) { tofree = force ? ac->avail : (ac->limit + 4) / 5; if (tofree > ac->avail) tofree = (ac->avail + 1) / 2; - free_block(cachep, ac->entry, tofree, node); + free_block(cachep, ac->entry, tofree, node, &this_cpu); ac->avail -= tofree; memmove(ac->entry, &(ac->entry[tofree]), sizeof(void *) * ac->avail); } - spin_unlock_irq(&l3->list_lock); + slab_spin_unlock_irq(&l3->list_lock, this_cpu); } + return 1; } /** @@ -3985,11 +4127,12 @@ void drain_array(struct kmem_cache *cach */ static void cache_reap(struct work_struct *w) { + int this_cpu = raw_smp_processor_id(), node = cpu_to_node(this_cpu); struct kmem_cache *searchp; struct kmem_list3 *l3; - int node = numa_node_id(); struct delayed_work *work = container_of(w, struct delayed_work, work); + int work_done = 0; if (!mutex_trylock(&cache_chain_mutex)) /* Give up. Setup the next iteration. */ @@ -4005,9 +4148,12 @@ static void cache_reap(struct work_struc */ l3 = searchp->nodelists[node]; - reap_alien(searchp, l3); + work_done += reap_alien(searchp, l3, &this_cpu); + + node = cpu_to_node(this_cpu); - drain_array(searchp, l3, cpu_cache_get(searchp), 0, node); + work_done += drain_array(searchp, l3, + cpu_cache_get(searchp, this_cpu), 0, node); /* * These are racy checks but it does not matter @@ -4018,7 +4164,7 @@ static void cache_reap(struct work_struc l3->next_reap = jiffies + REAPTIMEOUT_LIST3; - drain_array(searchp, l3, l3->shared, 0, node); + work_done += drain_array(searchp, l3, l3->shared, 0, node); if (l3->free_touched) l3->free_touched = 0; @@ -4037,7 +4183,8 @@ next: next_reap_node(); out: /* Set up the next iteration */ - schedule_delayed_work(work, round_jiffies_relative(REAPTIMEOUT_CPUC)); + schedule_delayed_work(work, + round_jiffies_relative((1+!work_done) * REAPTIMEOUT_CPUC)); } #ifdef CONFIG_SLABINFO @@ -4096,7 +4243,7 @@ static int s_show(struct seq_file *m, vo unsigned long num_slabs, free_objects = 0, shared_avail = 0; const char *name; char *error = NULL; - int node; + int this_cpu, node; struct kmem_list3 *l3; active_objs = 0; @@ -4107,7 +4254,7 @@ static int s_show(struct seq_file *m, vo continue; check_irq_on(); - spin_lock_irq(&l3->list_lock); + slab_spin_lock_irq(&l3->list_lock, this_cpu); list_for_each_entry(slabp, &l3->slabs_full, list) { if (slabp->inuse != cachep->num && !error) @@ -4132,7 +4279,7 @@ static int s_show(struct seq_file *m, vo if (l3->shared) shared_avail += l3->shared->avail; - spin_unlock_irq(&l3->list_lock); + slab_spin_unlock_irq(&l3->list_lock, this_cpu); } num_slabs += active_slabs; num_objs = num_slabs * cachep->num; @@ -4341,7 +4488,7 @@ static int leaks_show(struct seq_file *m struct kmem_list3 *l3; const char *name; unsigned long *n = m->private; - int node; + int node, this_cpu; int i; if (!(cachep->flags & SLAB_STORE_USER)) @@ -4359,13 +4506,13 @@ static int leaks_show(struct seq_file *m continue; check_irq_on(); - spin_lock_irq(&l3->list_lock); + slab_spin_lock_irq(&l3->list_lock, this_cpu); list_for_each_entry(slabp, &l3->slabs_full, list) handle_slab(n, cachep, slabp); list_for_each_entry(slabp, &l3->slabs_partial, list) handle_slab(n, cachep, slabp); - spin_unlock_irq(&l3->list_lock); + slab_spin_unlock_irq(&l3->list_lock, this_cpu); } name = cachep->name; if (n[0] == n[1]) { Index: linux-2.6-tip/mm/slob.c =================================================================== --- linux-2.6-tip.orig/mm/slob.c +++ linux-2.6-tip/mm/slob.c @@ -65,6 +65,7 @@ #include #include #include +#include #include /* @@ -463,27 +464,40 @@ void *__kmalloc_node(size_t size, gfp_t { unsigned int *m; int align = max(ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN); + void *ret; + + lockdep_trace_alloc(flags); if (size < PAGE_SIZE - align) { if (!size) return ZERO_SIZE_PTR; m = slob_alloc(size + align, gfp, align, node); + if (!m) return NULL; *m = size; - return (void *)m + align; + ret = (void *)m + align; + + kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_KMALLOC, + _RET_IP_, ret, + size, size + align, gfp, node); } else { - void *ret; + unsigned int order = get_order(size); - ret = slob_new_page(gfp | __GFP_COMP, get_order(size), node); + ret = slob_new_page(gfp | __GFP_COMP, order, node); if (ret) { struct page *page; page = virt_to_page(ret); page->private = size; } - return ret; + + kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_KMALLOC, + _RET_IP_, ret, + size, PAGE_SIZE << order, gfp, node); } + + return ret; } EXPORT_SYMBOL(__kmalloc_node); @@ -501,6 +515,8 @@ void kfree(const void *block) slob_free(m, *m + align); } else put_page(&sp->page); + + kmemtrace_mark_free(KMEMTRACE_TYPE_KMALLOC, _RET_IP_, block); } EXPORT_SYMBOL(kfree); @@ -570,10 +586,19 @@ void *kmem_cache_alloc_node(struct kmem_ { void *b; - if (c->size < PAGE_SIZE) + if (c->size < PAGE_SIZE) { b = slob_alloc(c->size, flags, c->align, node); - else + kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_CACHE, + _RET_IP_, b, c->size, + SLOB_UNITS(c->size) * SLOB_UNIT, + flags, node); + } else { b = slob_new_page(flags, get_order(c->size), node); + kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_CACHE, + _RET_IP_, b, c->size, + PAGE_SIZE << get_order(c->size), + flags, node); + } if (c->ctor) c->ctor(b); @@ -609,6 +634,8 @@ void kmem_cache_free(struct kmem_cache * } else { __kmem_cache_free(b, c->size); } + + kmemtrace_mark_free(KMEMTRACE_TYPE_CACHE, _RET_IP_, b); } EXPORT_SYMBOL(kmem_cache_free); Index: linux-2.6-tip/mm/slub.c =================================================================== --- linux-2.6-tip.orig/mm/slub.c +++ linux-2.6-tip/mm/slub.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include #include @@ -24,6 +25,7 @@ #include #include #include +#include #include /* @@ -144,7 +146,7 @@ SLAB_TRACE | SLAB_DESTROY_BY_RCU) #define SLUB_MERGE_SAME (SLAB_DEBUG_FREE | SLAB_RECLAIM_ACCOUNT | \ - SLAB_CACHE_DMA) + SLAB_CACHE_DMA | SLAB_NOTRACK) #ifndef ARCH_KMALLOC_MINALIGN #define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long) @@ -1068,6 +1070,8 @@ static inline struct page *alloc_slab_pa { int order = oo_order(oo); + flags |= __GFP_NOTRACK; + if (node == -1) return alloc_pages(flags, order); else @@ -1095,6 +1099,24 @@ static struct page *allocate_slab(struct stat(get_cpu_slab(s, raw_smp_processor_id()), ORDER_FALLBACK); } + + if (kmemcheck_enabled + && !(s->flags & (SLAB_NOTRACK | DEBUG_DEFAULT_FLAGS))) + { + int pages = 1 << oo_order(oo); + + kmemcheck_alloc_shadow(page, oo_order(oo), flags, node); + + /* + * Objects from caches that have a constructor don't get + * cleared when they're allocated, so we need to do it here. + */ + if (s->ctor) + kmemcheck_mark_uninitialized_pages(page, pages); + else + kmemcheck_mark_unallocated_pages(page, pages); + } + page->objects = oo_objects(oo); mod_zone_page_state(page_zone(page), (s->flags & SLAB_RECLAIM_ACCOUNT) ? @@ -1168,6 +1190,8 @@ static void __free_slab(struct kmem_cach __ClearPageSlubDebug(page); } + kmemcheck_free_shadow(page, compound_order(page)); + mod_zone_page_state(page_zone(page), (s->flags & SLAB_RECLAIM_ACCOUNT) ? NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE, @@ -1596,6 +1620,7 @@ static __always_inline void *slab_alloc( unsigned long flags; unsigned int objsize; + lockdep_trace_alloc(gfpflags); might_sleep_if(gfpflags & __GFP_WAIT); if (should_failslab(s->objsize, gfpflags)) @@ -1618,23 +1643,52 @@ static __always_inline void *slab_alloc( if (unlikely((gfpflags & __GFP_ZERO) && object)) memset(object, 0, objsize); + kmemcheck_slab_alloc(s, gfpflags, object, c->objsize); return object; } void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags) { - return slab_alloc(s, gfpflags, -1, _RET_IP_); + void *ret = slab_alloc(s, gfpflags, -1, _RET_IP_); + + kmemtrace_mark_alloc(KMEMTRACE_TYPE_CACHE, _RET_IP_, ret, + s->objsize, s->size, gfpflags); + + return ret; } EXPORT_SYMBOL(kmem_cache_alloc); +#ifdef CONFIG_KMEMTRACE +void *kmem_cache_alloc_notrace(struct kmem_cache *s, gfp_t gfpflags) +{ + return slab_alloc(s, gfpflags, -1, _RET_IP_); +} +EXPORT_SYMBOL(kmem_cache_alloc_notrace); +#endif + #ifdef CONFIG_NUMA void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node) { - return slab_alloc(s, gfpflags, node, _RET_IP_); + void *ret = slab_alloc(s, gfpflags, node, _RET_IP_); + + kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_CACHE, _RET_IP_, ret, + s->objsize, s->size, gfpflags, node); + + return ret; } EXPORT_SYMBOL(kmem_cache_alloc_node); #endif +#ifdef CONFIG_KMEMTRACE +void *kmem_cache_alloc_node_notrace(struct kmem_cache *s, + gfp_t gfpflags, + int node) +{ + return slab_alloc(s, gfpflags, node, _RET_IP_); +} +EXPORT_SYMBOL(kmem_cache_alloc_node_notrace); +#endif + /* * Slow patch handling. This may still be called frequently since objects * have a longer lifetime than the cpu slabs in most processing loads. @@ -1722,6 +1776,7 @@ static __always_inline void slab_free(st local_irq_save(flags); c = get_cpu_slab(s, smp_processor_id()); + kmemcheck_slab_free(s, object, c->objsize); debug_check_no_locks_freed(object, c->objsize); if (!(s->flags & SLAB_DEBUG_OBJECTS)) debug_check_no_obj_freed(object, s->objsize); @@ -1742,6 +1797,8 @@ void kmem_cache_free(struct kmem_cache * page = virt_to_head_page(x); slab_free(s, page, x, _RET_IP_); + + kmemtrace_mark_free(KMEMTRACE_TYPE_CACHE, _RET_IP_, x); } EXPORT_SYMBOL(kmem_cache_free); @@ -2475,7 +2532,7 @@ EXPORT_SYMBOL(kmem_cache_destroy); * Kmalloc subsystem *******************************************************************/ -struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1] __cacheline_aligned; +struct kmem_cache kmalloc_caches[SLUB_PAGE_SHIFT] __cacheline_aligned; EXPORT_SYMBOL(kmalloc_caches); static int __init setup_slub_min_order(char *str) @@ -2537,7 +2594,7 @@ panic: } #ifdef CONFIG_ZONE_DMA -static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT + 1]; +static struct kmem_cache *kmalloc_caches_dma[SLUB_PAGE_SHIFT]; static void sysfs_add_func(struct work_struct *w) { @@ -2583,7 +2640,8 @@ static noinline struct kmem_cache *dma_k if (!s || !text || !kmem_cache_open(s, flags, text, realsize, ARCH_KMALLOC_MINALIGN, - SLAB_CACHE_DMA|__SYSFS_ADD_DEFERRED, NULL)) { + SLAB_CACHE_DMA|SLAB_NOTRACK|__SYSFS_ADD_DEFERRED, + NULL)) { kfree(s); kfree(text); goto unlock_out; @@ -2657,8 +2715,9 @@ static struct kmem_cache *get_slab(size_ void *__kmalloc(size_t size, gfp_t flags) { struct kmem_cache *s; + void *ret; - if (unlikely(size > PAGE_SIZE)) + if (unlikely(size > SLUB_MAX_SIZE)) return kmalloc_large(size, flags); s = get_slab(size, flags); @@ -2666,15 +2725,21 @@ void *__kmalloc(size_t size, gfp_t flags if (unlikely(ZERO_OR_NULL_PTR(s))) return s; - return slab_alloc(s, flags, -1, _RET_IP_); + ret = slab_alloc(s, flags, -1, _RET_IP_); + + kmemtrace_mark_alloc(KMEMTRACE_TYPE_KMALLOC, _RET_IP_, ret, + size, s->size, flags); + + return ret; } EXPORT_SYMBOL(__kmalloc); static void *kmalloc_large_node(size_t size, gfp_t flags, int node) { - struct page *page = alloc_pages_node(node, flags | __GFP_COMP, - get_order(size)); + struct page *page; + flags |= __GFP_COMP | __GFP_NOTRACK; + page = alloc_pages_node(node, flags, get_order(size)); if (page) return page_address(page); else @@ -2685,16 +2750,30 @@ static void *kmalloc_large_node(size_t s void *__kmalloc_node(size_t size, gfp_t flags, int node) { struct kmem_cache *s; + void *ret; + + if (unlikely(size > SLUB_MAX_SIZE)) { + ret = kmalloc_large_node(size, flags, node); - if (unlikely(size > PAGE_SIZE)) - return kmalloc_large_node(size, flags, node); + kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_KMALLOC, + _RET_IP_, ret, + size, PAGE_SIZE << get_order(size), + flags, node); + + return ret; + } s = get_slab(size, flags); if (unlikely(ZERO_OR_NULL_PTR(s))) return s; - return slab_alloc(s, flags, node, _RET_IP_); + ret = slab_alloc(s, flags, node, _RET_IP_); + + kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_KMALLOC, _RET_IP_, ret, + size, s->size, flags, node); + + return ret; } EXPORT_SYMBOL(__kmalloc_node); #endif @@ -2753,6 +2832,8 @@ void kfree(const void *x) return; } slab_free(page->slab, page, object, _RET_IP_); + + kmemtrace_mark_free(KMEMTRACE_TYPE_KMALLOC, _RET_IP_, x); } EXPORT_SYMBOL(kfree); @@ -2986,7 +3067,7 @@ void __init kmem_cache_init(void) caches++; } - for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) { + for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++) { create_kmalloc_cache(&kmalloc_caches[i], "kmalloc", 1 << i, GFP_KERNEL); caches++; @@ -3023,7 +3104,7 @@ void __init kmem_cache_init(void) slab_state = UP; /* Provide the correct kmalloc names now that the caches are up */ - for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) + for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++) kmalloc_caches[i]. name = kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i); @@ -3222,8 +3303,9 @@ static struct notifier_block __cpuinitda void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller) { struct kmem_cache *s; + void *ret; - if (unlikely(size > PAGE_SIZE)) + if (unlikely(size > SLUB_MAX_SIZE)) return kmalloc_large(size, gfpflags); s = get_slab(size, gfpflags); @@ -3231,15 +3313,22 @@ void *__kmalloc_track_caller(size_t size if (unlikely(ZERO_OR_NULL_PTR(s))) return s; - return slab_alloc(s, gfpflags, -1, caller); + ret = slab_alloc(s, gfpflags, -1, caller); + + /* Honor the call site pointer we recieved. */ + kmemtrace_mark_alloc(KMEMTRACE_TYPE_KMALLOC, caller, ret, size, + s->size, gfpflags); + + return ret; } void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags, int node, unsigned long caller) { struct kmem_cache *s; + void *ret; - if (unlikely(size > PAGE_SIZE)) + if (unlikely(size > SLUB_MAX_SIZE)) return kmalloc_large_node(size, gfpflags, node); s = get_slab(size, gfpflags); @@ -3247,7 +3336,13 @@ void *__kmalloc_node_track_caller(size_t if (unlikely(ZERO_OR_NULL_PTR(s))) return s; - return slab_alloc(s, gfpflags, node, caller); + ret = slab_alloc(s, gfpflags, node, caller); + + /* Honor the call site pointer we recieved. */ + kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_KMALLOC, caller, ret, + size, s->size, gfpflags, node); + + return ret; } #ifdef CONFIG_SLUB_DEBUG @@ -4302,6 +4397,8 @@ static char *create_unique_id(struct kme *p++ = 'a'; if (s->flags & SLAB_DEBUG_FREE) *p++ = 'F'; + if (!(s->flags & SLAB_NOTRACK)) + *p++ = 't'; if (p != name + 1) *p++ = '-'; p += sprintf(p, "%07d", s->size); Index: linux-2.6-tip/mm/swapfile.c =================================================================== --- linux-2.6-tip.orig/mm/swapfile.c +++ linux-2.6-tip/mm/swapfile.c @@ -585,13 +585,14 @@ int free_swap_and_cache(swp_entry_t entr p = swap_info_get(entry); if (p) { if (swap_entry_free(p, entry) == 1) { + spin_unlock(&swap_lock); page = find_get_page(&swapper_space, entry.val); if (page && !trylock_page(page)) { page_cache_release(page); page = NULL; } - } - spin_unlock(&swap_lock); + } else + spin_unlock(&swap_lock); } if (page) { /* @@ -1649,7 +1650,7 @@ SYSCALL_DEFINE2(swapon, const char __use union swap_header *swap_header = NULL; unsigned int nr_good_pages = 0; int nr_extents = 0; - sector_t span; + sector_t uninitialized_var(span); unsigned long maxpages = 1; unsigned long swapfilepages; unsigned short *swap_map = NULL; Index: linux-2.6-tip/mm/vmscan.c =================================================================== --- linux-2.6-tip.orig/mm/vmscan.c +++ linux-2.6-tip/mm/vmscan.c @@ -23,6 +23,7 @@ #include #include #include +#include #include /* for try_to_release_page(), buffer_heads_over_limit */ #include @@ -1125,7 +1126,7 @@ static unsigned long shrink_inactive_lis } nr_reclaimed += nr_freed; - local_irq_disable(); + local_irq_disable_nort(); if (current_is_kswapd()) { __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan); __count_vm_events(KSWAPD_STEAL, nr_freed); @@ -1166,9 +1167,14 @@ static unsigned long shrink_inactive_lis } } } while (nr_scanned < max_scan); + /* + * Non-PREEMPT_RT relies on IRQs-off protecting the page_states + * per-CPU data. PREEMPT_RT has that data protected even in + * __mod_page_state(), so no need to keep IRQs disabled. + */ spin_unlock(&zone->lru_lock); done: - local_irq_enable(); + local_irq_enable_nort(); pagevec_release(&pvec); return nr_reclaimed; } @@ -1965,6 +1971,8 @@ static int kswapd(void *p) }; node_to_cpumask_ptr(cpumask, pgdat->node_id); + lockdep_set_current_reclaim_state(GFP_KERNEL); + if (!cpumask_empty(cpumask)) set_cpus_allowed_ptr(tsk, cpumask); current->reclaim_state = &reclaim_state; Index: linux-2.6-tip/net/9p/Kconfig =================================================================== --- linux-2.6-tip.orig/net/9p/Kconfig +++ linux-2.6-tip/net/9p/Kconfig @@ -4,6 +4,8 @@ menuconfig NET_9P depends on NET && EXPERIMENTAL + # build breakage + depends on 0 tristate "Plan 9 Resource Sharing Support (9P2000) (Experimental)" help If you say Y here, you will get experimental support for Index: linux-2.6-tip/net/core/skbuff.c =================================================================== --- linux-2.6-tip.orig/net/core/skbuff.c +++ linux-2.6-tip/net/core/skbuff.c @@ -39,6 +39,7 @@ #include #include #include +#include #include #include #include @@ -205,6 +206,8 @@ struct sk_buff *__alloc_skb(unsigned int skb->data = data; skb_reset_tail_pointer(skb); skb->end = skb->tail + size; + kmemcheck_annotate_bitfield(skb->flags1); + kmemcheck_annotate_bitfield(skb->flags2); /* make sure we initialize shinfo sequentially */ shinfo = skb_shinfo(skb); atomic_set(&shinfo->dataref, 1); @@ -219,6 +222,8 @@ struct sk_buff *__alloc_skb(unsigned int struct sk_buff *child = skb + 1; atomic_t *fclone_ref = (atomic_t *) (child + 1); + kmemcheck_annotate_bitfield(child->flags1); + kmemcheck_annotate_bitfield(child->flags2); skb->fclone = SKB_FCLONE_ORIG; atomic_set(fclone_ref, 1); @@ -248,7 +253,7 @@ nodata: struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int length, gfp_t gfp_mask) { - int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1; + int node = dev_to_node(&dev->dev); struct sk_buff *skb; skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node); @@ -386,7 +391,7 @@ static void skb_release_head_state(struc secpath_put(skb->sp); #endif if (skb->destructor) { - WARN_ON(in_irq()); +// WARN_ON(in_irq()); skb->destructor(skb); } #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE) @@ -608,6 +613,9 @@ struct sk_buff *skb_clone(struct sk_buff n = kmem_cache_alloc(skbuff_head_cache, gfp_mask); if (!n) return NULL; + + kmemcheck_annotate_bitfield(n->flags1); + kmemcheck_annotate_bitfield(n->flags2); n->fclone = SKB_FCLONE_UNAVAILABLE; } Index: linux-2.6-tip/net/ipv4/inet_timewait_sock.c =================================================================== --- linux-2.6-tip.orig/net/ipv4/inet_timewait_sock.c +++ linux-2.6-tip/net/ipv4/inet_timewait_sock.c @@ -9,6 +9,7 @@ */ #include +#include #include #include #include @@ -117,6 +118,8 @@ struct inet_timewait_sock *inet_twsk_all if (tw != NULL) { const struct inet_sock *inet = inet_sk(sk); + kmemcheck_annotate_bitfield(tw->flags); + /* Give us an identity. */ tw->tw_daddr = inet->daddr; tw->tw_rcv_saddr = inet->rcv_saddr; Index: linux-2.6-tip/net/netfilter/ipvs/ip_vs_ctl.c =================================================================== --- linux-2.6-tip.orig/net/netfilter/ipvs/ip_vs_ctl.c +++ linux-2.6-tip/net/netfilter/ipvs/ip_vs_ctl.c @@ -2315,6 +2315,7 @@ __ip_vs_get_dest_entries(const struct ip static inline void __ip_vs_get_timeouts(struct ip_vs_timeout_user *u) { + memset(u, 0, sizeof(*u)); #ifdef CONFIG_IP_VS_PROTO_TCP u->tcp_timeout = ip_vs_protocol_tcp.timeout_table[IP_VS_TCP_S_ESTABLISHED] / HZ; Index: linux-2.6-tip/net/netfilter/nf_conntrack_ftp.c =================================================================== --- linux-2.6-tip.orig/net/netfilter/nf_conntrack_ftp.c +++ linux-2.6-tip/net/netfilter/nf_conntrack_ftp.c @@ -588,3 +588,4 @@ static int __init nf_conntrack_ftp_init( module_init(nf_conntrack_ftp_init); module_exit(nf_conntrack_ftp_fini); + Index: linux-2.6-tip/net/netfilter/nf_conntrack_proto_sctp.c =================================================================== --- linux-2.6-tip.orig/net/netfilter/nf_conntrack_proto_sctp.c +++ linux-2.6-tip/net/netfilter/nf_conntrack_proto_sctp.c @@ -373,6 +373,9 @@ static int sctp_packet(struct nf_conn *c } write_unlock_bh(&sctp_lock); + if (new_state == SCTP_CONNTRACK_MAX) + goto out; + nf_ct_refresh_acct(ct, ctinfo, skb, sctp_timeouts[new_state]); if (old_state == SCTP_CONNTRACK_COOKIE_ECHOED && Index: linux-2.6-tip/net/packet/af_packet.c =================================================================== --- linux-2.6-tip.orig/net/packet/af_packet.c +++ linux-2.6-tip/net/packet/af_packet.c @@ -711,7 +711,7 @@ static int tpacket_rcv(struct sk_buff *s hdrlen = sizeof(*h.h2); break; default: - BUG(); + panic("AF_PACKET: bad tp->version"); } sll = h.raw + TPACKET_ALIGN(hdrlen); Index: linux-2.6-tip/net/rfkill/rfkill.c =================================================================== --- linux-2.6-tip.orig/net/rfkill/rfkill.c +++ linux-2.6-tip/net/rfkill/rfkill.c @@ -387,6 +387,7 @@ static const char *rfkill_get_type_str(e return "wwan"; default: BUG(); + return NULL; } } Index: linux-2.6-tip/net/sunrpc/svcauth_unix.c =================================================================== --- linux-2.6-tip.orig/net/sunrpc/svcauth_unix.c +++ linux-2.6-tip/net/sunrpc/svcauth_unix.c @@ -682,7 +682,7 @@ svcauth_unix_set_client(struct svc_rqst sin6 = svc_addr_in6(rqstp); break; default: - BUG(); + panic("svcauth_unix_set_client: bad address family!"); } rqstp->rq_client = NULL; @@ -863,3 +863,4 @@ struct auth_ops svcauth_unix = { .set_client = svcauth_unix_set_client, }; + Index: linux-2.6-tip/scripts/Makefile.build =================================================================== --- linux-2.6-tip.orig/scripts/Makefile.build +++ linux-2.6-tip/scripts/Makefile.build @@ -112,13 +112,13 @@ endif # --------------------------------------------------------------------------- # Default is built-in, unless we know otherwise -modkern_cflags := $(CFLAGS_KERNEL) +modkern_cflags = $(if $(part-of-module), $(CFLAGS_MODULE), $(CFLAGS_KERNEL)) quiet_modtag := $(empty) $(empty) -$(real-objs-m) : modkern_cflags := $(CFLAGS_MODULE) -$(real-objs-m:.o=.i) : modkern_cflags := $(CFLAGS_MODULE) -$(real-objs-m:.o=.s) : modkern_cflags := $(CFLAGS_MODULE) -$(real-objs-m:.o=.lst): modkern_cflags := $(CFLAGS_MODULE) +$(real-objs-m) : part-of-module := y +$(real-objs-m:.o=.i) : part-of-module := y +$(real-objs-m:.o=.s) : part-of-module := y +$(real-objs-m:.o=.lst): part-of-module := y $(real-objs-m) : quiet_modtag := [M] $(real-objs-m:.o=.i) : quiet_modtag := [M] @@ -205,7 +205,8 @@ endif ifdef CONFIG_FTRACE_MCOUNT_RECORD cmd_record_mcount = perl $(srctree)/scripts/recordmcount.pl "$(ARCH)" \ "$(if $(CONFIG_64BIT),64,32)" \ - "$(OBJDUMP)" "$(OBJCOPY)" "$(CC)" "$(LD)" "$(NM)" "$(RM)" "$(MV)" "$(@)"; + "$(OBJDUMP)" "$(OBJCOPY)" "$(CC)" "$(LD)" "$(NM)" "$(RM)" "$(MV)" \ + "$(if $(part-of-module),1,0)" "$(@)"; endif define rule_cc_o_c Index: linux-2.6-tip/scripts/gcc-x86_32-has-stack-protector.sh =================================================================== --- /dev/null +++ linux-2.6-tip/scripts/gcc-x86_32-has-stack-protector.sh @@ -0,0 +1,8 @@ +#!/bin/sh + +echo "int foo(void) { char X[200]; return 3; }" | $* -S -xc -c -O0 -fstack-protector - -o - 2> /dev/null | grep -q "%gs" +if [ "$?" -eq "0" ] ; then + echo y +else + echo n +fi Index: linux-2.6-tip/scripts/gcc-x86_64-has-stack-protector.sh =================================================================== --- linux-2.6-tip.orig/scripts/gcc-x86_64-has-stack-protector.sh +++ linux-2.6-tip/scripts/gcc-x86_64-has-stack-protector.sh @@ -1,6 +1,8 @@ #!/bin/sh -echo "int foo(void) { char X[200]; return 3; }" | $1 -S -xc -c -O0 -mcmodel=kernel -fstack-protector - -o - 2> /dev/null | grep -q "%gs" +echo "int foo(void) { char X[200]; return 3; }" | $* -S -xc -c -O0 -mcmodel=kernel -fstack-protector - -o - 2> /dev/null | grep -q "%gs" if [ "$?" -eq "0" ] ; then - echo $2 + echo y +else + echo n fi Index: linux-2.6-tip/scripts/headers_check.pl =================================================================== --- linux-2.6-tip.orig/scripts/headers_check.pl +++ linux-2.6-tip/scripts/headers_check.pl @@ -38,7 +38,7 @@ foreach my $file (@files) { &check_asm_types(); &check_sizetypes(); &check_prototypes(); - &check_config(); + # Dropped for now. Too much noise &check_config(); } close FH; } Index: linux-2.6-tip/scripts/mod/modpost.c =================================================================== --- linux-2.6-tip.orig/scripts/mod/modpost.c +++ linux-2.6-tip/scripts/mod/modpost.c @@ -415,8 +415,9 @@ static int parse_elf(struct elf_info *in const char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset; const char *secname; + int nobits = sechdrs[i].sh_type == SHT_NOBITS; - if (sechdrs[i].sh_offset > info->size) { + if (!nobits && sechdrs[i].sh_offset > info->size) { fatal("%s is truncated. sechdrs[i].sh_offset=%lu > " "sizeof(*hrd)=%zu\n", filename, (unsigned long)sechdrs[i].sh_offset, @@ -425,6 +426,8 @@ static int parse_elf(struct elf_info *in } secname = secstrings + sechdrs[i].sh_name; if (strcmp(secname, ".modinfo") == 0) { + if (nobits) + fatal("%s has NOBITS .modinfo\n", filename); info->modinfo = (void *)hdr + sechdrs[i].sh_offset; info->modinfo_len = sechdrs[i].sh_size; } else if (strcmp(secname, "__ksymtab") == 0) Index: linux-2.6-tip/scripts/recordmcount.pl =================================================================== --- linux-2.6-tip.orig/scripts/recordmcount.pl +++ linux-2.6-tip/scripts/recordmcount.pl @@ -100,14 +100,19 @@ $P =~ s@.*/@@g; my $V = '0.1'; -if ($#ARGV < 6) { - print "usage: $P arch objdump objcopy cc ld nm rm mv inputfile\n"; +if ($#ARGV < 7) { + print "usage: $P arch bits objdump objcopy cc ld nm rm mv is_module inputfile\n"; print "version: $V\n"; exit(1); } my ($arch, $bits, $objdump, $objcopy, $cc, - $ld, $nm, $rm, $mv, $inputfile) = @ARGV; + $ld, $nm, $rm, $mv, $is_module, $inputfile) = @ARGV; + +# This file refers to mcount and shouldn't be ftraced, so lets' ignore it +if ($inputfile eq "kernel/trace/ftrace.o") { + exit(0); +} # Acceptable sections to record. my %text_sections = ( @@ -201,6 +206,13 @@ if ($arch eq "x86_64") { $alignment = 2; $section_type = '%progbits'; +} elsif ($arch eq "ia64") { + $mcount_regex = "^\\s*([0-9a-fA-F]+):.*\\s_mcount\$"; + $type = "data8"; + + if ($is_module eq "0") { + $cc .= " -mconstant-gp"; + } } else { die "Arch $arch is not supported with CONFIG_FTRACE_MCOUNT_RECORD"; } @@ -263,7 +275,6 @@ if (!$found_version) { "\tDisabling local function references.\n"; } - # # Step 1: find all the local (static functions) and weak symbols. # 't' is local, 'w/W' is weak (we never use a weak function) @@ -331,13 +342,16 @@ sub update_funcs # # Step 2: find the sections and mcount call sites # -open(IN, "$objdump -dr $inputfile|") || die "error running $objdump"; +open(IN, "$objdump -hdr $inputfile|") || die "error running $objdump"; my $text; +my $read_headers = 1; + while () { # is it a section? if (/$section_regex/) { + $read_headers = 0; # Only record text sections that we know are safe if (defined($text_sections{$1})) { @@ -371,6 +385,19 @@ while () { $ref_func = $text; } } + } elsif ($read_headers && /$mcount_section/) { + # + # Somehow the make process can execute this script on an + # object twice. If it does, we would duplicate the mcount + # section and it will cause the function tracer self test + # to fail. Check if the mcount section exists, and if it does, + # warn and exit. + # + print STDERR "ERROR: $mcount_section already in $inputfile\n" . + "\tThis may be an indication that your build is corrupted.\n" . + "\tDelete $inputfile and try again. If the same object file\n" . + "\tstill causes an issue, then disable CONFIG_DYNAMIC_FTRACE.\n"; + exit(-1); } # is this a call site to mcount? If so, record it to print later Index: linux-2.6-tip/security/capability.c =================================================================== --- linux-2.6-tip.orig/security/capability.c +++ linux-2.6-tip/security/capability.c @@ -11,6 +11,7 @@ */ #include +#include static int cap_acct(struct file *file) { @@ -680,6 +681,9 @@ static int cap_socket_getpeersec_dgram(s static int cap_sk_alloc_security(struct sock *sk, int family, gfp_t priority) { +#ifdef CONFIG_SECURITY_NETWORK + sk->sk_security = NULL; +#endif return 0; } Index: linux-2.6-tip/security/keys/keyctl.c =================================================================== --- linux-2.6-tip.orig/security/keys/keyctl.c +++ linux-2.6-tip/security/keys/keyctl.c @@ -896,7 +896,7 @@ long keyctl_instantiate_key(key_serial_t { const struct cred *cred = current_cred(); struct request_key_auth *rka; - struct key *instkey, *dest_keyring; + struct key *instkey, *uninitialized_var(dest_keyring); void *payload; long ret; bool vm = false; @@ -974,7 +974,7 @@ long keyctl_negate_key(key_serial_t id, { const struct cred *cred = current_cred(); struct request_key_auth *rka; - struct key *instkey, *dest_keyring; + struct key *instkey, *uninitialized_var(dest_keyring); long ret; kenter("%d,%u,%d", id, timeout, ringid); Index: linux-2.6-tip/security/selinux/netnode.c =================================================================== --- linux-2.6-tip.orig/security/selinux/netnode.c +++ linux-2.6-tip/security/selinux/netnode.c @@ -140,6 +140,7 @@ static struct sel_netnode *sel_netnode_f break; default: BUG(); + return NULL; } list_for_each_entry_rcu(node, &sel_netnode_hash[idx].list, list) Index: linux-2.6-tip/sound/drivers/Kconfig =================================================================== --- linux-2.6-tip.orig/sound/drivers/Kconfig +++ linux-2.6-tip/sound/drivers/Kconfig @@ -33,7 +33,7 @@ if SND_DRIVERS config SND_PCSP tristate "PC-Speaker support (READ HELP!)" - depends on PCSPKR_PLATFORM && X86_PC && HIGH_RES_TIMERS + depends on PCSPKR_PLATFORM && X86 && HIGH_RES_TIMERS depends on INPUT depends on EXPERIMENTAL select SND_PCM @@ -91,6 +91,8 @@ config SND_VIRMIDI config SND_MTPAV tristate "MOTU MidiTimePiece AV multiport MIDI" + # sometimes crashes + depends on 0 select SND_RAWMIDI help To use a MOTU MidiTimePiece AV multiport MIDI adapter Index: linux-2.6-tip/sound/isa/sb/sb8.c =================================================================== --- linux-2.6-tip.orig/sound/isa/sb/sb8.c +++ linux-2.6-tip/sound/isa/sb/sb8.c @@ -101,7 +101,7 @@ static int __devinit snd_sb8_probe(struc struct snd_card *card; struct snd_sb8 *acard; struct snd_opl3 *opl3; - int err; + int uninitialized_var(err); card = snd_card_new(index[dev], id[dev], THIS_MODULE, sizeof(struct snd_sb8)); Index: linux-2.6-tip/sound/oss/ad1848.c =================================================================== --- linux-2.6-tip.orig/sound/oss/ad1848.c +++ linux-2.6-tip/sound/oss/ad1848.c @@ -2879,7 +2879,7 @@ static struct isapnp_device_id id_table[ {0} }; -MODULE_DEVICE_TABLE(isapnp, id_table); +MODULE_STATIC_DEVICE_TABLE(isapnp, id_table); static struct pnp_dev *activate_dev(char *devname, char *resname, struct pnp_dev *dev) { Index: linux-2.6-tip/sound/pci/pcxhr/pcxhr.c =================================================================== --- linux-2.6-tip.orig/sound/pci/pcxhr/pcxhr.c +++ linux-2.6-tip/sound/pci/pcxhr/pcxhr.c @@ -224,7 +224,7 @@ static int pcxhr_pll_freq_register(unsig static int pcxhr_get_clock_reg(struct pcxhr_mgr *mgr, unsigned int rate, unsigned int *reg, unsigned int *freq) { - unsigned int val, realfreq, pllreg; + unsigned int val, realfreq, uninitialized_var(pllreg); struct pcxhr_rmh rmh; int err; @@ -298,7 +298,9 @@ static int pcxhr_sub_set_clock(struct pc unsigned int rate, int *changed) { - unsigned int val, realfreq, speed; + unsigned int uninitialized_var(val), + uninitialized_var(realfreq), + speed; struct pcxhr_rmh rmh; int err; @@ -681,7 +683,7 @@ static void pcxhr_trigger_tasklet(unsign { unsigned long flags; int i, j, err; - struct pcxhr_pipe *pipe; + struct pcxhr_pipe *uninitialized_var(pipe); struct snd_pcxhr *chip; struct pcxhr_mgr *mgr = (struct pcxhr_mgr*)(arg); int capture_mask = 0; Index: linux-2.6-tip/sound/pci/pcxhr/pcxhr_mixer.c =================================================================== --- linux-2.6-tip.orig/sound/pci/pcxhr/pcxhr_mixer.c +++ linux-2.6-tip/sound/pci/pcxhr/pcxhr_mixer.c @@ -936,7 +936,7 @@ static int pcxhr_iec958_get(struct snd_k struct snd_ctl_elem_value *ucontrol) { struct snd_pcxhr *chip = snd_kcontrol_chip(kcontrol); - unsigned char aes_bits; + unsigned char uninitialized_var(aes_bits); int i, err; mutex_lock(&chip->mgr->mixer_mutex); @@ -1264,3 +1264,4 @@ int pcxhr_create_mixer(struct pcxhr_mgr return 0; } + Index: linux-2.6-tip/sound/pci/via82xx.c =================================================================== --- linux-2.6-tip.orig/sound/pci/via82xx.c +++ linux-2.6-tip/sound/pci/via82xx.c @@ -2428,7 +2428,7 @@ static int __devinit snd_via82xx_probe(s const struct pci_device_id *pci_id) { struct snd_card *card; - struct via82xx *chip; + struct via82xx *uninitialized_var(chip); int chip_type = 0, card_type; unsigned int i; int err; Index: linux-2.6-tip/sound/pci/via82xx_modem.c =================================================================== --- linux-2.6-tip.orig/sound/pci/via82xx_modem.c +++ linux-2.6-tip/sound/pci/via82xx_modem.c @@ -1162,7 +1162,7 @@ static int __devinit snd_via82xx_probe(s const struct pci_device_id *pci_id) { struct snd_card *card; - struct via82xx_modem *chip; + struct via82xx_modem *uninitialized_var(chip); int chip_type = 0, card_type; unsigned int i; int err; Index: linux-2.6-tip/sound/pci/vx222/vx222.c =================================================================== --- linux-2.6-tip.orig/sound/pci/vx222/vx222.c +++ linux-2.6-tip/sound/pci/vx222/vx222.c @@ -194,7 +194,7 @@ static int __devinit snd_vx222_probe(str static int dev; struct snd_card *card; struct snd_vx_hardware *hw; - struct snd_vx222 *vx; + struct snd_vx222 *uninitialized_var(vx); int err; if (dev >= SNDRV_CARDS) Index: linux-2.6-tip/arch/mn10300/Kconfig =================================================================== --- linux-2.6-tip.orig/arch/mn10300/Kconfig +++ linux-2.6-tip/arch/mn10300/Kconfig @@ -186,6 +186,17 @@ config PREEMPT Say Y here if you are building a kernel for a desktop, embedded or real-time system. Say N if you are unsure. +config PREEMPT_BKL + bool "Preempt The Big Kernel Lock" + depends on PREEMPT + default y + help + This option reduces the latency of the kernel by making the + big kernel lock preemptible. + + Say Y here if you are building a kernel for a desktop system. + Say N if you are unsure. + config MN10300_CURRENT_IN_E2 bool "Hold current task address in E2 register" default y Index: linux-2.6-tip/lib/kernel_lock.c =================================================================== --- linux-2.6-tip.orig/lib/kernel_lock.c +++ linux-2.6-tip/lib/kernel_lock.c @@ -11,121 +11,90 @@ #include /* - * The 'big kernel lock' + * The 'big kernel semaphore' * - * This spinlock is taken and released recursively by lock_kernel() + * This mutex is taken and released recursively by lock_kernel() * and unlock_kernel(). It is transparently dropped and reacquired * over schedule(). It is used to protect legacy code that hasn't * been migrated to a proper locking design yet. * + * Note: code locked by this semaphore will only be serialized against + * other code using the same locking facility. The code guarantees that + * the task remains on the same CPU. + * * Don't use in new code. */ -static __cacheline_aligned_in_smp DEFINE_SPINLOCK(kernel_flag); - +DECLARE_MUTEX(kernel_sem); /* - * Acquire/release the underlying lock from the scheduler. + * Re-acquire the kernel semaphore. * - * This is called with preemption disabled, and should - * return an error value if it cannot get the lock and - * TIF_NEED_RESCHED gets set. + * This function is called with preemption off. * - * If it successfully gets the lock, it should increment - * the preemption count like any spinlock does. + * We are executing in schedule() so the code must be extremely careful + * about recursion, both due to the down() and due to the enabling of + * preemption. schedule() will re-check the preemption flag after + * reacquiring the semaphore. * - * (This works on UP too - _raw_spin_trylock will never - * return false in that case) + * Called with interrupts disabled. */ int __lockfunc __reacquire_kernel_lock(void) { - while (!_raw_spin_trylock(&kernel_flag)) { - if (test_thread_flag(TIF_NEED_RESCHED)) - return -EAGAIN; - cpu_relax(); - } - preempt_disable(); + struct task_struct *task = current; + int saved_lock_depth = task->lock_depth; + + local_irq_enable(); + BUG_ON(saved_lock_depth < 0); + + task->lock_depth = -1; + + down(&kernel_sem); + + task->lock_depth = saved_lock_depth; + + local_irq_disable(); + return 0; } void __lockfunc __release_kernel_lock(void) { - _raw_spin_unlock(&kernel_flag); - preempt_enable_no_resched(); + up(&kernel_sem); } /* - * These are the BKL spinlocks - we try to be polite about preemption. - * If SMP is not on (ie UP preemption), this all goes away because the - * _raw_spin_trylock() will always succeed. + * Getting the big kernel semaphore. */ -#ifdef CONFIG_PREEMPT -static inline void __lock_kernel(void) +void __lockfunc lock_kernel(void) { - preempt_disable(); - if (unlikely(!_raw_spin_trylock(&kernel_flag))) { - /* - * If preemption was disabled even before this - * was called, there's nothing we can be polite - * about - just spin. - */ - if (preempt_count() > 1) { - _raw_spin_lock(&kernel_flag); - return; - } + struct task_struct *task = current; + int depth = task->lock_depth + 1; + if (likely(!depth)) { /* - * Otherwise, let's wait for the kernel lock - * with preemption enabled.. + * No recursion worries - we set up lock_depth _after_ */ - do { - preempt_enable(); - while (spin_is_locked(&kernel_flag)) - cpu_relax(); - preempt_disable(); - } while (!_raw_spin_trylock(&kernel_flag)); + down(&kernel_sem); +#ifdef CONFIG_DEBUG_RT_MUTEXES + current->last_kernel_lock = __builtin_return_address(0); +#endif } -} -#else - -/* - * Non-preemption case - just get the spinlock - */ -static inline void __lock_kernel(void) -{ - _raw_spin_lock(&kernel_flag); + task->lock_depth = depth; } -#endif -static inline void __unlock_kernel(void) +void __lockfunc unlock_kernel(void) { - /* - * the BKL is not covered by lockdep, so we open-code the - * unlocking sequence (and thus avoid the dep-chain ops): - */ - _raw_spin_unlock(&kernel_flag); - preempt_enable(); -} + struct task_struct *task = current; -/* - * Getting the big kernel lock. - * - * This cannot happen asynchronously, so we only need to - * worry about other CPU's. - */ -void __lockfunc lock_kernel(void) -{ - int depth = current->lock_depth+1; - if (likely(!depth)) - __lock_kernel(); - current->lock_depth = depth; -} + BUG_ON(task->lock_depth < 0); -void __lockfunc unlock_kernel(void) -{ - BUG_ON(current->lock_depth < 0); - if (likely(--current->lock_depth < 0)) - __unlock_kernel(); + if (likely(--task->lock_depth == -1)) { +#ifdef CONFIG_DEBUG_RT_MUTEXES + current->last_kernel_lock = NULL; +#endif + up(&kernel_sem); + } } EXPORT_SYMBOL(lock_kernel); Index: linux-2.6-tip/kernel/posix-timers.c =================================================================== --- linux-2.6-tip.orig/kernel/posix-timers.c +++ linux-2.6-tip/kernel/posix-timers.c @@ -420,6 +420,7 @@ static enum hrtimer_restart posix_timer_ static struct pid *good_sigevent(sigevent_t * event) { struct task_struct *rtn = current->group_leader; + int sig = event->sigev_signo; if ((event->sigev_notify & SIGEV_THREAD_ID ) && (!(rtn = find_task_by_vpid(event->sigev_notify_thread_id)) || @@ -428,7 +429,8 @@ static struct pid *good_sigevent(sigeven return NULL; if (((event->sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE) && - ((event->sigev_signo <= 0) || (event->sigev_signo > SIGRTMAX))) + (sig <= 0 || sig > SIGRTMAX || sig_kernel_only(sig) || + sig_kernel_coredump(sig))) return NULL; return task_pid(rtn); @@ -787,6 +789,7 @@ retry: unlock_timer(timr, flag); if (error == TIMER_RETRY) { + hrtimer_wait_for_timer(&timr->it.real.timer); rtn = NULL; // We already got the old time... goto retry; } @@ -825,6 +828,7 @@ retry_delete: if (timer_delete_hook(timer) == TIMER_RETRY) { unlock_timer(timer, flags); + hrtimer_wait_for_timer(&timer->it.real.timer); goto retry_delete; } @@ -854,6 +858,7 @@ retry_delete: if (timer_delete_hook(timer) == TIMER_RETRY) { unlock_timer(timer, flags); + hrtimer_wait_for_timer(&timer->it.real.timer); goto retry_delete; } list_del(&timer->list); Index: linux-2.6-tip/kernel/smp.c =================================================================== --- linux-2.6-tip.orig/kernel/smp.c +++ linux-2.6-tip/kernel/smp.c @@ -10,32 +10,73 @@ #include #include #include +#include static DEFINE_PER_CPU(struct call_single_queue, call_single_queue); -static LIST_HEAD(call_function_queue); -__cacheline_aligned_in_smp DEFINE_SPINLOCK(call_function_lock); + +static struct { + struct list_head queue; + raw_spinlock_t lock; +} call_function __cacheline_aligned_in_smp = { + .queue = LIST_HEAD_INIT(call_function.queue), + .lock = RAW_SPIN_LOCK_UNLOCKED(call_function.lock), +}; enum { - CSD_FLAG_WAIT = 0x01, - CSD_FLAG_ALLOC = 0x02, - CSD_FLAG_LOCK = 0x04, + CSD_FLAG_LOCK = 0x01, }; struct call_function_data { struct call_single_data csd; - spinlock_t lock; + raw_spinlock_t lock; unsigned int refs; - struct rcu_head rcu_head; - unsigned long cpumask_bits[]; + cpumask_var_t cpumask; }; struct call_single_queue { struct list_head list; - spinlock_t lock; + raw_spinlock_t lock; +}; + +static DEFINE_PER_CPU(struct call_function_data, cfd_data) = { + .lock = RAW_SPIN_LOCK_UNLOCKED(cfd_data.lock), +}; + +static int +hotplug_cfd(struct notifier_block *nfb, unsigned long action, void *hcpu) +{ + long cpu = (long)hcpu; + struct call_function_data *cfd = &per_cpu(cfd_data, cpu); + + switch (action) { + case CPU_UP_PREPARE: + case CPU_UP_PREPARE_FROZEN: + if (!alloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL, + cpu_to_node(cpu))) + return NOTIFY_BAD; + break; + +#ifdef CONFIG_CPU_HOTPLUG + case CPU_UP_CANCELED: + case CPU_UP_CANCELED_FROZEN: + + case CPU_DEAD: + case CPU_DEAD_FROZEN: + free_cpumask_var(cfd->cpumask); + break; +#endif + }; + + return NOTIFY_OK; +} + +static struct notifier_block __cpuinitdata hotplug_cfd_notifier = { + .notifier_call = hotplug_cfd, }; static int __cpuinit init_call_single_data(void) { + void *cpu = (void *)(long)smp_processor_id(); int i; for_each_possible_cpu(i) { @@ -44,29 +85,53 @@ static int __cpuinit init_call_single_da spin_lock_init(&q->lock); INIT_LIST_HEAD(&q->list); } + + hotplug_cfd(&hotplug_cfd_notifier, CPU_UP_PREPARE, cpu); + register_cpu_notifier(&hotplug_cfd_notifier); + return 0; } early_initcall(init_call_single_data); -static void csd_flag_wait(struct call_single_data *data) +/* + * csd_lock/csd_unlock used to serialize access to per-cpu csd resources + * + * For non-synchronous ipi calls the csd can still be in use by the previous + * function call. For multi-cpu calls its even more interesting as we'll have + * to ensure no other cpu is observing our csd. + */ +static void csd_lock_wait(struct call_single_data *data) { - /* Wait for response */ - do { - if (!(data->flags & CSD_FLAG_WAIT)) - break; + while (data->flags & CSD_FLAG_LOCK) cpu_relax(); - } while (1); +} + +static void csd_lock(struct call_single_data *data) +{ + csd_lock_wait(data); + data->flags = CSD_FLAG_LOCK; +} + +static void csd_unlock(struct call_single_data *data) +{ + WARN_ON(!(data->flags & CSD_FLAG_LOCK)); + /* + * ensure we're all done before releasing data + */ + smp_mb(); + data->flags &= ~CSD_FLAG_LOCK; } /* * Insert a previously allocated call_single_data element for execution * on the given CPU. data must already have ->func, ->info, and ->flags set. */ -static void generic_exec_single(int cpu, struct call_single_data *data) +static +void generic_exec_single(int cpu, struct call_single_data *data, int wait) { struct call_single_queue *dst = &per_cpu(call_single_queue, cpu); - int wait = data->flags & CSD_FLAG_WAIT, ipi; unsigned long flags; + int ipi; spin_lock_irqsave(&dst->lock, flags); ipi = list_empty(&dst->list); @@ -74,24 +139,22 @@ static void generic_exec_single(int cpu, spin_unlock_irqrestore(&dst->lock, flags); /* - * Make the list addition visible before sending the ipi. + * The list addition should be visible before sending the IPI + * handler locks the list to pull the entry off it because of + * normal cache coherency rules implied by spinlocks. + * + * If IPIs can go out of order to the cache coherency protocol + * in an architecture, sufficient synchronisation should be added + * to arch code to make it appear to obey cache coherency WRT + * locking and barrier primitives. Generic code isn't really equipped + * to do the right thing... */ - smp_mb(); if (ipi) arch_send_call_function_single_ipi(cpu); if (wait) - csd_flag_wait(data); -} - -static void rcu_free_call_data(struct rcu_head *head) -{ - struct call_function_data *data; - - data = container_of(head, struct call_function_data, rcu_head); - - kfree(data); + csd_lock_wait(data); } /* @@ -104,44 +167,45 @@ void generic_smp_call_function_interrupt int cpu = get_cpu(); /* + * Ensure entry is visible on call_function_queue after we have + * entered the IPI. See comment in smp_call_function_many. + * If we don't have this, then we may miss an entry on the list + * and never get another IPI to process it. + */ + smp_mb(); + + /* * It's ok to use list_for_each_rcu() here even though we may delete * 'pos', since list_del_rcu() doesn't clear ->next */ - rcu_read_lock(); - list_for_each_entry_rcu(data, &call_function_queue, csd.list) { + list_for_each_entry_rcu(data, &call_function.queue, csd.list) { int refs; - if (!cpumask_test_cpu(cpu, to_cpumask(data->cpumask_bits))) + spin_lock(&data->lock); + if (!cpumask_test_cpu(cpu, data->cpumask)) { + spin_unlock(&data->lock); continue; + } + cpumask_clear_cpu(cpu, data->cpumask); + spin_unlock(&data->lock); data->csd.func(data->csd.info); spin_lock(&data->lock); - cpumask_clear_cpu(cpu, to_cpumask(data->cpumask_bits)); WARN_ON(data->refs == 0); - data->refs--; - refs = data->refs; + refs = --data->refs; + if (!refs) { + spin_lock(&call_function.lock); + list_del_rcu(&data->csd.list); + spin_unlock(&call_function.lock); + } spin_unlock(&data->lock); if (refs) continue; - spin_lock(&call_function_lock); - list_del_rcu(&data->csd.list); - spin_unlock(&call_function_lock); - - if (data->csd.flags & CSD_FLAG_WAIT) { - /* - * serialize stores to data with the flag clear - * and wakeup - */ - smp_wmb(); - data->csd.flags &= ~CSD_FLAG_WAIT; - } - if (data->csd.flags & CSD_FLAG_ALLOC) - call_rcu(&data->rcu_head, rcu_free_call_data); + csd_unlock(&data->csd); } - rcu_read_unlock(); put_cpu(); } @@ -154,49 +218,34 @@ void generic_smp_call_function_single_in { struct call_single_queue *q = &__get_cpu_var(call_single_queue); LIST_HEAD(list); + unsigned int data_flags; + + spin_lock(&q->lock); + list_replace_init(&q->list, &list); + spin_unlock(&q->lock); + + while (!list_empty(&list)) { + struct call_single_data *data; + + data = list_entry(list.next, struct call_single_data, + list); + list_del(&data->list); + + /* + * 'data' can be invalid after this call if + * flags == 0 (when called through + * generic_exec_single(), so save them away before + * making the call. + */ + data_flags = data->flags; + + data->func(data->info); - /* - * Need to see other stores to list head for checking whether - * list is empty without holding q->lock - */ - smp_read_barrier_depends(); - while (!list_empty(&q->list)) { - unsigned int data_flags; - - spin_lock(&q->lock); - list_replace_init(&q->list, &list); - spin_unlock(&q->lock); - - while (!list_empty(&list)) { - struct call_single_data *data; - - data = list_entry(list.next, struct call_single_data, - list); - list_del(&data->list); - - /* - * 'data' can be invalid after this call if - * flags == 0 (when called through - * generic_exec_single(), so save them away before - * making the call. - */ - data_flags = data->flags; - - data->func(data->info); - - if (data_flags & CSD_FLAG_WAIT) { - smp_wmb(); - data->flags &= ~CSD_FLAG_WAIT; - } else if (data_flags & CSD_FLAG_LOCK) { - smp_wmb(); - data->flags &= ~CSD_FLAG_LOCK; - } else if (data_flags & CSD_FLAG_ALLOC) - kfree(data); - } /* - * See comment on outer loop + * Unlocked CSDs are valid through generic_exec_single() */ - smp_read_barrier_depends(); + if (data_flags & CSD_FLAG_LOCK) + csd_unlock(data); } } @@ -215,7 +264,9 @@ static DEFINE_PER_CPU(struct call_single int smp_call_function_single(int cpu, void (*func) (void *info), void *info, int wait) { - struct call_single_data d; + struct call_single_data d = { + .flags = 0, + }; unsigned long flags; /* prevent preemption and reschedule on another processor, as well as CPU removal */ @@ -230,45 +281,16 @@ int smp_call_function_single(int cpu, vo func(info); local_irq_restore(flags); } else if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) { - struct call_single_data *data; + struct call_single_data *data = &d; - if (!wait) { - /* - * We are calling a function on a single CPU - * and we are not going to wait for it to finish. - * We first try to allocate the data, but if we - * fail, we fall back to use a per cpu data to pass - * the information to that CPU. Since all callers - * of this code will use the same data, we must - * synchronize the callers to prevent a new caller - * from corrupting the data before the callee - * can access it. - * - * The CSD_FLAG_LOCK is used to let us know when - * the IPI handler is done with the data. - * The first caller will set it, and the callee - * will clear it. The next caller must wait for - * it to clear before we set it again. This - * will make sure the callee is done with the - * data before a new caller will use it. - */ - data = kmalloc(sizeof(*data), GFP_ATOMIC); - if (data) - data->flags = CSD_FLAG_ALLOC; - else { - data = &per_cpu(csd_data, me); - while (data->flags & CSD_FLAG_LOCK) - cpu_relax(); - data->flags = CSD_FLAG_LOCK; - } - } else { - data = &d; - data->flags = CSD_FLAG_WAIT; - } + if (!wait) + data = &per_cpu(csd_data, me); + + csd_lock(data); data->func = func; data->info = info; - generic_exec_single(cpu, data); + generic_exec_single(cpu, data, wait); } else { err = -ENXIO; /* CPU not online */ } @@ -288,12 +310,15 @@ EXPORT_SYMBOL(smp_call_function_single); * instance. * */ -void __smp_call_function_single(int cpu, struct call_single_data *data) +void __smp_call_function_single(int cpu, struct call_single_data *data, + int wait) { + csd_lock(data); + /* Can deadlock when called with interrupts disabled */ - WARN_ON((data->flags & CSD_FLAG_WAIT) && irqs_disabled()); + WARN_ON(wait && irqs_disabled()); - generic_exec_single(cpu, data); + generic_exec_single(cpu, data, wait); } /* FIXME: Shim for archs using old arch_send_call_function_ipi API. */ @@ -323,14 +348,14 @@ void smp_call_function_many(const struct { struct call_function_data *data; unsigned long flags; - int cpu, next_cpu; + int cpu, next_cpu, me = smp_processor_id(); /* Can deadlock when called with interrupts disabled */ WARN_ON(irqs_disabled()); /* So, what's a CPU they want? Ignoring this one. */ cpu = cpumask_first_and(mask, cpu_online_mask); - if (cpu == smp_processor_id()) + if (cpu == me) cpu = cpumask_next_and(cpu, mask, cpu_online_mask); /* No online cpus? We're done. */ if (cpu >= nr_cpu_ids) @@ -338,7 +363,7 @@ void smp_call_function_many(const struct /* Do we have another CPU which isn't us? */ next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask); - if (next_cpu == smp_processor_id()) + if (next_cpu == me) next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask); /* Fastpath: do that cpu by itself. */ @@ -347,43 +372,39 @@ void smp_call_function_many(const struct return; } - data = kmalloc(sizeof(*data) + cpumask_size(), GFP_ATOMIC); - if (unlikely(!data)) { - /* Slow path. */ - for_each_online_cpu(cpu) { - if (cpu == smp_processor_id()) - continue; - if (cpumask_test_cpu(cpu, mask)) - smp_call_function_single(cpu, func, info, wait); - } - return; - } + data = &per_cpu(cfd_data, me); + csd_lock(&data->csd); - spin_lock_init(&data->lock); - data->csd.flags = CSD_FLAG_ALLOC; - if (wait) - data->csd.flags |= CSD_FLAG_WAIT; + spin_lock_irqsave(&data->lock, flags); data->csd.func = func; data->csd.info = info; - cpumask_and(to_cpumask(data->cpumask_bits), mask, cpu_online_mask); - cpumask_clear_cpu(smp_processor_id(), to_cpumask(data->cpumask_bits)); - data->refs = cpumask_weight(to_cpumask(data->cpumask_bits)); - - spin_lock_irqsave(&call_function_lock, flags); - list_add_tail_rcu(&data->csd.list, &call_function_queue); - spin_unlock_irqrestore(&call_function_lock, flags); + cpumask_and(data->cpumask, mask, cpu_online_mask); + cpumask_clear_cpu(me, data->cpumask); + data->refs = cpumask_weight(data->cpumask); + + spin_lock(&call_function.lock); + /* + * Place entry at the _HEAD_ of the list, so that any cpu still + * observing the entry in generic_smp_call_function_interrupt() will + * not miss any other list entries. + */ + list_add_rcu(&data->csd.list, &call_function.queue); + spin_unlock(&call_function.lock); + spin_unlock_irqrestore(&data->lock, flags); /* * Make the list addition visible before sending the ipi. + * (IPIs must obey or appear to obey normal Linux cache coherency + * rules -- see comment in generic_exec_single). */ smp_mb(); /* Send a message to all CPUs in the map */ - arch_send_call_function_ipi_mask(to_cpumask(data->cpumask_bits)); + arch_send_call_function_ipi_mask(data->cpumask); /* optionally wait for the CPUs to complete */ if (wait) - csd_flag_wait(&data->csd); + csd_lock_wait(&data->csd); } EXPORT_SYMBOL(smp_call_function_many); @@ -413,20 +434,20 @@ EXPORT_SYMBOL(smp_call_function); void ipi_call_lock(void) { - spin_lock(&call_function_lock); + spin_lock(&call_function.lock); } void ipi_call_unlock(void) { - spin_unlock(&call_function_lock); + spin_unlock(&call_function.lock); } void ipi_call_lock_irq(void) { - spin_lock_irq(&call_function_lock); + spin_lock_irq(&call_function.lock); } void ipi_call_unlock_irq(void) { - spin_unlock_irq(&call_function_lock); + spin_unlock_irq(&call_function.lock); } Index: linux-2.6-tip/block/blk-softirq.c =================================================================== --- linux-2.6-tip.orig/block/blk-softirq.c +++ linux-2.6-tip/block/blk-softirq.c @@ -64,7 +64,7 @@ static int raise_blk_irq(int cpu, struct data->info = rq; data->flags = 0; - __smp_call_function_single(cpu, data); + __smp_call_function_single(cpu, data, 0); return 0; } Index: linux-2.6-tip/include/linux/srcu.h =================================================================== --- linux-2.6-tip.orig/include/linux/srcu.h +++ linux-2.6-tip/include/linux/srcu.h @@ -27,6 +27,8 @@ #ifndef _LINUX_SRCU_H #define _LINUX_SRCU_H +#include + struct srcu_struct_array { int c[2]; }; @@ -50,4 +52,24 @@ void srcu_read_unlock(struct srcu_struct void synchronize_srcu(struct srcu_struct *sp); long srcu_batches_completed(struct srcu_struct *sp); +/* + * fully compatible with srcu, but optimized for writers. + */ + +struct qrcu_struct { + int completed; + atomic_t ctr[2]; + wait_queue_head_t wq; + struct mutex mutex; +}; + +int init_qrcu_struct(struct qrcu_struct *qp); +int qrcu_read_lock(struct qrcu_struct *qp); +void qrcu_read_unlock(struct qrcu_struct *qp, int idx); +void synchronize_qrcu(struct qrcu_struct *qp); + +static inline void cleanup_qrcu_struct(struct qrcu_struct *qp) +{ +} + #endif Index: linux-2.6-tip/kernel/srcu.c =================================================================== --- linux-2.6-tip.orig/kernel/srcu.c +++ linux-2.6-tip/kernel/srcu.c @@ -255,3 +255,89 @@ EXPORT_SYMBOL_GPL(srcu_read_lock); EXPORT_SYMBOL_GPL(srcu_read_unlock); EXPORT_SYMBOL_GPL(synchronize_srcu); EXPORT_SYMBOL_GPL(srcu_batches_completed); + +int init_qrcu_struct(struct qrcu_struct *qp) +{ + qp->completed = 0; + atomic_set(qp->ctr + 0, 1); + atomic_set(qp->ctr + 1, 0); + init_waitqueue_head(&qp->wq); + mutex_init(&qp->mutex); + + return 0; +} + +int qrcu_read_lock(struct qrcu_struct *qp) +{ + for (;;) { + int idx = qp->completed & 0x1; + if (likely(atomic_inc_not_zero(qp->ctr + idx))) + return idx; + } +} + +void qrcu_read_unlock(struct qrcu_struct *qp, int idx) +{ + if (atomic_dec_and_test(qp->ctr + idx)) + wake_up(&qp->wq); +} + +void synchronize_qrcu(struct qrcu_struct *qp) +{ + int idx; + + smp_mb(); /* Force preceding change to happen before fastpath check. */ + + /* + * Fastpath: If the two counters sum to "1" at a given point in + * time, there are no readers. However, it takes two separate + * loads to sample both counters, which won't occur simultaneously. + * So we might race with a counter switch, so that we might see + * ctr[0]==0, then the counter might switch, then we might see + * ctr[1]==1 (unbeknownst to us because there is a reader still + * there). So we do a read memory barrier and recheck. If the + * same race happens again, there must have been a second counter + * switch. This second counter switch could not have happened + * until all preceding readers finished, so if the condition + * is true both times, we may safely proceed. + * + * This relies critically on the atomic increment and atomic + * decrement being seen as executing in order. + */ + + if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) { + smp_rmb(); /* Keep two checks independent. */ + if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) + goto out; + } + + mutex_lock(&qp->mutex); + + idx = qp->completed & 0x1; + if (atomic_read(qp->ctr + idx) == 1) + goto out_unlock; + + atomic_inc(qp->ctr + (idx ^ 0x1)); + + /* + * Prevent subsequent decrement from being seen before previous + * increment -- such an inversion could cause the fastpath + * above to falsely conclude that there were no readers. Also, + * reduce the likelihood that qrcu_read_lock() will loop. + */ + + smp_mb__after_atomic_inc(); + qp->completed++; + + atomic_dec(qp->ctr + idx); + __wait_event(qp->wq, !atomic_read(qp->ctr + idx)); +out_unlock: + mutex_unlock(&qp->mutex); +out: + smp_mb(); /* force subsequent free after qrcu_read_unlock(). */ +} + +EXPORT_SYMBOL_GPL(init_qrcu_struct); +EXPORT_SYMBOL_GPL(qrcu_read_lock); +EXPORT_SYMBOL_GPL(qrcu_read_unlock); +EXPORT_SYMBOL_GPL(synchronize_qrcu); Index: linux-2.6-tip/drivers/net/sungem.c =================================================================== --- linux-2.6-tip.orig/drivers/net/sungem.c +++ linux-2.6-tip/drivers/net/sungem.c @@ -1032,10 +1032,8 @@ static int gem_start_xmit(struct sk_buff (csum_stuff_off << 21)); } - local_irq_save(flags); - if (!spin_trylock(&gp->tx_lock)) { + if (!spin_trylock_irqsave(&gp->tx_lock, flags)) { /* Tell upper layer to requeue */ - local_irq_restore(flags); return NETDEV_TX_LOCKED; } /* We raced with gem_do_stop() */ Index: linux-2.6-tip/arch/x86/kernel/tsc_sync.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/tsc_sync.c +++ linux-2.6-tip/arch/x86/kernel/tsc_sync.c @@ -33,7 +33,7 @@ static __cpuinitdata atomic_t stop_count * we want to have the fastest, inlined, non-debug version * of a critical section, to be able to prove TSC time-warps: */ -static __cpuinitdata raw_spinlock_t sync_lock = __RAW_SPIN_LOCK_UNLOCKED; +static __cpuinitdata __raw_spinlock_t sync_lock = __RAW_SPIN_LOCK_UNLOCKED; static __cpuinitdata cycles_t last_tsc; static __cpuinitdata cycles_t max_warp; static __cpuinitdata int nr_warps; @@ -103,6 +103,7 @@ static __cpuinit void check_tsc_warp(voi */ void __cpuinit check_tsc_sync_source(int cpu) { + unsigned long flags; int cpus = 2; /* @@ -129,8 +130,11 @@ void __cpuinit check_tsc_sync_source(int /* * Wait for the target to arrive: */ + local_save_flags(flags); + local_irq_enable(); while (atomic_read(&start_count) != cpus-1) cpu_relax(); + local_irq_restore(flags); /* * Trigger the target to continue into the measurement too: */ Index: linux-2.6-tip/drivers/input/keyboard/atkbd.c =================================================================== --- linux-2.6-tip.orig/drivers/input/keyboard/atkbd.c +++ linux-2.6-tip/drivers/input/keyboard/atkbd.c @@ -1556,8 +1556,23 @@ static struct dmi_system_id atkbd_dmi_qu { } }; +static int __read_mostly noatkbd; + +static int __init noatkbd_setup(char *str) +{ + noatkbd = 1; + printk(KERN_INFO "debug: not setting up AT keyboard.\n"); + + return 1; +} + +__setup("noatkbd", noatkbd_setup); + static int __init atkbd_init(void) { + if (noatkbd) + return 0; + dmi_check_system(atkbd_dmi_quirk_table); return serio_register_driver(&atkbd_drv); Index: linux-2.6-tip/drivers/input/mouse/psmouse-base.c =================================================================== --- linux-2.6-tip.orig/drivers/input/mouse/psmouse-base.c +++ linux-2.6-tip/drivers/input/mouse/psmouse-base.c @@ -1645,10 +1645,25 @@ static int psmouse_get_maxproto(char *bu return sprintf(buffer, "%s\n", psmouse_protocol_by_type(type)->name); } +static int __read_mostly nopsmouse; + +static int __init nopsmouse_setup(char *str) +{ + nopsmouse = 1; + printk(KERN_INFO "debug: not setting up psmouse.\n"); + + return 1; +} + +__setup("nopsmouse", nopsmouse_setup); + static int __init psmouse_init(void) { int err; + if (nopsmouse) + return 0; + kpsmoused_wq = create_singlethread_workqueue("kpsmoused"); if (!kpsmoused_wq) { printk(KERN_ERR "psmouse: failed to create kpsmoused workqueue\n"); Index: linux-2.6-tip/kernel/rtmutex-debug.h =================================================================== --- linux-2.6-tip.orig/kernel/rtmutex-debug.h +++ linux-2.6-tip/kernel/rtmutex-debug.h @@ -17,17 +17,17 @@ extern void debug_rt_mutex_free_waiter(s extern void debug_rt_mutex_init(struct rt_mutex *lock, const char *name); extern void debug_rt_mutex_lock(struct rt_mutex *lock); extern void debug_rt_mutex_unlock(struct rt_mutex *lock); -extern void debug_rt_mutex_proxy_lock(struct rt_mutex *lock, - struct task_struct *powner); +extern void +debug_rt_mutex_proxy_lock(struct rt_mutex *lock, struct task_struct *powner); extern void debug_rt_mutex_proxy_unlock(struct rt_mutex *lock); extern void debug_rt_mutex_deadlock(int detect, struct rt_mutex_waiter *waiter, struct rt_mutex *lock); extern void debug_rt_mutex_print_deadlock(struct rt_mutex_waiter *waiter); -# define debug_rt_mutex_reset_waiter(w) \ +# define debug_rt_mutex_reset_waiter(w) \ do { (w)->deadlock_lock = NULL; } while (0) -static inline int debug_rt_mutex_detect_deadlock(struct rt_mutex_waiter *waiter, - int detect) +static inline int +debug_rt_mutex_detect_deadlock(struct rt_mutex_waiter *waiter, int detect) { - return (waiter != NULL); + return waiter != NULL; } Index: linux-2.6-tip/drivers/net/8139too.c =================================================================== --- linux-2.6-tip.orig/drivers/net/8139too.c +++ linux-2.6-tip/drivers/net/8139too.c @@ -2209,7 +2209,11 @@ static irqreturn_t rtl8139_interrupt (in */ static void rtl8139_poll_controller(struct net_device *dev) { - disable_irq(dev->irq); + /* + * use _nosync() variant - might be used by netconsole + * from atomic contexts: + */ + disable_irq_nosync(dev->irq); rtl8139_interrupt(dev->irq, dev); enable_irq(dev->irq); } Index: linux-2.6-tip/arch/x86/mm/highmem_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/mm/highmem_32.c +++ linux-2.6-tip/arch/x86/mm/highmem_32.c @@ -3,9 +3,9 @@ void *kmap(struct page *page) { - might_sleep(); if (!PageHighMem(page)) return page_address(page); + might_sleep(); return kmap_high(page); } @@ -18,6 +18,27 @@ void kunmap(struct page *page) kunmap_high(page); } +void kunmap_virt(void *ptr) +{ + struct page *page; + + if ((unsigned long)ptr < PKMAP_ADDR(0)) + return; + page = pte_page(pkmap_page_table[PKMAP_NR((unsigned long)ptr)]); + kunmap(page); +} + +struct page *kmap_to_page(void *ptr) +{ + struct page *page; + + if ((unsigned long)ptr < PKMAP_ADDR(0)) + return virt_to_page(ptr); + page = pte_page(pkmap_page_table[PKMAP_NR((unsigned long)ptr)]); + return page; +} +EXPORT_SYMBOL_GPL(kmap_to_page); /* PREEMPT_RT converts some modules to use this */ + static void debug_kmap_atomic_prot(enum km_type type) { #ifdef CONFIG_DEBUG_HIGHMEM @@ -69,12 +90,12 @@ static void debug_kmap_atomic_prot(enum * However when holding an atomic kmap is is not legal to sleep, so atomic * kmaps are appropriate for short, tight code paths only. */ -void *kmap_atomic_prot(struct page *page, enum km_type type, pgprot_t prot) +void *__kmap_atomic_prot(struct page *page, enum km_type type, pgprot_t prot) { enum fixed_addresses idx; unsigned long vaddr; - /* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */ + preempt_disable(); pagefault_disable(); if (!PageHighMem(page)) @@ -84,19 +105,24 @@ void *kmap_atomic_prot(struct page *page idx = type + KM_TYPE_NR*smp_processor_id(); vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); - BUG_ON(!pte_none(*(kmap_pte-idx))); + WARN_ON_ONCE(!pte_none(*(kmap_pte-idx))); set_pte(kmap_pte-idx, mk_pte(page, prot)); arch_flush_lazy_mmu_mode(); return (void *)vaddr; } -void *kmap_atomic(struct page *page, enum km_type type) +void *__kmap_atomic_direct(struct page *page, enum km_type type) +{ + return __kmap_atomic_prot(page, type, kmap_prot); +} + +void *__kmap_atomic(struct page *page, enum km_type type) { return kmap_atomic_prot(page, type, kmap_prot); } -void kunmap_atomic(void *kvaddr, enum km_type type) +void __kunmap_atomic(void *kvaddr, enum km_type type) { unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK; enum fixed_addresses idx = type + KM_TYPE_NR*smp_processor_id(); @@ -118,16 +144,18 @@ void kunmap_atomic(void *kvaddr, enum km arch_flush_lazy_mmu_mode(); pagefault_enable(); + preempt_enable(); } /* This is the same as kmap_atomic() but can map memory that doesn't * have a struct page associated with it. */ -void *kmap_atomic_pfn(unsigned long pfn, enum km_type type) +void *__kmap_atomic_pfn(unsigned long pfn, enum km_type type) { enum fixed_addresses idx; unsigned long vaddr; + preempt_disable(); pagefault_disable(); idx = type + KM_TYPE_NR*smp_processor_id(); @@ -137,9 +165,9 @@ void *kmap_atomic_pfn(unsigned long pfn, return (void*) vaddr; } -EXPORT_SYMBOL_GPL(kmap_atomic_pfn); /* temporarily in use by i915 GEM until vmap */ +EXPORT_SYMBOL_GPL(__kmap_atomic_pfn); /* temporarily in use by i915 GEM until vmap */ -struct page *kmap_atomic_to_page(void *ptr) +struct page *__kmap_atomic_to_page(void *ptr) { unsigned long idx, vaddr = (unsigned long)ptr; pte_t *pte; @@ -154,5 +182,6 @@ struct page *kmap_atomic_to_page(void *p EXPORT_SYMBOL(kmap); EXPORT_SYMBOL(kunmap); -EXPORT_SYMBOL(kmap_atomic); -EXPORT_SYMBOL(kunmap_atomic); +EXPORT_SYMBOL(kunmap_virt); +EXPORT_SYMBOL(__kmap_atomic); +EXPORT_SYMBOL(__kunmap_atomic); Index: linux-2.6-tip/drivers/pci/msi.c =================================================================== --- linux-2.6-tip.orig/drivers/pci/msi.c +++ linux-2.6-tip/drivers/pci/msi.c @@ -323,6 +323,10 @@ static void __pci_restore_msi_state(stru return; entry = get_irq_msi(dev->irq); + if (!entry) { + WARN_ON(1); + return; + } pos = entry->msi_attrib.pos; pci_intx_for_msi(dev, 0); Index: linux-2.6-tip/drivers/block/floppy.c =================================================================== --- linux-2.6-tip.orig/drivers/block/floppy.c +++ linux-2.6-tip/drivers/block/floppy.c @@ -4148,6 +4148,28 @@ static void floppy_device_release(struct { } +static int floppy_suspend(struct platform_device *dev, pm_message_t state) +{ + floppy_release_irq_and_dma(); + + return 0; +} + +static int floppy_resume(struct platform_device *dev) +{ + floppy_grab_irq_and_dma(); + + return 0; +} + +static struct platform_driver floppy_driver = { + .suspend = floppy_suspend, + .resume = floppy_resume, + .driver = { + .name = "floppy", + }, +}; + static struct platform_device floppy_device[N_DRIVE]; static struct kobject *floppy_find(dev_t dev, int *part, void *data) @@ -4196,10 +4218,14 @@ static int __init floppy_init(void) if (err) goto out_put_disk; + err = platform_driver_register(&floppy_driver); + if (err) + goto out_unreg_blkdev; + floppy_queue = blk_init_queue(do_fd_request, &floppy_lock); if (!floppy_queue) { err = -ENOMEM; - goto out_unreg_blkdev; + goto out_unreg_driver; } blk_queue_max_sectors(floppy_queue, 64); @@ -4346,6 +4372,8 @@ out_flush_work: out_unreg_region: blk_unregister_region(MKDEV(FLOPPY_MAJOR, 0), 256); blk_cleanup_queue(floppy_queue); +out_unreg_driver: + platform_driver_unregister(&floppy_driver); out_unreg_blkdev: unregister_blkdev(FLOPPY_MAJOR, "fd"); out_put_disk: @@ -4567,6 +4595,7 @@ static void __exit floppy_module_exit(vo blk_unregister_region(MKDEV(FLOPPY_MAJOR, 0), 256); unregister_blkdev(FLOPPY_MAJOR, "fd"); + platform_driver_unregister(&floppy_driver); for (drive = 0; drive < N_DRIVE; drive++) { del_timer_sync(&motor_off_timer[drive]); Index: linux-2.6-tip/net/core/flow.c =================================================================== --- linux-2.6-tip.orig/net/core/flow.c +++ linux-2.6-tip/net/core/flow.c @@ -39,9 +39,10 @@ atomic_t flow_cache_genid = ATOMIC_INIT( static u32 flow_hash_shift; #define flow_hash_size (1 << flow_hash_shift) -static DEFINE_PER_CPU(struct flow_cache_entry **, flow_tables) = { NULL }; -#define flow_table(cpu) (per_cpu(flow_tables, cpu)) +static DEFINE_PER_CPU_LOCKED(struct flow_cache_entry **, flow_tables); + +#define flow_table(cpu) (per_cpu_var_locked(flow_tables, cpu)) static struct kmem_cache *flow_cachep __read_mostly; @@ -168,24 +169,24 @@ static int flow_key_compare(struct flowi void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family, u8 dir, flow_resolve_t resolver) { - struct flow_cache_entry *fle, **head; + struct flow_cache_entry **table, *fle, **head = NULL /* shut up GCC */; unsigned int hash; int cpu; local_bh_disable(); - cpu = smp_processor_id(); + table = get_cpu_var_locked(flow_tables, &cpu); fle = NULL; /* Packet really early in init? Making flow_cache_init a * pre-smp initcall would solve this. --RR */ - if (!flow_table(cpu)) + if (!table) goto nocache; if (flow_hash_rnd_recalc(cpu)) flow_new_hash_rnd(cpu); hash = flow_hash_code(key, cpu); - head = &flow_table(cpu)[hash]; + head = &table[hash]; for (fle = *head; fle; fle = fle->next) { if (fle->family == family && fle->dir == dir && @@ -195,6 +196,7 @@ void *flow_cache_lookup(struct net *net, if (ret) atomic_inc(fle->object_ref); + put_cpu_var_locked(flow_tables, cpu); local_bh_enable(); return ret; @@ -220,6 +222,8 @@ void *flow_cache_lookup(struct net *net, } nocache: + put_cpu_var_locked(flow_tables, cpu); + { int err; void *obj; @@ -249,14 +253,15 @@ nocache: static void flow_cache_flush_tasklet(unsigned long data) { struct flow_flush_info *info = (void *)data; + struct flow_cache_entry **table; int i; int cpu; - cpu = smp_processor_id(); + table = get_cpu_var_locked(flow_tables, &cpu); for (i = 0; i < flow_hash_size; i++) { struct flow_cache_entry *fle; - fle = flow_table(cpu)[i]; + fle = table[i]; for (; fle; fle = fle->next) { unsigned genid = atomic_read(&flow_cache_genid); @@ -267,6 +272,7 @@ static void flow_cache_flush_tasklet(uns atomic_dec(fle->object_ref); } } + put_cpu_var_locked(flow_tables, cpu); if (atomic_dec_and_test(&info->cpuleft)) complete(&info->completion); Index: linux-2.6-tip/fs/nfs/iostat.h =================================================================== --- linux-2.6-tip.orig/fs/nfs/iostat.h +++ linux-2.6-tip/fs/nfs/iostat.h @@ -28,7 +28,7 @@ static inline void nfs_inc_server_stats( cpu = get_cpu(); iostats = per_cpu_ptr(server->io_stats, cpu); iostats->events[stat]++; - put_cpu_no_resched(); + put_cpu(); } static inline void nfs_inc_stats(const struct inode *inode, @@ -47,7 +47,7 @@ static inline void nfs_add_server_stats( cpu = get_cpu(); iostats = per_cpu_ptr(server->io_stats, cpu); iostats->bytes[stat] += addend; - put_cpu_no_resched(); + put_cpu(); } static inline void nfs_add_stats(const struct inode *inode, Index: linux-2.6-tip/drivers/net/loopback.c =================================================================== --- linux-2.6-tip.orig/drivers/net/loopback.c +++ linux-2.6-tip/drivers/net/loopback.c @@ -76,13 +76,13 @@ static int loopback_xmit(struct sk_buff skb->protocol = eth_type_trans(skb,dev); - /* it's OK to use per_cpu_ptr() because BHs are off */ pcpu_lstats = dev->ml_priv; - lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id()); + lb_stats = per_cpu_ptr(pcpu_lstats, get_cpu()); lb_stats->bytes += skb->len; lb_stats->packets++; + put_cpu(); - netif_rx(skb); + netif_rx_ni(skb); return 0; } Index: linux-2.6-tip/include/asm-generic/cmpxchg-local.h =================================================================== --- linux-2.6-tip.orig/include/asm-generic/cmpxchg-local.h +++ linux-2.6-tip/include/asm-generic/cmpxchg-local.h @@ -20,7 +20,7 @@ static inline unsigned long __cmpxchg_lo if (size == 8 && sizeof(unsigned long) != 8) wrong_size_cmpxchg(ptr); - local_irq_save(flags); + raw_local_irq_save(flags); switch (size) { case 1: prev = *(u8 *)ptr; if (prev == old) @@ -41,7 +41,7 @@ static inline unsigned long __cmpxchg_lo default: wrong_size_cmpxchg(ptr); } - local_irq_restore(flags); + raw_local_irq_restore(flags); return prev; } @@ -54,11 +54,11 @@ static inline u64 __cmpxchg64_local_gene u64 prev; unsigned long flags; - local_irq_save(flags); + raw_local_irq_save(flags); prev = *(u64 *)ptr; if (prev == old) *(u64 *)ptr = new; - local_irq_restore(flags); + raw_local_irq_restore(flags); return prev; } Index: linux-2.6-tip/kernel/Kconfig.preempt =================================================================== --- linux-2.6-tip.orig/kernel/Kconfig.preempt +++ linux-2.6-tip/kernel/Kconfig.preempt @@ -1,14 +1,13 @@ - choice - prompt "Preemption Model" - default PREEMPT_NONE + prompt "Preemption Mode" + default PREEMPT_RT config PREEMPT_NONE bool "No Forced Preemption (Server)" help - This is the traditional Linux preemption model, geared towards + This is the traditional Linux preemption model geared towards throughput. It will still provide good latencies most of the - time, but there are no guarantees and occasional longer delays + time but there are no guarantees and occasional long delays are possible. Select this option if you are building a kernel for a server or @@ -21,7 +20,7 @@ config PREEMPT_VOLUNTARY help This option reduces the latency of the kernel by adding more "explicit preemption points" to the kernel code. These new - preemption points have been selected to reduce the maximum + preemption points have been selected to minimize the maximum latency of rescheduling, providing faster application reactions, at the cost of slightly lower throughput. @@ -33,22 +32,91 @@ config PREEMPT_VOLUNTARY Select this if you are building a kernel for a desktop system. -config PREEMPT +config PREEMPT_DESKTOP bool "Preemptible Kernel (Low-Latency Desktop)" help This option reduces the latency of the kernel by making - all kernel code (that is not executing in a critical section) + all kernel code that is not executing in a critical section preemptible. This allows reaction to interactive events by permitting a low priority process to be preempted involuntarily even if it is in kernel mode executing a system call and would - otherwise not be about to reach a natural preemption point. - This allows applications to run more 'smoothly' even when the - system is under load, at the cost of slightly lower throughput - and a slight runtime overhead to kernel code. + otherwise not about to reach a preemption point. This allows + applications to run more 'smoothly' even when the system is + under load, at the cost of slighly lower throughput and a + slight runtime overhead to kernel code. + + (According to profiles, when this mode is selected then even + during kernel-intense workloads the system is in an immediately + preemptible state more than 50% of the time.) Select this if you are building a kernel for a desktop or embedded system with latency requirements in the milliseconds range. +config PREEMPT_RT + bool "Complete Preemption (Real-Time)" + select PREEMPT_SOFTIRQS + select PREEMPT_HARDIRQS + select PREEMPT_RCU + select RT_MUTEXES + help + This option further reduces the scheduling latency of the + kernel by replacing almost every spinlock used by the kernel + with preemptible mutexes and thus making all but the most + critical kernel code involuntarily preemptible. The remaining + handful of lowlevel non-preemptible codepaths are short and + have a deterministic latency of a couple of tens of + microseconds (depending on the hardware). This also allows + applications to run more 'smoothly' even when the system is + under load, at the cost of lower throughput and runtime + overhead to kernel code. + + (According to profiles, when this mode is selected then even + during kernel-intense workloads the system is in an immediately + preemptible state more than 95% of the time.) + + Select this if you are building a kernel for a desktop, + embedded or real-time system with guaranteed latency + requirements of 100 usecs or lower. + endchoice +config PREEMPT + bool + default y + depends on PREEMPT_DESKTOP || PREEMPT_RT + +config PREEMPT_SOFTIRQS + bool "Thread Softirqs" + default n +# depends on PREEMPT + help + This option reduces the latency of the kernel by 'threading' + soft interrupts. This means that all softirqs will execute + in softirqd's context. While this helps latency, it can also + reduce performance. + + The threading of softirqs can also be controlled via + /proc/sys/kernel/softirq_preemption runtime flag and the + sofirq-preempt=0/1 boot-time option. + + Say N if you are unsure. + +config PREEMPT_HARDIRQS + bool "Thread Hardirqs" + default n + depends on !GENERIC_HARDIRQS_NO__DO_IRQ + select PREEMPT_SOFTIRQS + help + This option reduces the latency of the kernel by 'threading' + hardirqs. This means that all (or selected) hardirqs will run + in their own kernel thread context. While this helps latency, + this feature can also reduce performance. + + The threading of hardirqs can also be controlled via the + /proc/sys/kernel/hardirq_preemption runtime flag and the + hardirq-preempt=0/1 boot-time option. Per-irq threading can + be enabled/disable via the /proc/irq///threaded + runtime flags. + + Say N if you are unsure. Index: linux-2.6-tip/kernel/irq/autoprobe.c =================================================================== --- linux-2.6-tip.orig/kernel/irq/autoprobe.c +++ linux-2.6-tip/kernel/irq/autoprobe.c @@ -7,6 +7,7 @@ */ #include +#include #include #include #include Index: linux-2.6-tip/include/linux/hrtimer.h =================================================================== --- linux-2.6-tip.orig/include/linux/hrtimer.h +++ linux-2.6-tip/include/linux/hrtimer.h @@ -105,6 +105,7 @@ struct hrtimer { struct hrtimer_clock_base *base; unsigned long state; struct list_head cb_entry; + int irqsafe; #ifdef CONFIG_TIMER_STATS int start_pid; void *start_site; @@ -140,6 +141,7 @@ struct hrtimer_clock_base { struct hrtimer_cpu_base *cpu_base; clockid_t index; struct rb_root active; + struct list_head expired; struct rb_node *first; ktime_t resolution; ktime_t (*get_time)(void); @@ -166,13 +168,16 @@ struct hrtimer_clock_base { * @nr_events: Total number of timer interrupt events */ struct hrtimer_cpu_base { - spinlock_t lock; + raw_spinlock_t lock; struct hrtimer_clock_base clock_base[HRTIMER_MAX_CLOCK_BASES]; #ifdef CONFIG_HIGH_RES_TIMERS ktime_t expires_next; int hres_active; unsigned long nr_events; #endif +#ifdef CONFIG_PREEMPT_SOFTIRQS + wait_queue_head_t wait; +#endif }; static inline void hrtimer_set_expires(struct hrtimer *timer, ktime_t time) @@ -355,6 +360,13 @@ static inline int hrtimer_restart(struct return hrtimer_start_expires(timer, HRTIMER_MODE_ABS); } +/* Softirq preemption could deadlock timer removal */ +#ifdef CONFIG_PREEMPT_SOFTIRQS + extern void hrtimer_wait_for_timer(const struct hrtimer *timer); +#else +# define hrtimer_wait_for_timer(timer) do { cpu_relax(); } while (0) +#endif + /* Query timers: */ extern ktime_t hrtimer_get_remaining(const struct hrtimer *timer); extern int hrtimer_get_res(const clockid_t which_clock, struct timespec *tp); Index: linux-2.6-tip/kernel/hrtimer.c =================================================================== --- linux-2.6-tip.orig/kernel/hrtimer.c +++ linux-2.6-tip/kernel/hrtimer.c @@ -476,9 +476,9 @@ static inline int hrtimer_is_hres_enable /* * Is the high resolution mode active ? */ -static inline int hrtimer_hres_active(void) +static inline int hrtimer_hres_active(struct hrtimer_cpu_base *cpu_base) { - return __get_cpu_var(hrtimer_bases).hres_active; + return cpu_base->hres_active; } /* @@ -534,15 +534,24 @@ static int hrtimer_reprogram(struct hrti WARN_ON_ONCE(hrtimer_get_expires_tv64(timer) < 0); +#ifndef CONFIG_PREEMPT_RT /* * When the callback is running, we do not reprogram the clock event * device. The timer callback is either running on a different CPU or * the callback is executed in the hrtimer_interrupt context. The - * reprogramming is handled either by the softirq, which called the - * callback or at the end of the hrtimer_interrupt. + * reprogramming is handled at the end of the hrtimer_interrupt. */ if (hrtimer_callback_running(timer)) return 0; +#else + /* + * preempt-rt changes the rules here as long as we have not + * solved the callback problem. For softirq based timers we + * need to allow reprogramming. + */ + if (hrtimer_callback_running(timer) && timer->irqsafe) + return 0; +#endif /* * CLOCK_REALTIME timer might be requested with an absolute @@ -573,11 +582,12 @@ static int hrtimer_reprogram(struct hrti */ static void retrigger_next_event(void *arg) { - struct hrtimer_cpu_base *base; + struct hrtimer_cpu_base *base = &__get_cpu_var(hrtimer_bases); + struct timespec realtime_offset; unsigned long seq; - if (!hrtimer_hres_active()) + if (!hrtimer_hres_active(base)) return; do { @@ -587,8 +597,6 @@ static void retrigger_next_event(void *a -wall_to_monotonic.tv_nsec); } while (read_seqretry(&xtime_lock, seq)); - base = &__get_cpu_var(hrtimer_bases); - /* Adjust CLOCK_REALTIME offset */ spin_lock(&base->lock); base->clock_base[CLOCK_REALTIME].offset = @@ -643,6 +651,8 @@ static inline void hrtimer_init_timer_hr { } +static void __run_hrtimer(struct hrtimer *timer); +static int hrtimer_rt_defer(struct hrtimer *timer); /* * When High resolution timers are active, try to reprogram. Note, that in case @@ -654,9 +664,24 @@ static inline int hrtimer_enqueue_reprog struct hrtimer_clock_base *base) { if (base->cpu_base->hres_active && hrtimer_reprogram(timer, base)) { +#ifdef CONFIG_PREEMPT_RT + /* + * Move softirq based timers away from the rbtree in + * case it expired already. Otherwise we would have a + * stale base->first entry until the softirq runs. + */ + if (hrtimer_rt_defer(timer)) { + spin_unlock(&base->cpu_base->lock); + raise_softirq_irqoff(HRTIMER_SOFTIRQ); + spin_lock(&base->cpu_base->lock); + } else { + __run_hrtimer(timer); + } +#else spin_unlock(&base->cpu_base->lock); raise_softirq_irqoff(HRTIMER_SOFTIRQ); spin_lock(&base->cpu_base->lock); +#endif return 1; } return 0; @@ -665,10 +690,8 @@ static inline int hrtimer_enqueue_reprog /* * Switch to high resolution mode */ -static int hrtimer_switch_to_hres(void) +static int hrtimer_switch_to_hres(struct hrtimer_cpu_base *base) { - int cpu = smp_processor_id(); - struct hrtimer_cpu_base *base = &per_cpu(hrtimer_bases, cpu); unsigned long flags; if (base->hres_active) @@ -679,7 +702,7 @@ static int hrtimer_switch_to_hres(void) if (tick_init_highres()) { local_irq_restore(flags); printk(KERN_WARNING "Could not switch to high resolution " - "mode on CPU %d\n", cpu); + "mode on CPU %d\n", raw_smp_processor_id()); return 0; } base->hres_active = 1; @@ -691,16 +714,20 @@ static int hrtimer_switch_to_hres(void) /* "Retrigger" the interrupt to get things going */ retrigger_next_event(NULL); local_irq_restore(flags); - printk(KERN_DEBUG "Switched to high resolution mode on CPU %d\n", - smp_processor_id()); return 1; } #else -static inline int hrtimer_hres_active(void) { return 0; } +static inline int hrtimer_hres_active(struct hrtimer_cpu_base *base) +{ + return 0; +} static inline int hrtimer_is_hres_enabled(void) { return 0; } -static inline int hrtimer_switch_to_hres(void) { return 0; } +static inline int hrtimer_switch_to_hres(struct hrtimer_cpu_base *base) +{ + return 0; +} static inline void hrtimer_force_reprogram(struct hrtimer_cpu_base *base) { } static inline int hrtimer_enqueue_reprogram(struct hrtimer *timer, struct hrtimer_clock_base *base) @@ -829,6 +856,32 @@ static int enqueue_hrtimer(struct hrtime return leftmost; } +#ifdef CONFIG_PREEMPT_SOFTIRQS +# define wake_up_timer_waiters(b) wake_up(&(b)->wait) + +/** + * hrtimer_wait_for_timer - Wait for a running timer + * + * @timer: timer to wait for + * + * The function waits in case the timers callback function is + * currently executed on the waitqueue of the timer base. The + * waitqueue is woken up after the timer callback function has + * finished execution. + */ +void hrtimer_wait_for_timer(const struct hrtimer *timer) +{ + struct hrtimer_clock_base *base = timer->base; + + if (base && base->cpu_base && !hrtimer_hres_active(base->cpu_base)) + wait_event(base->cpu_base->wait, + !(timer->state & HRTIMER_STATE_CALLBACK)); +} + +#else +# define wake_up_timer_waiters(b) do { } while (0) +#endif + /* * __remove_hrtimer - internal function to remove a timer * @@ -844,6 +897,11 @@ static void __remove_hrtimer(struct hrti unsigned long newstate, int reprogram) { if (timer->state & HRTIMER_STATE_ENQUEUED) { + + if (unlikely(!list_empty(&timer->cb_entry))) { + list_del_init(&timer->cb_entry); + goto out; + } /* * Remove the timer from the rbtree and replace the * first entry pointer if necessary. @@ -851,11 +909,12 @@ static void __remove_hrtimer(struct hrti if (base->first == &timer->node) { base->first = rb_next(&timer->node); /* Reprogram the clock event device. if enabled */ - if (reprogram && hrtimer_hres_active()) + if (reprogram && hrtimer_hres_active(base->cpu_base)) hrtimer_force_reprogram(base->cpu_base); } rb_erase(&timer->node, &base->active); } +out: timer->state = newstate; } @@ -1009,7 +1068,7 @@ int hrtimer_cancel(struct hrtimer *timer if (ret >= 0) return ret; - cpu_relax(); + hrtimer_wait_for_timer(timer); } } EXPORT_SYMBOL_GPL(hrtimer_cancel); @@ -1049,7 +1108,7 @@ ktime_t hrtimer_get_next_event(void) spin_lock_irqsave(&cpu_base->lock, flags); - if (!hrtimer_hres_active()) { + if (!hrtimer_hres_active(cpu_base)) { for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++, base++) { struct hrtimer *timer; @@ -1163,6 +1222,77 @@ static void __run_hrtimer(struct hrtimer timer->state &= ~HRTIMER_STATE_CALLBACK; } +#ifdef CONFIG_PREEMPT_RT + +/* + * The changes in mainline which removed the callback modes from + * hrtimer are not yet working with -rt. The non wakeup_process() + * based callbacks which involve sleeping locks need to be treated + * seperately. + */ +static void hrtimer_rt_run_pending(void) +{ + enum hrtimer_restart (*fn)(struct hrtimer *); + struct hrtimer_cpu_base *cpu_base; + struct hrtimer_clock_base *base; + struct hrtimer *timer; + int index, restart; + + local_irq_disable(); + cpu_base = &per_cpu(hrtimer_bases, smp_processor_id()); + + spin_lock(&cpu_base->lock); + + for (index = 0; index < HRTIMER_MAX_CLOCK_BASES; index++) { + base = &cpu_base->clock_base[index]; + + while (!list_empty(&base->expired)) { + timer = list_first_entry(&base->expired, + struct hrtimer, cb_entry); + + /* + * Same as the above __run_hrtimer function + * just we run with interrupts enabled. + */ + debug_hrtimer_deactivate(timer); + __remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK, 0); + timer_stats_account_hrtimer(timer); + fn = timer->function; + + spin_unlock_irq(&cpu_base->lock); + restart = fn(timer); + spin_lock_irq(&cpu_base->lock); + + if (restart != HRTIMER_NORESTART) { + BUG_ON(timer->state != HRTIMER_STATE_CALLBACK); + enqueue_hrtimer(timer, base); + } + timer->state &= ~HRTIMER_STATE_CALLBACK; + } + } + + spin_unlock_irq(&cpu_base->lock); + + wake_up_timer_waiters(cpu_base); +} + +static int hrtimer_rt_defer(struct hrtimer *timer) +{ + if (timer->irqsafe) + return 0; + + __remove_hrtimer(timer, timer->base, timer->state, 0); + list_add_tail(&timer->cb_entry, &timer->base->expired); + return 1; +} + +#else + +static inline void hrtimer_rt_run_pending(void) { } +static inline int hrtimer_rt_defer(struct hrtimer *timer) { return 0; } + +#endif + #ifdef CONFIG_HIGH_RES_TIMERS static int force_clock_reprogram; @@ -1198,7 +1328,7 @@ void hrtimer_interrupt(struct clock_even struct hrtimer_clock_base *base; ktime_t expires_next, now; int nr_retries = 0; - int i; + int i, raise = 0; BUG_ON(!cpu_base->hres_active); cpu_base->nr_events++; @@ -1251,7 +1381,10 @@ void hrtimer_interrupt(struct clock_even break; } - __run_hrtimer(timer); + if (!hrtimer_rt_defer(timer)) + __run_hrtimer(timer); + else + raise = 1; } spin_unlock(&cpu_base->lock); base++; @@ -1264,6 +1397,9 @@ void hrtimer_interrupt(struct clock_even if (tick_program_event(expires_next, force_clock_reprogram)) goto retry; } + + if (raise) + raise_softirq_irqoff(HRTIMER_SOFTIRQ); } /* @@ -1272,9 +1408,11 @@ void hrtimer_interrupt(struct clock_even */ static void __hrtimer_peek_ahead_timers(void) { + struct hrtimer_cpu_base *cpu_base; struct tick_device *td; - if (!hrtimer_hres_active()) + cpu_base = &__get_cpu_var(hrtimer_bases); + if (!hrtimer_hres_active(cpu_base)) return; td = &__get_cpu_var(tick_cpu_device); @@ -1300,17 +1438,18 @@ void hrtimer_peek_ahead_timers(void) local_irq_restore(flags); } -static void run_hrtimer_softirq(struct softirq_action *h) -{ - hrtimer_peek_ahead_timers(); -} - #else /* CONFIG_HIGH_RES_TIMERS */ static inline void __hrtimer_peek_ahead_timers(void) { } #endif /* !CONFIG_HIGH_RES_TIMERS */ +static void run_hrtimer_softirq(struct softirq_action *h) +{ + hrtimer_peek_ahead_timers(); + hrtimer_rt_run_pending(); +} + /* * Called from timer softirq every jiffy, expire hrtimers: * @@ -1320,7 +1459,9 @@ static inline void __hrtimer_peek_ahead_ */ void hrtimer_run_pending(void) { - if (hrtimer_hres_active()) + struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases); + + if (hrtimer_hres_active(cpu_base)) return; /* @@ -1332,7 +1473,7 @@ void hrtimer_run_pending(void) * deadlock vs. xtime_lock. */ if (tick_check_oneshot_change(!hrtimer_is_hres_enabled())) - hrtimer_switch_to_hres(); + hrtimer_switch_to_hres(cpu_base); } /* @@ -1341,11 +1482,12 @@ void hrtimer_run_pending(void) void hrtimer_run_queues(void) { struct rb_node *node; - struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases); + struct hrtimer_cpu_base *cpu_base; struct hrtimer_clock_base *base; - int index, gettime = 1; + int index, gettime = 1, raise = 0; - if (hrtimer_hres_active()) + cpu_base = &per_cpu(hrtimer_bases, raw_smp_processor_id()); + if (hrtimer_hres_active(cpu_base)) return; for (index = 0; index < HRTIMER_MAX_CLOCK_BASES; index++) { @@ -1369,10 +1511,16 @@ void hrtimer_run_queues(void) hrtimer_get_expires_tv64(timer)) break; - __run_hrtimer(timer); + if (!hrtimer_rt_defer(timer)) + __run_hrtimer(timer); + else + raise = 1; } spin_unlock(&cpu_base->lock); } + + if (raise) + raise_softirq_irqoff(HRTIMER_SOFTIRQ); } /* @@ -1394,6 +1542,7 @@ static enum hrtimer_restart hrtimer_wake void hrtimer_init_sleeper(struct hrtimer_sleeper *sl, struct task_struct *task) { sl->timer.function = hrtimer_wakeup; + sl->timer.irqsafe = 1; sl->task = task; } @@ -1528,10 +1677,15 @@ static void __cpuinit init_hrtimers_cpu( spin_lock_init(&cpu_base->lock); - for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) + for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) { cpu_base->clock_base[i].cpu_base = cpu_base; + INIT_LIST_HEAD(&cpu_base->clock_base[i].expired); + } hrtimer_init_hres(cpu_base); +#ifdef CONFIG_PREEMPT_SOFTIRQS + init_waitqueue_head(&cpu_base->wait); +#endif } #ifdef CONFIG_HOTPLUG_CPU @@ -1644,9 +1798,7 @@ void __init hrtimers_init(void) hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE, (void *)(long)smp_processor_id()); register_cpu_notifier(&hrtimers_nb); -#ifdef CONFIG_HIGH_RES_TIMERS open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq); -#endif } /** Index: linux-2.6-tip/kernel/itimer.c =================================================================== --- linux-2.6-tip.orig/kernel/itimer.c +++ linux-2.6-tip/kernel/itimer.c @@ -161,6 +161,7 @@ again: /* We are sharing ->siglock with it_real_fn() */ if (hrtimer_try_to_cancel(timer) < 0) { spin_unlock_irq(&tsk->sighand->siglock); + hrtimer_wait_for_timer(&tsk->signal->real_timer); goto again; } expires = timeval_to_ktime(value->it_value); Index: linux-2.6-tip/include/linux/bottom_half.h =================================================================== --- linux-2.6-tip.orig/include/linux/bottom_half.h +++ linux-2.6-tip/include/linux/bottom_half.h @@ -1,9 +1,17 @@ #ifndef _LINUX_BH_H #define _LINUX_BH_H +#ifdef CONFIG_PREEMPT_HARDIRQS +# define local_bh_disable() do { } while (0) +# define __local_bh_disable(ip) do { } while (0) +# define _local_bh_enable() do { } while (0) +# define local_bh_enable() do { } while (0) +# define local_bh_enable_ip(ip) do { } while (0) +#else extern void local_bh_disable(void); extern void _local_bh_enable(void); extern void local_bh_enable(void); extern void local_bh_enable_ip(unsigned long ip); +#endif #endif /* _LINUX_BH_H */ Index: linux-2.6-tip/include/linux/preempt.h =================================================================== --- linux-2.6-tip.orig/include/linux/preempt.h +++ linux-2.6-tip/include/linux/preempt.h @@ -9,6 +9,7 @@ #include #include #include +#include #if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_PREEMPT_TRACER) extern void add_preempt_count(int val); @@ -21,11 +22,12 @@ #define inc_preempt_count() add_preempt_count(1) #define dec_preempt_count() sub_preempt_count(1) -#define preempt_count() (current_thread_info()->preempt_count) +#define preempt_count() (current_thread_info()->preempt_count) #ifdef CONFIG_PREEMPT asmlinkage void preempt_schedule(void); +asmlinkage void preempt_schedule_irq(void); #define preempt_disable() \ do { \ @@ -33,12 +35,19 @@ do { \ barrier(); \ } while (0) -#define preempt_enable_no_resched() \ +#define __preempt_enable_no_resched() \ do { \ barrier(); \ dec_preempt_count(); \ } while (0) + +#ifdef CONFIG_DEBUG_PREEMPT +extern void notrace preempt_enable_no_resched(void); +#else +# define preempt_enable_no_resched() __preempt_enable_no_resched() +#endif + #define preempt_check_resched() \ do { \ if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \ @@ -47,7 +56,7 @@ do { \ #define preempt_enable() \ do { \ - preempt_enable_no_resched(); \ + __preempt_enable_no_resched(); \ barrier(); \ preempt_check_resched(); \ } while (0) @@ -84,6 +93,7 @@ do { \ #define preempt_disable() do { } while (0) #define preempt_enable_no_resched() do { } while (0) +#define __preempt_enable_no_resched() do { } while (0) #define preempt_enable() do { } while (0) #define preempt_check_resched() do { } while (0) @@ -91,6 +101,8 @@ do { \ #define preempt_enable_no_resched_notrace() do { } while (0) #define preempt_enable_notrace() do { } while (0) +#define preempt_schedule_irq() do { } while (0) + #endif #ifdef CONFIG_PREEMPT_NOTIFIERS Index: linux-2.6-tip/net/ipv4/tcp.c =================================================================== --- linux-2.6-tip.orig/net/ipv4/tcp.c +++ linux-2.6-tip/net/ipv4/tcp.c @@ -1322,11 +1322,11 @@ int tcp_recvmsg(struct kiocb *iocb, stru (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) && !sysctl_tcp_low_latency && dma_find_channel(DMA_MEMCPY)) { - preempt_enable_no_resched(); + preempt_enable(); tp->ucopy.pinned_list = dma_pin_iovec_pages(msg->msg_iov, len); } else { - preempt_enable_no_resched(); + preempt_enable(); } } #endif Index: linux-2.6-tip/net/ipv4/route.c =================================================================== --- linux-2.6-tip.orig/net/ipv4/route.c +++ linux-2.6-tip/net/ipv4/route.c @@ -204,13 +204,13 @@ struct rt_hash_bucket { }; #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \ - defined(CONFIG_PROVE_LOCKING) + defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_PREEMPT_RT) /* * Instead of using one spinlock for each rt_hash_bucket, we use a table of spinlocks * The size of this table is a power of two and depends on the number of CPUS. * (on lockdep we have a quite big spinlock_t, so keep the size down there) */ -#ifdef CONFIG_LOCKDEP +#if defined(CONFIG_LOCKDEP) || defined(CONFIG_PREEMPT_RT) # define RT_HASH_LOCK_SZ 256 #else # if NR_CPUS >= 32 @@ -242,7 +242,7 @@ static __init void rt_hash_lock_init(voi spin_lock_init(&rt_hash_locks[i]); } #else -# define rt_hash_lock_addr(slot) NULL +# define rt_hash_lock_addr(slot) ((spinlock_t *)NULL) static inline void rt_hash_lock_init(void) { Index: linux-2.6-tip/arch/x86/include/asm/rwsem.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/rwsem.h +++ linux-2.6-tip/arch/x86/include/asm/rwsem.h @@ -44,14 +44,14 @@ struct rwsem_waiter; -extern asmregparm struct rw_semaphore * - rwsem_down_read_failed(struct rw_semaphore *sem); -extern asmregparm struct rw_semaphore * - rwsem_down_write_failed(struct rw_semaphore *sem); -extern asmregparm struct rw_semaphore * - rwsem_wake(struct rw_semaphore *); -extern asmregparm struct rw_semaphore * - rwsem_downgrade_wake(struct rw_semaphore *sem); +extern asmregparm struct compat_rw_semaphore * + rwsem_down_read_failed(struct compat_rw_semaphore *sem); +extern asmregparm struct compat_rw_semaphore * + rwsem_down_write_failed(struct compat_rw_semaphore *sem); +extern asmregparm struct compat_rw_semaphore * + rwsem_wake(struct compat_rw_semaphore *); +extern asmregparm struct compat_rw_semaphore * + rwsem_downgrade_wake(struct compat_rw_semaphore *sem); /* * the semaphore definition @@ -64,7 +64,7 @@ extern asmregparm struct rw_semaphore * #define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS #define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS) -struct rw_semaphore { +struct compat_rw_semaphore { signed long count; spinlock_t wait_lock; struct list_head wait_list; @@ -86,23 +86,23 @@ struct rw_semaphore { LIST_HEAD_INIT((name).wait_list) __RWSEM_DEP_MAP_INIT(name) \ } -#define DECLARE_RWSEM(name) \ - struct rw_semaphore name = __RWSEM_INITIALIZER(name) +#define COMPAT_DECLARE_RWSEM(name) \ + struct compat_rw_semaphore name = __RWSEM_INITIALIZER(name) -extern void __init_rwsem(struct rw_semaphore *sem, const char *name, +extern void __compat_init_rwsem(struct compat_rw_semaphore *sem, const char *name, struct lock_class_key *key); -#define init_rwsem(sem) \ +#define compat_init_rwsem(sem) \ do { \ static struct lock_class_key __key; \ \ - __init_rwsem((sem), #sem, &__key); \ + __compat_init_rwsem((sem), #sem, &__key); \ } while (0) /* * lock for reading */ -static inline void __down_read(struct rw_semaphore *sem) +static inline void __down_read(struct compat_rw_semaphore *sem) { asm volatile("# beginning down_read\n\t" LOCK_PREFIX " incl (%%eax)\n\t" @@ -119,7 +119,7 @@ static inline void __down_read(struct rw /* * trylock for reading -- returns 1 if successful, 0 if contention */ -static inline int __down_read_trylock(struct rw_semaphore *sem) +static inline int __down_read_trylock(struct compat_rw_semaphore *sem) { __s32 result, tmp; asm volatile("# beginning __down_read_trylock\n\t" @@ -141,7 +141,8 @@ static inline int __down_read_trylock(st /* * lock for writing */ -static inline void __down_write_nested(struct rw_semaphore *sem, int subclass) +static inline void +__down_write_nested(struct compat_rw_semaphore *sem, int subclass) { int tmp; @@ -160,7 +161,7 @@ static inline void __down_write_nested(s : "memory", "cc"); } -static inline void __down_write(struct rw_semaphore *sem) +static inline void __down_write(struct compat_rw_semaphore *sem) { __down_write_nested(sem, 0); } @@ -168,7 +169,7 @@ static inline void __down_write(struct r /* * trylock for writing -- returns 1 if successful, 0 if contention */ -static inline int __down_write_trylock(struct rw_semaphore *sem) +static inline int __down_write_trylock(struct compat_rw_semaphore *sem) { signed long ret = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE, @@ -181,7 +182,7 @@ static inline int __down_write_trylock(s /* * unlock after reading */ -static inline void __up_read(struct rw_semaphore *sem) +static inline void __up_read(struct compat_rw_semaphore *sem) { __s32 tmp = -RWSEM_ACTIVE_READ_BIAS; asm volatile("# beginning __up_read\n\t" @@ -199,7 +200,7 @@ static inline void __up_read(struct rw_s /* * unlock after writing */ -static inline void __up_write(struct rw_semaphore *sem) +static inline void __up_write(struct compat_rw_semaphore *sem) { asm volatile("# beginning __up_write\n\t" " movl %2,%%edx\n\t" @@ -218,7 +219,7 @@ static inline void __up_write(struct rw_ /* * downgrade write lock to read lock */ -static inline void __downgrade_write(struct rw_semaphore *sem) +static inline void __downgrade_write(struct compat_rw_semaphore *sem) { asm volatile("# beginning __downgrade_write\n\t" LOCK_PREFIX " addl %2,(%%eax)\n\t" @@ -235,7 +236,7 @@ static inline void __downgrade_write(str /* * implement atomic add functionality */ -static inline void rwsem_atomic_add(int delta, struct rw_semaphore *sem) +static inline void rwsem_atomic_add(int delta, struct compat_rw_semaphore *sem) { asm volatile(LOCK_PREFIX "addl %1,%0" : "+m" (sem->count) @@ -245,7 +246,7 @@ static inline void rwsem_atomic_add(int /* * implement exchange and add functionality */ -static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem) +static inline int rwsem_atomic_update(int delta, struct compat_rw_semaphore *sem) { int tmp = delta; @@ -256,7 +257,7 @@ static inline int rwsem_atomic_update(in return tmp + delta; } -static inline int rwsem_is_locked(struct rw_semaphore *sem) +static inline int compat_rwsem_is_locked(struct compat_rw_semaphore *sem) { return (sem->count != 0); } Index: linux-2.6-tip/arch/x86/include/asm/spinlock_types.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/spinlock_types.h +++ linux-2.6-tip/arch/x86/include/asm/spinlock_types.h @@ -7,13 +7,13 @@ typedef struct raw_spinlock { unsigned int slock; -} raw_spinlock_t; +} __raw_spinlock_t; #define __RAW_SPIN_LOCK_UNLOCKED { 0 } typedef struct { unsigned int lock; -} raw_rwlock_t; +} __raw_rwlock_t; #define __RAW_RW_LOCK_UNLOCKED { RW_LOCK_BIAS } Index: linux-2.6-tip/arch/x86/kernel/vsyscall_64.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/vsyscall_64.c +++ linux-2.6-tip/arch/x86/kernel/vsyscall_64.c @@ -59,7 +59,7 @@ int __vgetcpu_mode __section_vgetcpu_mod struct vsyscall_gtod_data __vsyscall_gtod_data __section_vsyscall_gtod_data = { - .lock = SEQLOCK_UNLOCKED, + .lock = __RAW_SEQLOCK_UNLOCKED(__vsyscall_gtod_data.lock), .sysctl_enabled = 1, }; @@ -78,14 +78,40 @@ void update_vsyscall(struct timespec *wa unsigned long flags; write_seqlock_irqsave(&vsyscall_gtod_data.lock, flags); + + if (likely(vsyscall_gtod_data.sysctl_enabled == 2)) { + struct timespec tmp = *(wall_time); + cycle_t (*vread)(void); + cycle_t now; + + vread = vsyscall_gtod_data.clock.vread; + if (likely(vread)) + now = vread(); + else + now = clock->read(); + + /* calculate interval: */ + now = (now - clock->cycle_last) & clock->mask; + /* convert to nsecs: */ + tmp.tv_nsec += ( now * clock->mult) >> clock->shift; + + while (tmp.tv_nsec >= NSEC_PER_SEC) { + tmp.tv_sec += 1; + tmp.tv_nsec -= NSEC_PER_SEC; + } + + vsyscall_gtod_data.wall_time_sec = tmp.tv_sec; + vsyscall_gtod_data.wall_time_nsec = tmp.tv_nsec; + } else { + vsyscall_gtod_data.wall_time_sec = wall_time->tv_sec; + vsyscall_gtod_data.wall_time_nsec = wall_time->tv_nsec; + } /* copy vsyscall data */ vsyscall_gtod_data.clock.vread = clock->vread; vsyscall_gtod_data.clock.cycle_last = clock->cycle_last; vsyscall_gtod_data.clock.mask = clock->mask; vsyscall_gtod_data.clock.mult = clock->mult; vsyscall_gtod_data.clock.shift = clock->shift; - vsyscall_gtod_data.wall_time_sec = wall_time->tv_sec; - vsyscall_gtod_data.wall_time_nsec = wall_time->tv_nsec; vsyscall_gtod_data.wall_to_monotonic = wall_to_monotonic; write_sequnlock_irqrestore(&vsyscall_gtod_data.lock, flags); } @@ -123,6 +149,26 @@ static __always_inline void do_vgettimeo unsigned seq; unsigned long mult, shift, nsec; cycle_t (*vread)(void); + + if (likely(__vsyscall_gtod_data.sysctl_enabled == 2)) { + struct timeval tmp; + + do { + barrier(); + tv->tv_sec = __vsyscall_gtod_data.wall_time_sec; + tv->tv_usec = __vsyscall_gtod_data.wall_time_nsec; + barrier(); + tmp.tv_sec = __vsyscall_gtod_data.wall_time_sec; + tmp.tv_usec = __vsyscall_gtod_data.wall_time_nsec; + + } while (tmp.tv_usec != tv->tv_usec || + tmp.tv_sec != tv->tv_sec); + + tv->tv_usec /= NSEC_PER_MSEC; + tv->tv_usec *= USEC_PER_MSEC; + return; + } + do { seq = read_seqbegin(&__vsyscall_gtod_data.lock); @@ -138,7 +184,6 @@ static __always_inline void do_vgettimeo * does not cause time warps: */ rdtsc_barrier(); - now = vread(); rdtsc_barrier(); base = __vsyscall_gtod_data.clock.cycle_last; @@ -150,6 +195,7 @@ static __always_inline void do_vgettimeo nsec = __vsyscall_gtod_data.wall_time_nsec; } while (read_seqretry(&__vsyscall_gtod_data.lock, seq)); + now = vread(); /* calculate interval: */ cycle_delta = (now - base) & mask; /* convert to nsecs: */ Index: linux-2.6-tip/drivers/input/ff-memless.c =================================================================== --- linux-2.6-tip.orig/drivers/input/ff-memless.c +++ linux-2.6-tip/drivers/input/ff-memless.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include Index: linux-2.6-tip/fs/proc/array.c =================================================================== --- linux-2.6-tip.orig/fs/proc/array.c +++ linux-2.6-tip/fs/proc/array.c @@ -133,12 +133,13 @@ static inline void task_name(struct seq_ */ static const char *task_state_array[] = { "R (running)", /* 0 */ - "S (sleeping)", /* 1 */ - "D (disk sleep)", /* 2 */ - "T (stopped)", /* 4 */ - "T (tracing stop)", /* 8 */ - "Z (zombie)", /* 16 */ - "X (dead)" /* 32 */ + "M (running-mutex)", /* 1 */ + "S (sleeping)", /* 2 */ + "D (disk sleep)", /* 4 */ + "T (stopped)", /* 8 */ + "T (tracing stop)", /* 16 */ + "Z (zombie)", /* 32 */ + "X (dead)" /* 64 */ }; static inline const char *get_task_state(struct task_struct *tsk) @@ -320,6 +321,19 @@ static inline void task_context_switch_c p->nivcsw); } +#define get_blocked_on(t) (-1) + +static inline void show_blocked_on(struct seq_file *m, struct task_struct *p) +{ + pid_t pid = get_blocked_on(p); + + if (pid < 0) + return; + + seq_printf(m, "BlckOn: %d\n", pid); +} + + int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task) { @@ -339,6 +353,7 @@ int proc_pid_status(struct seq_file *m, task_show_regs(m, task); #endif task_context_switch_counts(m, task); + show_blocked_on(m, task); return 0; } Index: linux-2.6-tip/include/linux/bit_spinlock.h =================================================================== --- linux-2.6-tip.orig/include/linux/bit_spinlock.h +++ linux-2.6-tip/include/linux/bit_spinlock.h @@ -1,6 +1,8 @@ #ifndef __LINUX_BIT_SPINLOCK_H #define __LINUX_BIT_SPINLOCK_H +#if 0 + /* * bit-based spin_lock() * @@ -91,5 +93,7 @@ static inline int bit_spin_is_locked(int #endif } +#endif + #endif /* __LINUX_BIT_SPINLOCK_H */ Index: linux-2.6-tip/include/linux/pickop.h =================================================================== --- /dev/null +++ linux-2.6-tip/include/linux/pickop.h @@ -0,0 +1,32 @@ +#ifndef _LINUX_PICKOP_H +#define _LINUX_PICKOP_H + +#undef PICK_TYPE_EQUAL +#define PICK_TYPE_EQUAL(var, type) \ + __builtin_types_compatible_p(typeof(var), type) + +extern int __bad_func_type(void); + +#define PICK_FUNCTION(type1, type2, func1, func2, arg0, ...) \ +do { \ + if (PICK_TYPE_EQUAL((arg0), type1)) \ + func1((type1)(arg0), ##__VA_ARGS__); \ + else if (PICK_TYPE_EQUAL((arg0), type2)) \ + func2((type2)(arg0), ##__VA_ARGS__); \ + else __bad_func_type(); \ +} while (0) + +#define PICK_FUNCTION_RET(type1, type2, func1, func2, arg0, ...) \ +({ \ + unsigned long __ret; \ + \ + if (PICK_TYPE_EQUAL((arg0), type1)) \ + __ret = func1((type1)(arg0), ##__VA_ARGS__); \ + else if (PICK_TYPE_EQUAL((arg0), type2)) \ + __ret = func2((type2)(arg0), ##__VA_ARGS__); \ + else __ret = __bad_func_type(); \ + \ + __ret; \ +}) + +#endif /* _LINUX_PICKOP_H */ Index: linux-2.6-tip/include/linux/rt_lock.h =================================================================== --- /dev/null +++ linux-2.6-tip/include/linux/rt_lock.h @@ -0,0 +1,274 @@ +#ifndef __LINUX_RT_LOCK_H +#define __LINUX_RT_LOCK_H + +/* + * Real-Time Preemption Support + * + * started by Ingo Molnar: + * + * Copyright (C) 2004, 2005 Red Hat, Inc., Ingo Molnar + * + * This file contains the main data structure definitions. + */ +#include +#include +#include + +#ifdef CONFIG_PREEMPT_RT +# define preempt_rt 1 +/* + * spinlocks - an RT mutex plus lock-break field: + */ +typedef struct { + struct rt_mutex lock; + unsigned int break_lock; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map dep_map; +#endif +} spinlock_t; + +#ifdef CONFIG_DEBUG_RT_MUTEXES +# define __RT_SPIN_INITIALIZER(name) \ + { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \ + .save_state = 1, \ + .file = __FILE__, \ + .line = __LINE__, } +#else +# define __RT_SPIN_INITIALIZER(name) \ + { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name.wait_lock) } +#endif + +#define __SPIN_LOCK_UNLOCKED(name) (spinlock_t) \ + { .lock = __RT_SPIN_INITIALIZER(name), \ + SPIN_DEP_MAP_INIT(name) } + +#else /* !PREEMPT_RT */ + +typedef raw_spinlock_t spinlock_t; + +#define __SPIN_LOCK_UNLOCKED _RAW_SPIN_LOCK_UNLOCKED + +#endif + +#define SPIN_LOCK_UNLOCKED __SPIN_LOCK_UNLOCKED(spin_old_style) + + +#define __DEFINE_SPINLOCK(name) \ + spinlock_t name = __SPIN_LOCK_UNLOCKED(name) + +#define DEFINE_SPINLOCK(name) \ + spinlock_t name __cacheline_aligned_in_smp = __SPIN_LOCK_UNLOCKED(name) + +#ifdef CONFIG_PREEMPT_RT + +/* + * RW-semaphores are a spinlock plus a reader-depth count. + * + * Note that the semantics are different from the usual + * Linux rw-sems, in PREEMPT_RT mode we do not allow + * multiple readers to hold the lock at once, we only allow + * a read-lock owner to read-lock recursively. This is + * better for latency, makes the implementation inherently + * fair and makes it simpler as well: + */ +struct rw_semaphore { + struct rt_mutex lock; + int read_depth; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map dep_map; +#endif +}; + +/* + * rwlocks - an RW semaphore plus lock-break field: + */ +typedef struct { + struct rt_mutex lock; + int read_depth; + unsigned int break_lock; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map dep_map; +#endif +} rwlock_t; + +#define __RW_LOCK_UNLOCKED(name) (rwlock_t) \ + { .lock = __RT_SPIN_INITIALIZER(name), \ + RW_DEP_MAP_INIT(name) } +#else /* !PREEMPT_RT */ + +typedef raw_rwlock_t rwlock_t; + +#define __RW_LOCK_UNLOCKED _RAW_RW_LOCK_UNLOCKED + +#endif + +#define RW_LOCK_UNLOCKED __RW_LOCK_UNLOCKED(rw_old_style) + + +#define DEFINE_RWLOCK(name) \ + rwlock_t name __cacheline_aligned_in_smp = __RW_LOCK_UNLOCKED(name) + +#ifdef CONFIG_PREEMPT_RT + +/* + * Semaphores - a spinlock plus the semaphore count: + */ +struct semaphore { + atomic_t count; + struct rt_mutex lock; +}; + +#define DECLARE_MUTEX(name) \ +struct semaphore name = \ + { .count = { 1 }, .lock = __RT_MUTEX_INITIALIZER(name.lock) } + +extern void +__sema_init(struct semaphore *sem, int val, char *name, char *file, int line); + +#define rt_sema_init(sem, val) \ + __sema_init(sem, val, #sem, __FILE__, __LINE__) + +extern void +__init_MUTEX(struct semaphore *sem, char *name, char *file, int line); +#define rt_init_MUTEX(sem) \ + __init_MUTEX(sem, #sem, __FILE__, __LINE__) + +extern void there_is_no_init_MUTEX_LOCKED_for_RT_semaphores(void); + +/* + * No locked initialization for RT semaphores + */ +#define rt_init_MUTEX_LOCKED(sem) \ + there_is_no_init_MUTEX_LOCKED_for_RT_semaphores() +extern void rt_down(struct semaphore *sem); +extern int rt_down_interruptible(struct semaphore *sem); +extern int rt_down_timeout(struct semaphore *sem, long jiffies); +extern int rt_down_trylock(struct semaphore *sem); +extern void rt_up(struct semaphore *sem); + +#define rt_sem_is_locked(s) rt_mutex_is_locked(&(s)->lock) +#define rt_sema_count(s) atomic_read(&(s)->count) + +extern int __bad_func_type(void); + +#include + +/* + * PICK_SEM_OP() is a small redirector to allow less typing of the lock + * types struct compat_semaphore, struct semaphore, at the front of the + * PICK_FUNCTION macro. + */ +#define PICK_SEM_OP(...) PICK_FUNCTION(struct compat_semaphore *, \ + struct semaphore *, ##__VA_ARGS__) +#define PICK_SEM_OP_RET(...) PICK_FUNCTION_RET(struct compat_semaphore *,\ + struct semaphore *, ##__VA_ARGS__) + +#define sema_init(sem, val) \ + PICK_SEM_OP(compat_sema_init, rt_sema_init, sem, val) + +#define init_MUTEX(sem) PICK_SEM_OP(compat_init_MUTEX, rt_init_MUTEX, sem) + +#define init_MUTEX_LOCKED(sem) \ + PICK_SEM_OP(compat_init_MUTEX_LOCKED, rt_init_MUTEX_LOCKED, sem) + +#define down(sem) PICK_SEM_OP(compat_down, rt_down, sem) + +#define down_timeout(sem, jiff) \ + PICK_SEM_OP_RET(compat_down_timeout, rt_down_timeout, sem, jiff) + +#define down_interruptible(sem) \ + PICK_SEM_OP_RET(compat_down_interruptible, rt_down_interruptible, sem) + +#define down_trylock(sem) \ + PICK_SEM_OP_RET(compat_down_trylock, rt_down_trylock, sem) + +#define up(sem) PICK_SEM_OP(compat_up, rt_up, sem) + +/* + * rwsems: + */ + +#define __RWSEM_INITIALIZER(name) \ + { .lock = __RT_MUTEX_INITIALIZER(name.lock), \ + RW_DEP_MAP_INIT(name) } + +#define DECLARE_RWSEM(lockname) \ + struct rw_semaphore lockname = __RWSEM_INITIALIZER(lockname) + +extern void __rt_rwsem_init(struct rw_semaphore *rwsem, char *name, + struct lock_class_key *key); + +# define rt_init_rwsem(sem) \ +do { \ + static struct lock_class_key __key; \ + \ + __rt_rwsem_init((sem), #sem, &__key); \ +} while (0) + +extern void __dont_do_this_in_rt(struct rw_semaphore *rwsem); + +#define rt_down_read_non_owner(rwsem) __dont_do_this_in_rt(rwsem) +#define rt_up_read_non_owner(rwsem) __dont_do_this_in_rt(rwsem) + +extern void rt_down_write(struct rw_semaphore *rwsem); +extern void +rt_down_read_nested(struct rw_semaphore *rwsem, int subclass); +extern void +rt_down_write_nested(struct rw_semaphore *rwsem, int subclass); +extern void rt_down_read(struct rw_semaphore *rwsem); +extern int rt_down_write_trylock(struct rw_semaphore *rwsem); +extern int rt_down_read_trylock(struct rw_semaphore *rwsem); +extern void rt_up_read(struct rw_semaphore *rwsem); +extern void rt_up_write(struct rw_semaphore *rwsem); +extern void rt_downgrade_write(struct rw_semaphore *rwsem); + +# define rt_rwsem_is_locked(rws) (rt_mutex_is_locked(&(rws)->lock)) + +#define PICK_RWSEM_OP(...) PICK_FUNCTION(struct compat_rw_semaphore *, \ + struct rw_semaphore *, ##__VA_ARGS__) +#define PICK_RWSEM_OP_RET(...) PICK_FUNCTION_RET(struct compat_rw_semaphore *,\ + struct rw_semaphore *, ##__VA_ARGS__) + +#define init_rwsem(rwsem) PICK_RWSEM_OP(compat_init_rwsem, rt_init_rwsem, rwsem) + +#define down_read(rwsem) PICK_RWSEM_OP(compat_down_read, rt_down_read, rwsem) + +#define down_read_non_owner(rwsem) \ + PICK_RWSEM_OP(compat_down_read_non_owner, rt_down_read_non_owner, rwsem) + +#define down_read_trylock(rwsem) \ + PICK_RWSEM_OP_RET(compat_down_read_trylock, rt_down_read_trylock, rwsem) + +#define down_write(rwsem) PICK_RWSEM_OP(compat_down_write, rt_down_write, rwsem) + +#define down_read_nested(rwsem, subclass) \ + PICK_RWSEM_OP(compat_down_read_nested, rt_down_read_nested, \ + rwsem, subclass) + +#define down_write_nested(rwsem, subclass) \ + PICK_RWSEM_OP(compat_down_write_nested, rt_down_write_nested, \ + rwsem, subclass) + +#define down_write_trylock(rwsem) \ + PICK_RWSEM_OP_RET(compat_down_write_trylock, rt_down_write_trylock,\ + rwsem) + +#define up_read(rwsem) PICK_RWSEM_OP(compat_up_read, rt_up_read, rwsem) + +#define up_read_non_owner(rwsem) \ + PICK_RWSEM_OP(compat_up_read_non_owner, rt_up_read_non_owner, rwsem) + +#define up_write(rwsem) PICK_RWSEM_OP(compat_up_write, rt_up_write, rwsem) + +#define downgrade_write(rwsem) \ + PICK_RWSEM_OP(compat_downgrade_write, rt_downgrade_write, rwsem) + +#define rwsem_is_locked(rwsem) \ + PICK_RWSEM_OP_RET(compat_rwsem_is_locked, rt_rwsem_is_locked, rwsem) + +#else +# define preempt_rt 0 +#endif /* CONFIG_PREEMPT_RT */ + +#endif + Index: linux-2.6-tip/include/linux/rtmutex.h =================================================================== --- linux-2.6-tip.orig/include/linux/rtmutex.h +++ linux-2.6-tip/include/linux/rtmutex.h @@ -24,7 +24,7 @@ * @owner: the mutex owner */ struct rt_mutex { - spinlock_t wait_lock; + raw_spinlock_t wait_lock; struct plist_head wait_list; struct task_struct *owner; #ifdef CONFIG_DEBUG_RT_MUTEXES @@ -63,8 +63,8 @@ struct hrtimer_sleeper; #endif #define __RT_MUTEX_INITIALIZER(mutexname) \ - { .wait_lock = __SPIN_LOCK_UNLOCKED(mutexname.wait_lock) \ - , .wait_list = PLIST_HEAD_INIT(mutexname.wait_list, mutexname.wait_lock) \ + { .wait_lock = RAW_SPIN_LOCK_UNLOCKED(mutexname) \ + , .wait_list = PLIST_HEAD_INIT(mutexname.wait_list, &mutexname.wait_lock) \ , .owner = NULL \ __DEBUG_RT_MUTEX_INITIALIZER(mutexname)} @@ -88,6 +88,8 @@ extern void rt_mutex_destroy(struct rt_m extern void rt_mutex_lock(struct rt_mutex *lock); extern int rt_mutex_lock_interruptible(struct rt_mutex *lock, int detect_deadlock); +extern int rt_mutex_lock_killable(struct rt_mutex *lock, + int detect_deadlock); extern int rt_mutex_timed_lock(struct rt_mutex *lock, struct hrtimer_sleeper *timeout, int detect_deadlock); @@ -98,7 +100,7 @@ extern void rt_mutex_unlock(struct rt_mu #ifdef CONFIG_RT_MUTEXES # define INIT_RT_MUTEXES(tsk) \ - .pi_waiters = PLIST_HEAD_INIT(tsk.pi_waiters, tsk.pi_lock), \ + .pi_waiters = PLIST_HEAD_INIT(tsk.pi_waiters, &tsk.pi_lock), \ INIT_RT_MUTEX_DEBUG(tsk) #else # define INIT_RT_MUTEXES(tsk) Index: linux-2.6-tip/include/linux/rwsem-spinlock.h =================================================================== --- linux-2.6-tip.orig/include/linux/rwsem-spinlock.h +++ linux-2.6-tip/include/linux/rwsem-spinlock.h @@ -28,7 +28,7 @@ struct rwsem_waiter; * - if activity is -1 then there is one active writer * - if wait_list is not empty, then there are processes waiting for the semaphore */ -struct rw_semaphore { +struct compat_rw_semaphore { __s32 activity; spinlock_t wait_lock; struct list_head wait_list; @@ -43,33 +43,32 @@ struct rw_semaphore { # define __RWSEM_DEP_MAP_INIT(lockname) #endif -#define __RWSEM_INITIALIZER(name) \ -{ 0, __SPIN_LOCK_UNLOCKED(name.wait_lock), LIST_HEAD_INIT((name).wait_list) \ - __RWSEM_DEP_MAP_INIT(name) } +#define __COMPAT_RWSEM_INITIALIZER(name) \ +{ 0, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) __RWSEM_DEP_MAP_INIT(name) } -#define DECLARE_RWSEM(name) \ - struct rw_semaphore name = __RWSEM_INITIALIZER(name) +#define COMPAT_DECLARE_RWSEM(name) \ + struct compat_rw_semaphore name = __COMPAT_RWSEM_INITIALIZER(name) -extern void __init_rwsem(struct rw_semaphore *sem, const char *name, +extern void __compat_init_rwsem(struct compat_rw_semaphore *sem, const char *name, struct lock_class_key *key); -#define init_rwsem(sem) \ +#define compat_init_rwsem(sem) \ do { \ static struct lock_class_key __key; \ \ - __init_rwsem((sem), #sem, &__key); \ + __compat_init_rwsem((sem), #sem, &__key); \ } while (0) -extern void __down_read(struct rw_semaphore *sem); -extern int __down_read_trylock(struct rw_semaphore *sem); -extern void __down_write(struct rw_semaphore *sem); -extern void __down_write_nested(struct rw_semaphore *sem, int subclass); -extern int __down_write_trylock(struct rw_semaphore *sem); -extern void __up_read(struct rw_semaphore *sem); -extern void __up_write(struct rw_semaphore *sem); -extern void __downgrade_write(struct rw_semaphore *sem); +extern void __down_read(struct compat_rw_semaphore *sem); +extern int __down_read_trylock(struct compat_rw_semaphore *sem); +extern void __down_write(struct compat_rw_semaphore *sem); +extern void __down_write_nested(struct compat_rw_semaphore *sem, int subclass); +extern int __down_write_trylock(struct compat_rw_semaphore *sem); +extern void __up_read(struct compat_rw_semaphore *sem); +extern void __up_write(struct compat_rw_semaphore *sem); +extern void __downgrade_write(struct compat_rw_semaphore *sem); -static inline int rwsem_is_locked(struct rw_semaphore *sem) +static inline int compat_rwsem_is_locked(struct compat_rw_semaphore *sem) { return (sem->activity != 0); } Index: linux-2.6-tip/include/linux/rwsem.h =================================================================== --- linux-2.6-tip.orig/include/linux/rwsem.h +++ linux-2.6-tip/include/linux/rwsem.h @@ -9,53 +9,68 @@ #include +#ifdef CONFIG_PREEMPT_RT +# include +#endif + #include #include #include #include -struct rw_semaphore; +#ifndef CONFIG_PREEMPT_RT +/* + * On !PREEMPT_RT all rw-semaphores are compat: + */ +#define compat_rw_semaphore rw_semaphore +#endif + +struct compat_rw_semaphore; #ifdef CONFIG_RWSEM_GENERIC_SPINLOCK -#include /* use a generic implementation */ +# include /* use a generic implementation */ +# ifndef CONFIG_PREEMPT_RT +# define __RWSEM_INITIALIZER __COMPAT_RWSEM_INITIALIZER +# define DECLARE_RWSEM COMPAT_DECLARE_RWSEM +# endif #else -#include /* use an arch-specific implementation */ +# include /* use an arch-specific implementation */ #endif /* * lock for reading */ -extern void down_read(struct rw_semaphore *sem); +extern void compat_down_read(struct compat_rw_semaphore *sem); /* * trylock for reading -- returns 1 if successful, 0 if contention */ -extern int down_read_trylock(struct rw_semaphore *sem); +extern int compat_down_read_trylock(struct compat_rw_semaphore *sem); /* * lock for writing */ -extern void down_write(struct rw_semaphore *sem); +extern void compat_down_write(struct compat_rw_semaphore *sem); /* * trylock for writing -- returns 1 if successful, 0 if contention */ -extern int down_write_trylock(struct rw_semaphore *sem); +extern int compat_down_write_trylock(struct compat_rw_semaphore *sem); /* * release a read lock */ -extern void up_read(struct rw_semaphore *sem); +extern void compat_up_read(struct compat_rw_semaphore *sem); /* * release a write lock */ -extern void up_write(struct rw_semaphore *sem); +extern void compat_up_write(struct compat_rw_semaphore *sem); /* * downgrade write lock to read lock */ -extern void downgrade_write(struct rw_semaphore *sem); +extern void compat_downgrade_write(struct compat_rw_semaphore *sem); #ifdef CONFIG_DEBUG_LOCK_ALLOC /* @@ -71,21 +86,78 @@ extern void downgrade_write(struct rw_se * lockdep_set_class() at lock initialization time. * See Documentation/lockdep-design.txt for more details.) */ -extern void down_read_nested(struct rw_semaphore *sem, int subclass); -extern void down_write_nested(struct rw_semaphore *sem, int subclass); +extern void +compat_down_read_nested(struct compat_rw_semaphore *sem, int subclass); +extern void +compat_down_write_nested(struct compat_rw_semaphore *sem, int subclass); /* * Take/release a lock when not the owner will release it. * * [ This API should be avoided as much as possible - the * proper abstraction for this case is completions. ] */ -extern void down_read_non_owner(struct rw_semaphore *sem); -extern void up_read_non_owner(struct rw_semaphore *sem); +extern void +compat_down_read_non_owner(struct compat_rw_semaphore *sem); +extern void +compat_up_read_non_owner(struct compat_rw_semaphore *sem); #else -# define down_read_nested(sem, subclass) down_read(sem) -# define down_write_nested(sem, subclass) down_write(sem) -# define down_read_non_owner(sem) down_read(sem) -# define up_read_non_owner(sem) up_read(sem) +# define compat_down_read_nested(sem, subclass) compat_down_read(sem) +# define compat_down_write_nested(sem, subclass) compat_down_write(sem) +# define compat_down_read_non_owner(sem) compat_down_read(sem) +# define compat_up_read_non_owner(sem) compat_up_read(sem) #endif +#ifndef CONFIG_PREEMPT_RT + +#define DECLARE_RWSEM COMPAT_DECLARE_RWSEM + +/* + * NOTE, lockdep: this has to be a macro, so that separate class-keys + * get generated by the compiler, if the same function does multiple + * init_rwsem() calls to different rwsems. + */ +#define init_rwsem(rwsem) compat_init_rwsem(rwsem) + +static inline void down_read(struct compat_rw_semaphore *rwsem) +{ + compat_down_read(rwsem); +} +static inline int down_read_trylock(struct compat_rw_semaphore *rwsem) +{ + return compat_down_read_trylock(rwsem); +} +static inline void down_write(struct compat_rw_semaphore *rwsem) +{ + compat_down_write(rwsem); +} +static inline int down_write_trylock(struct compat_rw_semaphore *rwsem) +{ + return compat_down_write_trylock(rwsem); +} +static inline void up_read(struct compat_rw_semaphore *rwsem) +{ + compat_up_read(rwsem); +} +static inline void up_write(struct compat_rw_semaphore *rwsem) +{ + compat_up_write(rwsem); +} +static inline void downgrade_write(struct compat_rw_semaphore *rwsem) +{ + compat_downgrade_write(rwsem); +} +static inline int rwsem_is_locked(struct compat_rw_semaphore *sem) +{ + return compat_rwsem_is_locked(sem); +} +# define down_read_nested(sem, subclass) \ + compat_down_read_nested(sem, subclass) +# define down_write_nested(sem, subclass) \ + compat_down_write_nested(sem, subclass) +# define down_read_non_owner(sem) \ + compat_down_read_non_owner(sem) +# define up_read_non_owner(sem) \ + compat_up_read_non_owner(sem) +#endif /* !CONFIG_PREEMPT_RT */ + #endif /* _LINUX_RWSEM_H */ Index: linux-2.6-tip/include/linux/semaphore.h =================================================================== --- linux-2.6-tip.orig/include/linux/semaphore.h +++ linux-2.6-tip/include/linux/semaphore.h @@ -9,41 +9,86 @@ #ifndef __LINUX_SEMAPHORE_H #define __LINUX_SEMAPHORE_H -#include -#include +#ifndef CONFIG_PREEMPT_RT +# define compat_semaphore semaphore +#endif + +# include +# include /* Please don't access any members of this structure directly */ -struct semaphore { +struct compat_semaphore { spinlock_t lock; unsigned int count; struct list_head wait_list; }; -#define __SEMAPHORE_INITIALIZER(name, n) \ +#define __COMPAT_SEMAPHORE_INITIALIZER(name, n) \ { \ .lock = __SPIN_LOCK_UNLOCKED((name).lock), \ .count = n, \ .wait_list = LIST_HEAD_INIT((name).wait_list), \ } -#define DECLARE_MUTEX(name) \ - struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1) +#define __COMPAT_DECLARE_SEMAPHORE_GENERIC(name, count) \ + struct compat_semaphore name = __COMPAT_SEMAPHORE_INITIALIZER(name, count) -static inline void sema_init(struct semaphore *sem, int val) +#define COMPAT_DECLARE_MUTEX(name) __COMPAT_DECLARE_SEMAPHORE_GENERIC(name, 1) +static inline void compat_sema_init(struct compat_semaphore *sem, int val) { static struct lock_class_key __key; - *sem = (struct semaphore) __SEMAPHORE_INITIALIZER(*sem, val); + *sem = (struct compat_semaphore) __COMPAT_SEMAPHORE_INITIALIZER(*sem, val); lockdep_init_map(&sem->lock.dep_map, "semaphore->lock", &__key, 0); } -#define init_MUTEX(sem) sema_init(sem, 1) -#define init_MUTEX_LOCKED(sem) sema_init(sem, 0) +#define compat_init_MUTEX(sem) compat_sema_init(sem, 1) +#define compat_init_MUTEX_LOCKED(sem) compat_sema_init(sem, 0) + +extern void compat_down(struct compat_semaphore *sem); +extern int __must_check compat_down_interruptible(struct compat_semaphore *sem); +extern int __must_check compat_down_killable(struct compat_semaphore *sem); +extern int __must_check compat_down_trylock(struct compat_semaphore *sem); +extern int __must_check compat_down_timeout(struct compat_semaphore *sem, long jiffies); +extern void compat_up(struct compat_semaphore *sem); + +#ifdef CONFIG_PREEMPT_RT +# include +#else +#define DECLARE_MUTEX COMPAT_DECLARE_MUTEX -extern void down(struct semaphore *sem); -extern int __must_check down_interruptible(struct semaphore *sem); -extern int __must_check down_killable(struct semaphore *sem); -extern int __must_check down_trylock(struct semaphore *sem); -extern int __must_check down_timeout(struct semaphore *sem, long jiffies); -extern void up(struct semaphore *sem); +static inline void sema_init(struct compat_semaphore *sem, int val) +{ + compat_sema_init(sem, val); +} +static inline void init_MUTEX(struct compat_semaphore *sem) +{ + compat_init_MUTEX(sem); +} +static inline void init_MUTEX_LOCKED(struct compat_semaphore *sem) +{ + compat_init_MUTEX_LOCKED(sem); +} +static inline void down(struct compat_semaphore *sem) +{ + compat_down(sem); +} +static inline int down_interruptible(struct compat_semaphore *sem) +{ + return compat_down_interruptible(sem); +} +static inline int down_trylock(struct compat_semaphore *sem) +{ + return compat_down_trylock(sem); +} +static inline int down_timeout(struct compat_semaphore *sem, long jiffies) +{ + return compat_down_timeout(sem, jiffies); +} + +static inline void up(struct compat_semaphore *sem) +{ + compat_up(sem); +} +#endif /* CONFIG_PREEMPT_RT */ #endif /* __LINUX_SEMAPHORE_H */ Index: linux-2.6-tip/include/linux/seqlock.h =================================================================== --- linux-2.6-tip.orig/include/linux/seqlock.h +++ linux-2.6-tip/include/linux/seqlock.h @@ -3,9 +3,11 @@ /* * Reader/writer consistent mechanism without starving writers. This type of * lock for data where the reader wants a consistent set of information - * and is willing to retry if the information changes. Readers never - * block but they may have to retry if a writer is in - * progress. Writers do not wait for readers. + * and is willing to retry if the information changes. Readers block + * on write contention (and where applicable, pi-boost the writer). + * Readers without contention on entry acquire the critical section + * without any atomic operations, but they may have to retry if a writer + * enters before the critical section ends. Writers do not wait for readers. * * This is not as cache friendly as brlock. Also, this will not work * for data that contains pointers, because any writer could @@ -24,56 +26,110 @@ * * Based on x86_64 vsyscall gettimeofday * by Keith Owens and Andrea Arcangeli + * + * Priority inheritance and live-lock avoidance by Gregory Haskins */ +#include #include #include typedef struct { unsigned sequence; - spinlock_t lock; -} seqlock_t; + rwlock_t lock; +} __seqlock_t; + +typedef struct { + unsigned sequence; + raw_spinlock_t lock; +} __raw_seqlock_t; + +#define seqlock_need_resched(seq) lock_need_resched(&(seq)->lock) + +#ifdef CONFIG_PREEMPT_RT +typedef __seqlock_t seqlock_t; +#else +typedef __raw_seqlock_t seqlock_t; +#endif + +typedef __raw_seqlock_t raw_seqlock_t; /* * These macros triggered gcc-3.x compile-time problems. We think these are * OK now. Be cautious. */ -#define __SEQLOCK_UNLOCKED(lockname) \ - { 0, __SPIN_LOCK_UNLOCKED(lockname) } +#define __RAW_SEQLOCK_UNLOCKED(lockname) \ + { 0, RAW_SPIN_LOCK_UNLOCKED(lockname) } + +#ifdef CONFIG_PREEMPT_RT +# define __SEQLOCK_UNLOCKED(lockname) { 0, __RW_LOCK_UNLOCKED(lockname) } +#else +# define __SEQLOCK_UNLOCKED(lockname) __RAW_SEQLOCK_UNLOCKED(lockname) +#endif #define SEQLOCK_UNLOCKED \ __SEQLOCK_UNLOCKED(old_style_seqlock_init) -#define seqlock_init(x) \ - do { \ - (x)->sequence = 0; \ - spin_lock_init(&(x)->lock); \ - } while (0) +static inline void __raw_seqlock_init(raw_seqlock_t *seqlock) +{ + *seqlock = (raw_seqlock_t) __RAW_SEQLOCK_UNLOCKED(x); + spin_lock_init(&seqlock->lock); +} + +#ifdef CONFIG_PREEMPT_RT +static inline void __seqlock_init(seqlock_t *seqlock) +{ + *seqlock = (seqlock_t) __SEQLOCK_UNLOCKED(seqlock); + rwlock_init(&seqlock->lock); +} +#else +extern void __seqlock_init(seqlock_t *seqlock); +#endif + +#define seqlock_init(seq) \ + PICK_FUNCTION(raw_seqlock_t *, seqlock_t *, \ + __raw_seqlock_init, __seqlock_init, seq); #define DEFINE_SEQLOCK(x) \ seqlock_t x = __SEQLOCK_UNLOCKED(x) +#define DEFINE_RAW_SEQLOCK(name) \ + raw_seqlock_t name __cacheline_aligned_in_smp = \ + __RAW_SEQLOCK_UNLOCKED(name) + + /* Lock out other writers and update the count. * Acts like a normal spin_lock/unlock. * Don't need preempt_disable() because that is in the spin_lock already. */ -static inline void write_seqlock(seqlock_t *sl) +static inline void __write_seqlock(seqlock_t *sl) { - spin_lock(&sl->lock); + write_lock(&sl->lock); ++sl->sequence; smp_wmb(); } -static inline void write_sequnlock(seqlock_t *sl) +static __always_inline unsigned long __write_seqlock_irqsave(seqlock_t *sl) +{ + unsigned long flags; + + local_save_flags(flags); + __write_seqlock(sl); + return flags; +} + +static inline void __write_sequnlock(seqlock_t *sl) { smp_wmb(); sl->sequence++; - spin_unlock(&sl->lock); + write_unlock(&sl->lock); } -static inline int write_tryseqlock(seqlock_t *sl) +#define __write_sequnlock_irqrestore(sl, flags) __write_sequnlock(sl) + +static inline int __write_tryseqlock(seqlock_t *sl) { - int ret = spin_trylock(&sl->lock); + int ret = write_trylock(&sl->lock); if (ret) { ++sl->sequence; @@ -83,18 +139,25 @@ static inline int write_tryseqlock(seqlo } /* Start of read calculation -- fetch last complete writer token */ -static __always_inline unsigned read_seqbegin(const seqlock_t *sl) +static __always_inline unsigned __read_seqbegin(seqlock_t *sl) { unsigned ret; -repeat: ret = sl->sequence; smp_rmb(); if (unlikely(ret & 1)) { - cpu_relax(); - goto repeat; + /* + * Serialze with the writer which will ensure they are + * pi-boosted if necessary and prevent us from starving + * them. + */ + read_lock(&sl->lock); + ret = sl->sequence; + read_unlock(&sl->lock); } + BUG_ON(ret & 1); + return ret; } @@ -103,13 +166,192 @@ repeat: * * If sequence value changed then writer changed data while in section. */ -static __always_inline int read_seqretry(const seqlock_t *sl, unsigned start) +static inline int __read_seqretry(seqlock_t *sl, unsigned iv) +{ + smp_rmb(); + return (sl->sequence != iv); +} + +static __always_inline void __write_seqlock_raw(raw_seqlock_t *sl) +{ + spin_lock(&sl->lock); + ++sl->sequence; + smp_wmb(); +} + +static __always_inline unsigned long +__write_seqlock_irqsave_raw(raw_seqlock_t *sl) +{ + unsigned long flags; + + local_irq_save(flags); + __write_seqlock_raw(sl); + return flags; +} + +static __always_inline void __write_seqlock_irq_raw(raw_seqlock_t *sl) +{ + local_irq_disable(); + __write_seqlock_raw(sl); +} + +static __always_inline void __write_seqlock_bh_raw(raw_seqlock_t *sl) +{ + local_bh_disable(); + __write_seqlock_raw(sl); +} + +static __always_inline void __write_sequnlock_raw(raw_seqlock_t *sl) +{ + smp_wmb(); + sl->sequence++; + spin_unlock(&sl->lock); +} + +static __always_inline void +__write_sequnlock_irqrestore_raw(raw_seqlock_t *sl, unsigned long flags) +{ + __write_sequnlock_raw(sl); + local_irq_restore(flags); + preempt_check_resched(); +} + +static __always_inline void __write_sequnlock_irq_raw(raw_seqlock_t *sl) +{ + __write_sequnlock_raw(sl); + local_irq_enable(); + preempt_check_resched(); +} + +static __always_inline void __write_sequnlock_bh_raw(raw_seqlock_t *sl) +{ + __write_sequnlock_raw(sl); + local_bh_enable(); +} + +static __always_inline int __write_tryseqlock_raw(raw_seqlock_t *sl) +{ + int ret = spin_trylock(&sl->lock); + + if (ret) { + ++sl->sequence; + smp_wmb(); + } + return ret; +} + +static __always_inline unsigned __read_seqbegin_raw(const raw_seqlock_t *sl) +{ + unsigned ret; + +repeat: + ret = sl->sequence; + smp_rmb(); + if (unlikely(ret & 1)) { + cpu_relax(); + goto repeat; + } + + return ret; +} + +static __always_inline int __read_seqretry_raw(const raw_seqlock_t *sl, unsigned start) { smp_rmb(); return (sl->sequence != start); } +extern int __bad_seqlock_type(void); + +/* + * PICK_SEQ_OP() is a small redirector to allow less typing of the lock + * types raw_seqlock_t, seqlock_t, at the front of the PICK_FUNCTION + * macro. + */ +#define PICK_SEQ_OP(...) \ + PICK_FUNCTION(raw_seqlock_t *, seqlock_t *, ##__VA_ARGS__) +#define PICK_SEQ_OP_RET(...) \ + PICK_FUNCTION_RET(raw_seqlock_t *, seqlock_t *, ##__VA_ARGS__) + +#define write_seqlock(sl) PICK_SEQ_OP(__write_seqlock_raw, __write_seqlock, sl) + +#define write_sequnlock(sl) \ + PICK_SEQ_OP(__write_sequnlock_raw, __write_sequnlock, sl) + +#define write_tryseqlock(sl) \ + PICK_SEQ_OP_RET(__write_tryseqlock_raw, __write_tryseqlock, sl) + +#define read_seqbegin(sl) \ + PICK_SEQ_OP_RET(__read_seqbegin_raw, __read_seqbegin, sl) + +#define read_seqretry(sl, iv) \ + PICK_SEQ_OP_RET(__read_seqretry_raw, __read_seqretry, sl, iv) + +#define write_seqlock_irqsave(lock, flags) \ +do { \ + flags = PICK_SEQ_OP_RET(__write_seqlock_irqsave_raw, \ + __write_seqlock_irqsave, lock); \ +} while (0) + +#define write_seqlock_irq(lock) \ + PICK_SEQ_OP(__write_seqlock_irq_raw, __write_seqlock, lock) + +#define write_seqlock_bh(lock) \ + PICK_SEQ_OP(__write_seqlock_bh_raw, __write_seqlock, lock) + +#define write_sequnlock_irqrestore(lock, flags) \ + PICK_SEQ_OP(__write_sequnlock_irqrestore_raw, \ + __write_sequnlock_irqrestore, lock, flags) + +#define write_sequnlock_bh(lock) \ + PICK_SEQ_OP(__write_sequnlock_bh_raw, __write_sequnlock, lock) + +#define write_sequnlock_irq(lock) \ + PICK_SEQ_OP(__write_sequnlock_irq_raw, __write_sequnlock, lock) + +static __always_inline +unsigned long __seq_irqsave_raw(raw_seqlock_t *sl) +{ + unsigned long flags; + + local_irq_save(flags); + return flags; +} + +static __always_inline unsigned long __seq_irqsave(seqlock_t *sl) +{ + unsigned long flags; + + local_save_flags(flags); + return flags; +} + +#define read_seqbegin_irqsave(lock, flags) \ +({ \ + flags = PICK_SEQ_OP_RET(__seq_irqsave_raw, __seq_irqsave, lock);\ + read_seqbegin(lock); \ +}) + +static __always_inline int +__read_seqretry_irqrestore(seqlock_t *sl, unsigned iv, unsigned long flags) +{ + return __read_seqretry(sl, iv); +} + +static __always_inline int +__read_seqretry_irqrestore_raw(raw_seqlock_t *sl, unsigned iv, + unsigned long flags) +{ + int ret = read_seqretry(sl, iv); + local_irq_restore(flags); + preempt_check_resched(); + return ret; +} + +#define read_seqretry_irqrestore(lock, iv, flags) \ + PICK_SEQ_OP_RET(__read_seqretry_irqrestore_raw, \ + __read_seqretry_irqrestore, lock, iv, flags) /* * Version using sequence counter only. @@ -166,32 +408,4 @@ static inline void write_seqcount_end(se smp_wmb(); s->sequence++; } - -/* - * Possible sw/hw IRQ protected versions of the interfaces. - */ -#define write_seqlock_irqsave(lock, flags) \ - do { local_irq_save(flags); write_seqlock(lock); } while (0) -#define write_seqlock_irq(lock) \ - do { local_irq_disable(); write_seqlock(lock); } while (0) -#define write_seqlock_bh(lock) \ - do { local_bh_disable(); write_seqlock(lock); } while (0) - -#define write_sequnlock_irqrestore(lock, flags) \ - do { write_sequnlock(lock); local_irq_restore(flags); } while(0) -#define write_sequnlock_irq(lock) \ - do { write_sequnlock(lock); local_irq_enable(); } while(0) -#define write_sequnlock_bh(lock) \ - do { write_sequnlock(lock); local_bh_enable(); } while(0) - -#define read_seqbegin_irqsave(lock, flags) \ - ({ local_irq_save(flags); read_seqbegin(lock); }) - -#define read_seqretry_irqrestore(lock, iv, flags) \ - ({ \ - int ret = read_seqretry(lock, iv); \ - local_irq_restore(flags); \ - ret; \ - }) - #endif /* __LINUX_SEQLOCK_H */ Index: linux-2.6-tip/include/linux/spinlock.h =================================================================== --- linux-2.6-tip.orig/include/linux/spinlock.h +++ linux-2.6-tip/include/linux/spinlock.h @@ -44,6 +44,42 @@ * builds the _spin_*() APIs. * * linux/spinlock.h: builds the final spin_*() APIs. + * + * + * Public types and naming conventions: + * ------------------------------------ + * spinlock_t: type: sleep-lock + * raw_spinlock_t: type: spin-lock (debug) + * + * spin_lock([raw_]spinlock_t): API: acquire lock, both types + * + * + * Internal types and naming conventions: + * ------------------------------------- + * __raw_spinlock_t: type: lowlevel spin-lock + * + * _spin_lock(struct rt_mutex): API: acquire sleep-lock + * __spin_lock(raw_spinlock_t): API: acquire spin-lock (highlevel) + * _raw_spin_lock(raw_spinlock_t): API: acquire spin-lock (debug) + * __raw_spin_lock(__raw_spinlock_t): API: acquire spin-lock (lowlevel) + * + * + * spin_lock(raw_spinlock_t) translates into the following chain of + * calls/inlines/macros, if spin-lock debugging is enabled: + * + * spin_lock() [include/linux/spinlock.h] + * -> __spin_lock() [kernel/spinlock.c] + * -> _raw_spin_lock() [lib/spinlock_debug.c] + * -> __raw_spin_lock() [include/asm/spinlock.h] + * + * spin_lock(spinlock_t) translates into the following chain of + * calls/inlines/macros: + * + * spin_lock() [include/linux/spinlock.h] + * -> _spin_lock() [include/linux/spinlock.h] + * -> rt_spin_lock() [kernel/rtmutex.c] + * -> rt_spin_lock_fastlock() [kernel/rtmutex.c] + * -> rt_spin_lock_slowlock() [kernel/rtmutex.c] */ #include @@ -52,29 +88,15 @@ #include #include #include +#include #include #include +#include +#include #include /* - * Must define these before including other files, inline functions need them - */ -#define LOCK_SECTION_NAME ".text.lock."KBUILD_BASENAME - -#define LOCK_SECTION_START(extra) \ - ".subsection 1\n\t" \ - extra \ - ".ifndef " LOCK_SECTION_NAME "\n\t" \ - LOCK_SECTION_NAME ":\n\t" \ - ".endif\n" - -#define LOCK_SECTION_END \ - ".previous\n\t" - -#define __lockfunc __attribute__((section(".spinlock.text"))) - -/* * Pull the raw_spinlock_t and raw_rwlock_t definitions: */ #include @@ -90,36 +112,10 @@ extern int __lockfunc generic__raw_read_ # include #endif -#ifdef CONFIG_DEBUG_SPINLOCK - extern void __spin_lock_init(spinlock_t *lock, const char *name, - struct lock_class_key *key); -# define spin_lock_init(lock) \ -do { \ - static struct lock_class_key __key; \ - \ - __spin_lock_init((lock), #lock, &__key); \ -} while (0) - -#else -# define spin_lock_init(lock) \ - do { *(lock) = SPIN_LOCK_UNLOCKED; } while (0) -#endif - -#ifdef CONFIG_DEBUG_SPINLOCK - extern void __rwlock_init(rwlock_t *lock, const char *name, - struct lock_class_key *key); -# define rwlock_init(lock) \ -do { \ - static struct lock_class_key __key; \ - \ - __rwlock_init((lock), #lock, &__key); \ -} while (0) -#else -# define rwlock_init(lock) \ - do { *(lock) = RW_LOCK_UNLOCKED; } while (0) -#endif - -#define spin_is_locked(lock) __raw_spin_is_locked(&(lock)->raw_lock) +/* + * Pull the RT types: + */ +#include #ifdef CONFIG_GENERIC_LOCKBREAK #define spin_is_contended(lock) ((lock)->break_lock) @@ -132,12 +128,6 @@ do { \ #endif /*__raw_spin_is_contended*/ #endif -/** - * spin_unlock_wait - wait until the spinlock gets unlocked - * @lock: the spinlock in question. - */ -#define spin_unlock_wait(lock) __raw_spin_unlock_wait(&(lock)->raw_lock) - /* * Pull the _spin_*()/_read_*()/_write_*() functions/declarations: */ @@ -148,16 +138,16 @@ do { \ #endif #ifdef CONFIG_DEBUG_SPINLOCK - extern void _raw_spin_lock(spinlock_t *lock); -#define _raw_spin_lock_flags(lock, flags) _raw_spin_lock(lock) - extern int _raw_spin_trylock(spinlock_t *lock); - extern void _raw_spin_unlock(spinlock_t *lock); - extern void _raw_read_lock(rwlock_t *lock); - extern int _raw_read_trylock(rwlock_t *lock); - extern void _raw_read_unlock(rwlock_t *lock); - extern void _raw_write_lock(rwlock_t *lock); - extern int _raw_write_trylock(rwlock_t *lock); - extern void _raw_write_unlock(rwlock_t *lock); + extern __lockfunc void _raw_spin_lock(raw_spinlock_t *lock); +# define _raw_spin_lock_flags(lock, flags) _raw_spin_lock(lock) + extern __lockfunc int _raw_spin_trylock(raw_spinlock_t *lock); + extern __lockfunc void _raw_spin_unlock(raw_spinlock_t *lock); + extern __lockfunc void _raw_read_lock(raw_rwlock_t *lock); + extern __lockfunc int _raw_read_trylock(raw_rwlock_t *lock); + extern __lockfunc void _raw_read_unlock(raw_rwlock_t *lock); + extern __lockfunc void _raw_write_lock(raw_rwlock_t *lock); + extern __lockfunc int _raw_write_trylock(raw_rwlock_t *lock); + extern __lockfunc void _raw_write_unlock(raw_rwlock_t *lock); #else # define _raw_spin_lock(lock) __raw_spin_lock(&(lock)->raw_lock) # define _raw_spin_lock_flags(lock, flags) \ @@ -172,179 +162,440 @@ do { \ # define _raw_write_unlock(rwlock) __raw_write_unlock(&(rwlock)->raw_lock) #endif -#define read_can_lock(rwlock) __raw_read_can_lock(&(rwlock)->raw_lock) -#define write_can_lock(rwlock) __raw_write_can_lock(&(rwlock)->raw_lock) +extern int __bad_spinlock_type(void); +extern int __bad_rwlock_type(void); + +extern void +__rt_spin_lock_init(spinlock_t *lock, char *name, struct lock_class_key *key); + +extern void __lockfunc rt_spin_lock(spinlock_t *lock); +extern void __lockfunc rt_spin_lock_nested(spinlock_t *lock, int subclass); +extern void __lockfunc rt_spin_unlock(spinlock_t *lock); +extern void __lockfunc rt_spin_unlock_wait(spinlock_t *lock); +extern int __lockfunc +rt_spin_trylock_irqsave(spinlock_t *lock, unsigned long *flags); +extern int __lockfunc rt_spin_trylock(spinlock_t *lock); +extern int _atomic_dec_and_spin_lock(spinlock_t *lock, atomic_t *atomic); + +/* + * lockdep-less calls, for derived types like rwlock: + * (for trylock they can use rt_mutex_trylock() directly. + */ +extern void __lockfunc __rt_spin_lock(struct rt_mutex *lock); +extern void __lockfunc __rt_spin_unlock(struct rt_mutex *lock); + +#ifdef CONFIG_PREEMPT_RT +# define _spin_lock(l) rt_spin_lock(l) +# define _spin_lock_nested(l, s) rt_spin_lock_nested(l, s) +# define _spin_lock_bh(l) rt_spin_lock(l) +# define _spin_lock_irq(l) rt_spin_lock(l) +# define _spin_unlock(l) rt_spin_unlock(l) +# define _spin_unlock_no_resched(l) rt_spin_unlock(l) +# define _spin_unlock_bh(l) rt_spin_unlock(l) +# define _spin_unlock_irq(l) rt_spin_unlock(l) +# define _spin_unlock_irqrestore(l, f) rt_spin_unlock(l) +static inline unsigned long __lockfunc _spin_lock_irqsave(spinlock_t *lock) +{ + rt_spin_lock(lock); + return 0; +} +static inline unsigned long __lockfunc +_spin_lock_irqsave_nested(spinlock_t *lock, int subclass) +{ + rt_spin_lock_nested(lock, subclass); + return 0; +} +#else +static inline unsigned long __lockfunc _spin_lock_irqsave(spinlock_t *lock) +{ + return 0; +} +static inline unsigned long __lockfunc +_spin_lock_irqsave_nested(spinlock_t *lock, int subclass) +{ + return 0; +} +# define _spin_lock(l) do { } while (0) +# define _spin_lock_nested(l, s) do { } while (0) +# define _spin_lock_bh(l) do { } while (0) +# define _spin_lock_irq(l) do { } while (0) +# define _spin_unlock(l) do { } while (0) +# define _spin_unlock_no_resched(l) do { } while (0) +# define _spin_unlock_bh(l) do { } while (0) +# define _spin_unlock_irq(l) do { } while (0) +# define _spin_unlock_irqrestore(l, f) do { } while (0) +#endif + +#define _spin_lock_init(sl, n, f, l) \ +do { \ + static struct lock_class_key __key; \ + \ + __rt_spin_lock_init(sl, n, &__key); \ +} while (0) + +# ifdef CONFIG_PREEMPT_RT +# define _spin_can_lock(l) (!rt_mutex_is_locked(&(l)->lock)) +# define _spin_is_locked(l) rt_mutex_is_locked(&(l)->lock) +# define _spin_unlock_wait(l) rt_spin_unlock_wait(l) + +# define _spin_trylock(l) rt_spin_trylock(l) +# define _spin_trylock_bh(l) rt_spin_trylock(l) +# define _spin_trylock_irq(l) rt_spin_trylock(l) +# define _spin_trylock_irqsave(l,f) rt_spin_trylock_irqsave(l, f) +# else + + extern int this_should_never_be_called_on_non_rt(spinlock_t *lock); +# define TSNBCONRT(l) this_should_never_be_called_on_non_rt(l) +# define _spin_can_lock(l) TSNBCONRT(l) +# define _spin_is_locked(l) TSNBCONRT(l) +# define _spin_unlock_wait(l) TSNBCONRT(l) + +# define _spin_trylock(l) TSNBCONRT(l) +# define _spin_trylock_bh(l) TSNBCONRT(l) +# define _spin_trylock_irq(l) TSNBCONRT(l) +# define _spin_trylock_irqsave(l,f) TSNBCONRT(l) +#endif + +extern void __lockfunc rt_write_lock(rwlock_t *rwlock); +extern void __lockfunc rt_read_lock(rwlock_t *rwlock); +extern int __lockfunc rt_write_trylock(rwlock_t *rwlock); +extern int __lockfunc rt_write_trylock_irqsave(rwlock_t *trylock, + unsigned long *flags); +extern int __lockfunc rt_read_trylock(rwlock_t *rwlock); +extern void __lockfunc rt_write_unlock(rwlock_t *rwlock); +extern void __lockfunc rt_read_unlock(rwlock_t *rwlock); +extern unsigned long __lockfunc rt_write_lock_irqsave(rwlock_t *rwlock); +extern unsigned long __lockfunc rt_read_lock_irqsave(rwlock_t *rwlock); +extern void +__rt_rwlock_init(rwlock_t *rwlock, char *name, struct lock_class_key *key); + +#define _rwlock_init(rwl, n, f, l) \ +do { \ + static struct lock_class_key __key; \ + \ + __rt_rwlock_init(rwl, n, &__key); \ +} while (0) + +#ifdef CONFIG_PREEMPT_RT +# define rt_read_can_lock(rwl) (!rt_mutex_is_locked(&(rwl)->lock)) +# define rt_write_can_lock(rwl) (!rt_mutex_is_locked(&(rwl)->lock)) +#else + extern int rt_rwlock_can_lock_never_call_on_non_rt(rwlock_t *rwlock); +# define rt_read_can_lock(rwl) rt_rwlock_can_lock_never_call_on_non_rt(rwl) +# define rt_write_can_lock(rwl) rt_rwlock_can_lock_never_call_on_non_rt(rwl) +#endif + +# define _read_can_lock(rwl) rt_read_can_lock(rwl) +# define _write_can_lock(rwl) rt_write_can_lock(rwl) + +# define _read_trylock(rwl) rt_read_trylock(rwl) +# define _write_trylock(rwl) rt_write_trylock(rwl) +# define _write_trylock_irqsave(rwl, flags) \ + rt_write_trylock_irqsave(rwl, flags) + +# define _read_lock(rwl) rt_read_lock(rwl) +# define _write_lock(rwl) rt_write_lock(rwl) +# define _read_unlock(rwl) rt_read_unlock(rwl) +# define _write_unlock(rwl) rt_write_unlock(rwl) + +# define _read_lock_bh(rwl) rt_read_lock(rwl) +# define _write_lock_bh(rwl) rt_write_lock(rwl) +# define _read_unlock_bh(rwl) rt_read_unlock(rwl) +# define _write_unlock_bh(rwl) rt_write_unlock(rwl) + +# define _read_lock_irq(rwl) rt_read_lock(rwl) +# define _write_lock_irq(rwl) rt_write_lock(rwl) +# define _read_unlock_irq(rwl) rt_read_unlock(rwl) +# define _write_unlock_irq(rwl) rt_write_unlock(rwl) + +# define _read_lock_irqsave(rwl) rt_read_lock_irqsave(rwl) +# define _write_lock_irqsave(rwl) rt_write_lock_irqsave(rwl) + +# define _read_unlock_irqrestore(rwl, f) rt_read_unlock(rwl) +# define _write_unlock_irqrestore(rwl, f) rt_write_unlock(rwl) + +#ifdef CONFIG_DEBUG_SPINLOCK + extern void __raw_spin_lock_init(raw_spinlock_t *lock, const char *name, + struct lock_class_key *key); +# define _raw_spin_lock_init(lock, name, file, line) \ +do { \ + static struct lock_class_key __key; \ + \ + __raw_spin_lock_init((lock), #lock, &__key); \ +} while (0) + +#else +#define __raw_spin_lock_init(lock) \ + do { *(lock) = RAW_SPIN_LOCK_UNLOCKED(lock); } while (0) +# define _raw_spin_lock_init(lock, name, file, line) __raw_spin_lock_init(lock) +#endif + +/* + * PICK_SPIN_OP()/PICK_RW_OP() are simple redirectors for PICK_FUNCTION + */ +#define PICK_SPIN_OP(...) \ + PICK_FUNCTION(raw_spinlock_t *, spinlock_t *, ##__VA_ARGS__) +#define PICK_SPIN_OP_RET(...) \ + PICK_FUNCTION_RET(raw_spinlock_t *, spinlock_t *, ##__VA_ARGS__) +#define PICK_RW_OP(...) PICK_FUNCTION(raw_rwlock_t *, rwlock_t *, ##__VA_ARGS__) +#define PICK_RW_OP_RET(...) \ + PICK_FUNCTION_RET(raw_rwlock_t *, rwlock_t *, ##__VA_ARGS__) + +#define spin_lock_init(lock) \ + PICK_SPIN_OP(_raw_spin_lock_init, _spin_lock_init, lock, #lock, \ + __FILE__, __LINE__) + +#ifdef CONFIG_DEBUG_SPINLOCK + extern void __raw_rwlock_init(raw_rwlock_t *lock, const char *name, + struct lock_class_key *key); +# define _raw_rwlock_init(lock, name, file, line) \ +do { \ + static struct lock_class_key __key; \ + \ + __raw_rwlock_init((lock), #lock, &__key); \ +} while (0) +#else +#define __raw_rwlock_init(lock) \ + do { *(lock) = RAW_RW_LOCK_UNLOCKED(lock); } while (0) +# define _raw_rwlock_init(lock, name, file, line) __raw_rwlock_init(lock) +#endif + +#define rwlock_init(lock) \ + PICK_RW_OP(_raw_rwlock_init, _rwlock_init, lock, #lock, \ + __FILE__, __LINE__) + +#define __spin_is_locked(lock) __raw_spin_is_locked(&(lock)->raw_lock) + +#define spin_is_locked(lock) \ + PICK_SPIN_OP_RET(__spin_is_locked, _spin_is_locked, lock) + +#define __spin_unlock_wait(lock) __raw_spin_unlock_wait(&(lock)->raw_lock) + +#define spin_unlock_wait(lock) \ + PICK_SPIN_OP(__spin_unlock_wait, _spin_unlock_wait, lock) /* * Define the various spin_lock and rw_lock methods. Note we define these * regardless of whether CONFIG_SMP or CONFIG_PREEMPT are set. The various * methods are defined as nops in the case they are not required. */ -#define spin_trylock(lock) __cond_lock(lock, _spin_trylock(lock)) -#define read_trylock(lock) __cond_lock(lock, _read_trylock(lock)) -#define write_trylock(lock) __cond_lock(lock, _write_trylock(lock)) +#define spin_trylock(lock) \ + __cond_lock(lock, PICK_SPIN_OP_RET(__spin_trylock, _spin_trylock, lock)) + +#define read_trylock(lock) \ + __cond_lock(lock, PICK_RW_OP_RET(__read_trylock, _read_trylock, lock)) -#define spin_lock(lock) _spin_lock(lock) +#define write_trylock(lock) \ + __cond_lock(lock, PICK_RW_OP_RET(__write_trylock, _write_trylock, lock)) + +#define write_trylock_irqsave(lock, flags) \ + __cond_lock(lock, PICK_RW_OP_RET(__write_trylock_irqsave, \ + _write_trylock_irqsave, lock, &flags)) + +#define __spin_can_lock(lock) __raw_spin_can_lock(&(lock)->raw_lock) +#define __read_can_lock(lock) __raw_read_can_lock(&(lock)->raw_lock) +#define __write_can_lock(lock) __raw_write_can_lock(&(lock)->raw_lock) + +#define read_can_lock(lock) \ + __cond_lock(lock, PICK_RW_OP_RET(__read_can_lock, _read_can_lock, lock)) + +#define write_can_lock(lock) \ + __cond_lock(lock, PICK_RW_OP_RET(__write_can_lock, _write_can_lock,\ + lock)) + +#define spin_lock(lock) PICK_SPIN_OP(__spin_lock, _spin_lock, lock) #ifdef CONFIG_DEBUG_LOCK_ALLOC -# define spin_lock_nested(lock, subclass) _spin_lock_nested(lock, subclass) -# define spin_lock_nest_lock(lock, nest_lock) \ - do { \ - typecheck(struct lockdep_map *, &(nest_lock)->dep_map);\ - _spin_lock_nest_lock(lock, &(nest_lock)->dep_map); \ - } while (0) +# define spin_lock_nested(lock, subclass) \ + PICK_SPIN_OP(__spin_lock_nested, _spin_lock_nested, lock, subclass) #else -# define spin_lock_nested(lock, subclass) _spin_lock(lock) -# define spin_lock_nest_lock(lock, nest_lock) _spin_lock(lock) +# define spin_lock_nested(lock, subclass) spin_lock(lock) #endif -#define write_lock(lock) _write_lock(lock) -#define read_lock(lock) _read_lock(lock) +#define write_lock(lock) PICK_RW_OP(__write_lock, _write_lock, lock) -#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) +#define read_lock(lock) PICK_RW_OP(__read_lock, _read_lock, lock) -#define spin_lock_irqsave(lock, flags) \ - do { \ - typecheck(unsigned long, flags); \ - flags = _spin_lock_irqsave(lock); \ - } while (0) -#define read_lock_irqsave(lock, flags) \ - do { \ - typecheck(unsigned long, flags); \ - flags = _read_lock_irqsave(lock); \ - } while (0) -#define write_lock_irqsave(lock, flags) \ - do { \ - typecheck(unsigned long, flags); \ - flags = _write_lock_irqsave(lock); \ - } while (0) +# define spin_lock_irqsave(lock, flags) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + flags = PICK_SPIN_OP_RET(__spin_lock_irqsave, _spin_lock_irqsave, \ + lock); \ +} while (0) #ifdef CONFIG_DEBUG_LOCK_ALLOC -#define spin_lock_irqsave_nested(lock, flags, subclass) \ - do { \ - typecheck(unsigned long, flags); \ - flags = _spin_lock_irqsave_nested(lock, subclass); \ - } while (0) -#else -#define spin_lock_irqsave_nested(lock, flags, subclass) \ - do { \ - typecheck(unsigned long, flags); \ - flags = _spin_lock_irqsave(lock); \ - } while (0) -#endif - -#else - -#define spin_lock_irqsave(lock, flags) \ - do { \ - typecheck(unsigned long, flags); \ - _spin_lock_irqsave(lock, flags); \ - } while (0) -#define read_lock_irqsave(lock, flags) \ - do { \ - typecheck(unsigned long, flags); \ - _read_lock_irqsave(lock, flags); \ - } while (0) -#define write_lock_irqsave(lock, flags) \ - do { \ - typecheck(unsigned long, flags); \ - _write_lock_irqsave(lock, flags); \ - } while (0) -#define spin_lock_irqsave_nested(lock, flags, subclass) \ - spin_lock_irqsave(lock, flags) - -#endif - -#define spin_lock_irq(lock) _spin_lock_irq(lock) -#define spin_lock_bh(lock) _spin_lock_bh(lock) - -#define read_lock_irq(lock) _read_lock_irq(lock) -#define read_lock_bh(lock) _read_lock_bh(lock) - -#define write_lock_irq(lock) _write_lock_irq(lock) -#define write_lock_bh(lock) _write_lock_bh(lock) - -/* - * We inline the unlock functions in the nondebug case: - */ -#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) || \ - !defined(CONFIG_SMP) -# define spin_unlock(lock) _spin_unlock(lock) -# define read_unlock(lock) _read_unlock(lock) -# define write_unlock(lock) _write_unlock(lock) -# define spin_unlock_irq(lock) _spin_unlock_irq(lock) -# define read_unlock_irq(lock) _read_unlock_irq(lock) -# define write_unlock_irq(lock) _write_unlock_irq(lock) -#else -# define spin_unlock(lock) \ - do {__raw_spin_unlock(&(lock)->raw_lock); __release(lock); } while (0) -# define read_unlock(lock) \ - do {__raw_read_unlock(&(lock)->raw_lock); __release(lock); } while (0) -# define write_unlock(lock) \ - do {__raw_write_unlock(&(lock)->raw_lock); __release(lock); } while (0) -# define spin_unlock_irq(lock) \ -do { \ - __raw_spin_unlock(&(lock)->raw_lock); \ - __release(lock); \ - local_irq_enable(); \ -} while (0) -# define read_unlock_irq(lock) \ -do { \ - __raw_read_unlock(&(lock)->raw_lock); \ - __release(lock); \ - local_irq_enable(); \ -} while (0) -# define write_unlock_irq(lock) \ -do { \ - __raw_write_unlock(&(lock)->raw_lock); \ - __release(lock); \ - local_irq_enable(); \ -} while (0) -#endif - -#define spin_unlock_irqrestore(lock, flags) \ - do { \ - typecheck(unsigned long, flags); \ - _spin_unlock_irqrestore(lock, flags); \ - } while (0) -#define spin_unlock_bh(lock) _spin_unlock_bh(lock) - -#define read_unlock_irqrestore(lock, flags) \ - do { \ - typecheck(unsigned long, flags); \ - _read_unlock_irqrestore(lock, flags); \ - } while (0) -#define read_unlock_bh(lock) _read_unlock_bh(lock) - -#define write_unlock_irqrestore(lock, flags) \ - do { \ - typecheck(unsigned long, flags); \ - _write_unlock_irqrestore(lock, flags); \ - } while (0) -#define write_unlock_bh(lock) _write_unlock_bh(lock) - -#define spin_trylock_bh(lock) __cond_lock(lock, _spin_trylock_bh(lock)) - -#define spin_trylock_irq(lock) \ -({ \ - local_irq_disable(); \ - spin_trylock(lock) ? \ - 1 : ({ local_irq_enable(); 0; }); \ -}) +# define spin_lock_irqsave_nested(lock, flags, subclass) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + flags = PICK_SPIN_OP_RET(__spin_lock_irqsave_nested, \ + _spin_lock_irqsave_nested, lock, subclass); \ +} while (0) +#else +# define spin_lock_irqsave_nested(lock, flags, subclass) \ + spin_lock_irqsave(lock, flags) +#endif + +# define read_lock_irqsave(lock, flags) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + flags = PICK_RW_OP_RET(__read_lock_irqsave, _read_lock_irqsave, lock);\ +} while (0) + +# define write_lock_irqsave(lock, flags) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + flags = PICK_RW_OP_RET(__write_lock_irqsave, _write_lock_irqsave,lock);\ +} while (0) + +#define spin_lock_irq(lock) PICK_SPIN_OP(__spin_lock_irq, _spin_lock_irq, lock) + +#define spin_lock_bh(lock) PICK_SPIN_OP(__spin_lock_bh, _spin_lock_bh, lock) + +#define read_lock_irq(lock) PICK_RW_OP(__read_lock_irq, _read_lock_irq, lock) + +#define read_lock_bh(lock) PICK_RW_OP(__read_lock_bh, _read_lock_bh, lock) + +#define write_lock_irq(lock) PICK_RW_OP(__write_lock_irq, _write_lock_irq, lock) + +#define write_lock_bh(lock) PICK_RW_OP(__write_lock_bh, _write_lock_bh, lock) + +#define spin_unlock(lock) PICK_SPIN_OP(__spin_unlock, _spin_unlock, lock) + +#define read_unlock(lock) PICK_RW_OP(__read_unlock, _read_unlock, lock) + +#define write_unlock(lock) PICK_RW_OP(__write_unlock, _write_unlock, lock) + +#define spin_unlock_no_resched(lock) \ + PICK_SPIN_OP(__spin_unlock_no_resched, _spin_unlock_no_resched, lock) + +#define spin_unlock_irqrestore(lock, flags) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + PICK_SPIN_OP(__spin_unlock_irqrestore, _spin_unlock_irqrestore, \ + lock, flags); \ +} while (0) + +#define spin_unlock_irq(lock) \ + PICK_SPIN_OP(__spin_unlock_irq, _spin_unlock_irq, lock) +#define spin_unlock_bh(lock) \ + PICK_SPIN_OP(__spin_unlock_bh, _spin_unlock_bh, lock) + +#define read_unlock_irqrestore(lock, flags) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + PICK_RW_OP(__read_unlock_irqrestore, _read_unlock_irqrestore, \ + lock, flags); \ +} while (0) + +#define read_unlock_irq(lock) \ + PICK_RW_OP(__read_unlock_irq, _read_unlock_irq, lock) +#define read_unlock_bh(lock) PICK_RW_OP(__read_unlock_bh, _read_unlock_bh, lock) + +#define write_unlock_irqrestore(lock, flags) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + PICK_RW_OP(__write_unlock_irqrestore, _write_unlock_irqrestore, \ + lock, flags); \ +} while (0) +#define write_unlock_irq(lock) \ + PICK_RW_OP(__write_unlock_irq, _write_unlock_irq, lock) + +#define write_unlock_bh(lock) \ + PICK_RW_OP(__write_unlock_bh, _write_unlock_bh, lock) + +#define spin_trylock_bh(lock) \ + __cond_lock(lock, PICK_SPIN_OP_RET(__spin_trylock_bh, _spin_trylock_bh,\ + lock)) + +#define spin_trylock_irq(lock) \ + __cond_lock(lock, PICK_SPIN_OP_RET(__spin_trylock_irq, \ + _spin_trylock_irq, lock)) #define spin_trylock_irqsave(lock, flags) \ -({ \ - local_irq_save(flags); \ - spin_trylock(lock) ? \ - 1 : ({ local_irq_restore(flags); 0; }); \ -}) + __cond_lock(lock, PICK_SPIN_OP_RET(__spin_trylock_irqsave, \ + _spin_trylock_irqsave, lock, &flags)) -#define write_trylock_irqsave(lock, flags) \ -({ \ - local_irq_save(flags); \ - write_trylock(lock) ? \ - 1 : ({ local_irq_restore(flags); 0; }); \ -}) +/* + * bit-based spin_lock() + * + * Don't use this unless you really need to: spin_lock() and spin_unlock() + * are significantly faster. + */ +static inline void bit_spin_lock(int bitnum, unsigned long *addr) +{ + /* + * Assuming the lock is uncontended, this never enters + * the body of the outer loop. If it is contended, then + * within the inner loop a non-atomic test is used to + * busywait with less bus contention for a good time to + * attempt to acquire the lock bit. + */ +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) + while (unlikely(test_and_set_bit_lock(bitnum, addr))) + while (test_bit(bitnum, addr)) + cpu_relax(); +#endif + __acquire(bitlock); +} + +/* + * Return true if it was acquired + */ +static inline int bit_spin_trylock(int bitnum, unsigned long *addr) +{ +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) + if (unlikely(test_and_set_bit_lock(bitnum, addr))) + return 0; +#endif + __acquire(bitlock); + return 1; +} + +/* + * bit-based spin_unlock(): + */ +static inline void bit_spin_unlock(int bitnum, unsigned long *addr) +{ +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) +# ifdef CONFIG_DEBUG_SPINLOCK + BUG_ON(!test_bit(bitnum, addr)); +# endif + clear_bit_unlock(bitnum, addr); +#endif + __release(bitlock); +} + +/* + * bit-based spin_unlock() - non-atomic version: + */ +static inline void __bit_spin_unlock(int bitnum, unsigned long *addr) +{ +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) +# ifdef CONFIG_DEBUG_SPINLOCK + BUG_ON(!test_bit(bitnum, addr)); +# endif + __clear_bit_unlock(bitnum, addr); +#endif + __release(bitlock); +} + +/* + * Return true if the lock is held. + */ +static inline int bit_spin_is_locked(int bitnum, unsigned long *addr) +{ +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) + return test_bit(bitnum, addr); +#else + return 1; +#endif +} + +/** + * __raw_spin_can_lock - would __raw_spin_trylock() succeed? + * @lock: the spinlock in question. + */ +#define __raw_spin_can_lock(lock) (!__raw_spin_is_locked(lock)) /* * Pull the atomic_t declaration: @@ -359,14 +610,25 @@ do { \ * Decrements @atomic by 1. If the result is 0, returns true and locks * @lock. Returns false for all other cases. */ -extern int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock); -#define atomic_dec_and_lock(atomic, lock) \ - __cond_lock(lock, _atomic_dec_and_lock(atomic, lock)) +/* "lock on reference count zero" */ +#ifndef ATOMIC_DEC_AND_LOCK +# include + extern int __atomic_dec_and_spin_lock(raw_spinlock_t *lock, atomic_t *atomic); +#endif + +#define atomic_dec_and_lock(atomic, lock) \ + __cond_lock(lock, PICK_SPIN_OP_RET(__atomic_dec_and_spin_lock, \ + _atomic_dec_and_spin_lock, lock, atomic)) /** * spin_can_lock - would spin_trylock() succeed? * @lock: the spinlock in question. */ -#define spin_can_lock(lock) (!spin_is_locked(lock)) +#define spin_can_lock(lock) \ + __cond_lock(lock, PICK_SPIN_OP_RET(__spin_can_lock, _spin_can_lock,\ + lock)) + +/* FIXME: porting hack! */ +#define spin_lock_nest_lock(lock, nest_lock) spin_lock_nested(lock, 0) #endif /* __LINUX_SPINLOCK_H */ Index: linux-2.6-tip/include/linux/spinlock_api_smp.h =================================================================== --- linux-2.6-tip.orig/include/linux/spinlock_api_smp.h +++ linux-2.6-tip/include/linux/spinlock_api_smp.h @@ -19,45 +19,60 @@ int in_lock_functions(unsigned long addr #define assert_spin_locked(x) BUG_ON(!spin_is_locked(x)) -void __lockfunc _spin_lock(spinlock_t *lock) __acquires(lock); -void __lockfunc _spin_lock_nested(spinlock_t *lock, int subclass) +void __lockfunc __spin_lock_nest_lock(raw_spinlock_t *lock, struct lockdep_map *map) __acquires(lock); -void __lockfunc _spin_lock_nest_lock(spinlock_t *lock, struct lockdep_map *map) - __acquires(lock); -void __lockfunc _read_lock(rwlock_t *lock) __acquires(lock); -void __lockfunc _write_lock(rwlock_t *lock) __acquires(lock); -void __lockfunc _spin_lock_bh(spinlock_t *lock) __acquires(lock); -void __lockfunc _read_lock_bh(rwlock_t *lock) __acquires(lock); -void __lockfunc _write_lock_bh(rwlock_t *lock) __acquires(lock); -void __lockfunc _spin_lock_irq(spinlock_t *lock) __acquires(lock); -void __lockfunc _read_lock_irq(rwlock_t *lock) __acquires(lock); -void __lockfunc _write_lock_irq(rwlock_t *lock) __acquires(lock); -unsigned long __lockfunc _spin_lock_irqsave(spinlock_t *lock) - __acquires(lock); -unsigned long __lockfunc _spin_lock_irqsave_nested(spinlock_t *lock, int subclass) - __acquires(lock); -unsigned long __lockfunc _read_lock_irqsave(rwlock_t *lock) - __acquires(lock); -unsigned long __lockfunc _write_lock_irqsave(rwlock_t *lock) - __acquires(lock); -int __lockfunc _spin_trylock(spinlock_t *lock); -int __lockfunc _read_trylock(rwlock_t *lock); -int __lockfunc _write_trylock(rwlock_t *lock); -int __lockfunc _spin_trylock_bh(spinlock_t *lock); -void __lockfunc _spin_unlock(spinlock_t *lock) __releases(lock); -void __lockfunc _read_unlock(rwlock_t *lock) __releases(lock); -void __lockfunc _write_unlock(rwlock_t *lock) __releases(lock); -void __lockfunc _spin_unlock_bh(spinlock_t *lock) __releases(lock); -void __lockfunc _read_unlock_bh(rwlock_t *lock) __releases(lock); -void __lockfunc _write_unlock_bh(rwlock_t *lock) __releases(lock); -void __lockfunc _spin_unlock_irq(spinlock_t *lock) __releases(lock); -void __lockfunc _read_unlock_irq(rwlock_t *lock) __releases(lock); -void __lockfunc _write_unlock_irq(rwlock_t *lock) __releases(lock); -void __lockfunc _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags) - __releases(lock); -void __lockfunc _read_unlock_irqrestore(rwlock_t *lock, unsigned long flags) - __releases(lock); -void __lockfunc _write_unlock_irqrestore(rwlock_t *lock, unsigned long flags) - __releases(lock); +#define ACQUIRE_SPIN __acquires(lock) +#define ACQUIRE_RW __acquires(lock) +#define RELEASE_SPIN __releases(lock) +#define RELEASE_RW __releases(lock) + +void __lockfunc __spin_lock(raw_spinlock_t *lock) ACQUIRE_SPIN; +void __lockfunc __spin_lock_nested(raw_spinlock_t *lock, int subclass) + ACQUIRE_SPIN; +void __lockfunc __read_lock(raw_rwlock_t *lock) ACQUIRE_RW; +void __lockfunc __write_lock(raw_rwlock_t *lock) ACQUIRE_RW; +void __lockfunc __spin_lock_bh(raw_spinlock_t *lock) ACQUIRE_SPIN; +void __lockfunc __read_lock_bh(raw_rwlock_t *lock) ACQUIRE_RW; +void __lockfunc __write_lock_bh(raw_rwlock_t *lock) ACQUIRE_RW; +void __lockfunc __spin_lock_irq(raw_spinlock_t *lock) ACQUIRE_SPIN; +void __lockfunc __read_lock_irq(raw_rwlock_t *lock) ACQUIRE_RW; +void __lockfunc __write_lock_irq(raw_rwlock_t *lock) ACQUIRE_RW; +unsigned long __lockfunc __spin_lock_irqsave(raw_spinlock_t *lock) + ACQUIRE_SPIN; +unsigned long __lockfunc +__spin_lock_irqsave_nested(raw_spinlock_t *lock, int subclass) ACQUIRE_SPIN; +unsigned long __lockfunc __read_lock_irqsave(raw_rwlock_t *lock) + ACQUIRE_RW; +unsigned long __lockfunc __write_lock_irqsave(raw_rwlock_t *lock) + ACQUIRE_RW; +int __lockfunc __spin_trylock(raw_spinlock_t *lock); +int __lockfunc +__spin_trylock_irqsave(raw_spinlock_t *lock, unsigned long *flags); +int __lockfunc __read_trylock(raw_rwlock_t *lock); +int __lockfunc __write_trylock(raw_rwlock_t *lock); +int __lockfunc +__write_trylock_irqsave(raw_rwlock_t *lock, unsigned long *flags); +int __lockfunc __spin_trylock_bh(raw_spinlock_t *lock); +int __lockfunc __spin_trylock_irq(raw_spinlock_t *lock); +void __lockfunc __spin_unlock(raw_spinlock_t *lock) RELEASE_SPIN; +void __lockfunc __spin_unlock_no_resched(raw_spinlock_t *lock) + RELEASE_SPIN; +void __lockfunc __read_unlock(raw_rwlock_t *lock) RELEASE_RW; +void __lockfunc __write_unlock(raw_rwlock_t *lock) RELEASE_RW; +void __lockfunc __spin_unlock_bh(raw_spinlock_t *lock) RELEASE_SPIN; +void __lockfunc __read_unlock_bh(raw_rwlock_t *lock) RELEASE_RW; +void __lockfunc __write_unlock_bh(raw_rwlock_t *lock) RELEASE_RW; +void __lockfunc __spin_unlock_irq(raw_spinlock_t *lock) RELEASE_SPIN; +void __lockfunc __read_unlock_irq(raw_rwlock_t *lock) RELEASE_RW; +void __lockfunc __write_unlock_irq(raw_rwlock_t *lock) RELEASE_RW; +void __lockfunc +__spin_unlock_irqrestore(raw_spinlock_t *lock, unsigned long flags) + RELEASE_SPIN; +void __lockfunc +__read_unlock_irqrestore(raw_rwlock_t *lock, unsigned long flags) + RELEASE_RW; +void +__lockfunc __write_unlock_irqrestore(raw_rwlock_t *lock, unsigned long flags) + RELEASE_RW; #endif /* __LINUX_SPINLOCK_API_SMP_H */ Index: linux-2.6-tip/include/linux/spinlock_api_up.h =================================================================== --- linux-2.6-tip.orig/include/linux/spinlock_api_up.h +++ linux-2.6-tip/include/linux/spinlock_api_up.h @@ -33,12 +33,22 @@ #define __LOCK_IRQ(lock) \ do { local_irq_disable(); __LOCK(lock); } while (0) -#define __LOCK_IRQSAVE(lock, flags) \ - do { local_irq_save(flags); __LOCK(lock); } while (0) +#define __LOCK_IRQSAVE(lock) \ + ({ unsigned long __flags; local_irq_save(__flags); __LOCK(lock); __flags; }) + +#define __TRYLOCK_IRQSAVE(lock, flags) \ + ({ local_irq_save(*(flags)); __LOCK(lock); 1; }) + +#define __spin_trylock_irqsave(lock, flags) __TRYLOCK_IRQSAVE(lock, flags) + +#define __write_trylock_irqsave(lock, flags) __TRYLOCK_IRQSAVE(lock, flags) #define __UNLOCK(lock) \ do { preempt_enable(); __release(lock); (void)(lock); } while (0) +#define __UNLOCK_NO_RESCHED(lock) \ + do { __preempt_enable_no_resched(); __release(lock); (void)(lock); } while (0) + #define __UNLOCK_BH(lock) \ do { preempt_enable_no_resched(); local_bh_enable(); __release(lock); (void)(lock); } while (0) @@ -48,34 +58,36 @@ #define __UNLOCK_IRQRESTORE(lock, flags) \ do { local_irq_restore(flags); __UNLOCK(lock); } while (0) -#define _spin_lock(lock) __LOCK(lock) -#define _spin_lock_nested(lock, subclass) __LOCK(lock) -#define _read_lock(lock) __LOCK(lock) -#define _write_lock(lock) __LOCK(lock) -#define _spin_lock_bh(lock) __LOCK_BH(lock) -#define _read_lock_bh(lock) __LOCK_BH(lock) -#define _write_lock_bh(lock) __LOCK_BH(lock) -#define _spin_lock_irq(lock) __LOCK_IRQ(lock) -#define _read_lock_irq(lock) __LOCK_IRQ(lock) -#define _write_lock_irq(lock) __LOCK_IRQ(lock) -#define _spin_lock_irqsave(lock, flags) __LOCK_IRQSAVE(lock, flags) -#define _read_lock_irqsave(lock, flags) __LOCK_IRQSAVE(lock, flags) -#define _write_lock_irqsave(lock, flags) __LOCK_IRQSAVE(lock, flags) -#define _spin_trylock(lock) ({ __LOCK(lock); 1; }) -#define _read_trylock(lock) ({ __LOCK(lock); 1; }) -#define _write_trylock(lock) ({ __LOCK(lock); 1; }) -#define _spin_trylock_bh(lock) ({ __LOCK_BH(lock); 1; }) -#define _spin_unlock(lock) __UNLOCK(lock) -#define _read_unlock(lock) __UNLOCK(lock) -#define _write_unlock(lock) __UNLOCK(lock) -#define _spin_unlock_bh(lock) __UNLOCK_BH(lock) -#define _write_unlock_bh(lock) __UNLOCK_BH(lock) -#define _read_unlock_bh(lock) __UNLOCK_BH(lock) -#define _spin_unlock_irq(lock) __UNLOCK_IRQ(lock) -#define _read_unlock_irq(lock) __UNLOCK_IRQ(lock) -#define _write_unlock_irq(lock) __UNLOCK_IRQ(lock) -#define _spin_unlock_irqrestore(lock, flags) __UNLOCK_IRQRESTORE(lock, flags) -#define _read_unlock_irqrestore(lock, flags) __UNLOCK_IRQRESTORE(lock, flags) -#define _write_unlock_irqrestore(lock, flags) __UNLOCK_IRQRESTORE(lock, flags) +#define __spin_lock(lock) __LOCK(lock) +#define __spin_lock_nested(lock, subclass) __LOCK(lock) +#define __read_lock(lock) __LOCK(lock) +#define __write_lock(lock) __LOCK(lock) +#define __spin_lock_bh(lock) __LOCK_BH(lock) +#define __read_lock_bh(lock) __LOCK_BH(lock) +#define __write_lock_bh(lock) __LOCK_BH(lock) +#define __spin_lock_irq(lock) __LOCK_IRQ(lock) +#define __read_lock_irq(lock) __LOCK_IRQ(lock) +#define __write_lock_irq(lock) __LOCK_IRQ(lock) +#define __spin_lock_irqsave(lock) __LOCK_IRQSAVE(lock) +#define __read_lock_irqsave(lock) __LOCK_IRQSAVE(lock) +#define __write_lock_irqsave(lock) __LOCK_IRQSAVE(lock) +#define __spin_trylock(lock) ({ __LOCK(lock); 1; }) +#define __read_trylock(lock) ({ __LOCK(lock); 1; }) +#define __write_trylock(lock) ({ __LOCK(lock); 1; }) +#define __spin_trylock_bh(lock) ({ __LOCK_BH(lock); 1; }) +#define __spin_trylock_irq(lock) ({ __LOCK_IRQ(lock); 1; }) +#define __spin_unlock(lock) __UNLOCK(lock) +#define __spin_unlock_no_resched(lock) __UNLOCK_NO_RESCHED(lock) +#define __read_unlock(lock) __UNLOCK(lock) +#define __write_unlock(lock) __UNLOCK(lock) +#define __spin_unlock_bh(lock) __UNLOCK_BH(lock) +#define __write_unlock_bh(lock) __UNLOCK_BH(lock) +#define __read_unlock_bh(lock) __UNLOCK_BH(lock) +#define __spin_unlock_irq(lock) __UNLOCK_IRQ(lock) +#define __read_unlock_irq(lock) __UNLOCK_IRQ(lock) +#define __write_unlock_irq(lock) __UNLOCK_IRQ(lock) +#define __spin_unlock_irqrestore(lock, flags) __UNLOCK_IRQRESTORE(lock, flags) +#define __read_unlock_irqrestore(lock, flags) __UNLOCK_IRQRESTORE(lock, flags) +#define __write_unlock_irqrestore(lock, flags) __UNLOCK_IRQRESTORE(lock, flags) #endif /* __LINUX_SPINLOCK_API_UP_H */ Index: linux-2.6-tip/include/linux/spinlock_types.h =================================================================== --- linux-2.6-tip.orig/include/linux/spinlock_types.h +++ linux-2.6-tip/include/linux/spinlock_types.h @@ -15,10 +15,27 @@ # include #endif +/* + * Must define these before including other files, inline functions need them + */ +#define LOCK_SECTION_NAME ".text.lock."KBUILD_BASENAME + +#define LOCK_SECTION_START(extra) \ + ".subsection 1\n\t" \ + extra \ + ".ifndef " LOCK_SECTION_NAME "\n\t" \ + LOCK_SECTION_NAME ":\n\t" \ + ".endif\n" + +#define LOCK_SECTION_END \ + ".previous\n\t" + +#define __lockfunc __attribute__((section(".spinlock.text"))) + #include typedef struct { - raw_spinlock_t raw_lock; + __raw_spinlock_t raw_lock; #ifdef CONFIG_GENERIC_LOCKBREAK unsigned int break_lock; #endif @@ -29,12 +46,12 @@ typedef struct { #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; #endif -} spinlock_t; +} raw_spinlock_t; #define SPINLOCK_MAGIC 0xdead4ead typedef struct { - raw_rwlock_t raw_lock; + __raw_rwlock_t raw_lock; #ifdef CONFIG_GENERIC_LOCKBREAK unsigned int break_lock; #endif @@ -45,7 +62,7 @@ typedef struct { #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; #endif -} rwlock_t; +} raw_rwlock_t; #define RWLOCK_MAGIC 0xdeaf1eed @@ -64,24 +81,24 @@ typedef struct { #endif #ifdef CONFIG_DEBUG_SPINLOCK -# define __SPIN_LOCK_UNLOCKED(lockname) \ - (spinlock_t) { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED, \ +# define _RAW_SPIN_LOCK_UNLOCKED(lockname) \ + { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED, \ .magic = SPINLOCK_MAGIC, \ .owner = SPINLOCK_OWNER_INIT, \ .owner_cpu = -1, \ SPIN_DEP_MAP_INIT(lockname) } -#define __RW_LOCK_UNLOCKED(lockname) \ - (rwlock_t) { .raw_lock = __RAW_RW_LOCK_UNLOCKED, \ +#define _RAW_RW_LOCK_UNLOCKED(lockname) \ + { .raw_lock = __RAW_RW_LOCK_UNLOCKED, \ .magic = RWLOCK_MAGIC, \ .owner = SPINLOCK_OWNER_INIT, \ .owner_cpu = -1, \ RW_DEP_MAP_INIT(lockname) } #else -# define __SPIN_LOCK_UNLOCKED(lockname) \ - (spinlock_t) { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED, \ +# define _RAW_SPIN_LOCK_UNLOCKED(lockname) \ + { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED, \ SPIN_DEP_MAP_INIT(lockname) } -#define __RW_LOCK_UNLOCKED(lockname) \ - (rwlock_t) { .raw_lock = __RAW_RW_LOCK_UNLOCKED, \ +# define _RAW_RW_LOCK_UNLOCKED(lockname) \ + { .raw_lock = __RAW_RW_LOCK_UNLOCKED, \ RW_DEP_MAP_INIT(lockname) } #endif @@ -91,10 +108,22 @@ typedef struct { * Please use DEFINE_SPINLOCK()/DEFINE_RWLOCK() or * __SPIN_LOCK_UNLOCKED()/__RW_LOCK_UNLOCKED() as appropriate. */ -#define SPIN_LOCK_UNLOCKED __SPIN_LOCK_UNLOCKED(old_style_spin_init) -#define RW_LOCK_UNLOCKED __RW_LOCK_UNLOCKED(old_style_rw_init) -#define DEFINE_SPINLOCK(x) spinlock_t x = __SPIN_LOCK_UNLOCKED(x) -#define DEFINE_RWLOCK(x) rwlock_t x = __RW_LOCK_UNLOCKED(x) +# define RAW_SPIN_LOCK_UNLOCKED(lockname) \ + (raw_spinlock_t) _RAW_SPIN_LOCK_UNLOCKED(lockname) + +# define RAW_RW_LOCK_UNLOCKED(lockname) \ + (raw_rwlock_t) _RAW_RW_LOCK_UNLOCKED(lockname) + +#define DEFINE_RAW_SPINLOCK(name) \ + raw_spinlock_t name __cacheline_aligned_in_smp = \ + RAW_SPIN_LOCK_UNLOCKED(name) + +#define __DEFINE_RAW_SPINLOCK(name) \ + raw_spinlock_t name = RAW_SPIN_LOCK_UNLOCKED(name) + +#define DEFINE_RAW_RWLOCK(name) \ + raw_rwlock_t name __cacheline_aligned_in_smp = \ + RAW_RW_LOCK_UNLOCKED(name) #endif /* __LINUX_SPINLOCK_TYPES_H */ Index: linux-2.6-tip/include/linux/spinlock_types_up.h =================================================================== --- linux-2.6-tip.orig/include/linux/spinlock_types_up.h +++ linux-2.6-tip/include/linux/spinlock_types_up.h @@ -16,13 +16,13 @@ typedef struct { volatile unsigned int slock; -} raw_spinlock_t; +} __raw_spinlock_t; #define __RAW_SPIN_LOCK_UNLOCKED { 1 } #else -typedef struct { } raw_spinlock_t; +typedef struct { } __raw_spinlock_t; #define __RAW_SPIN_LOCK_UNLOCKED { } @@ -30,7 +30,7 @@ typedef struct { } raw_spinlock_t; typedef struct { /* no debug version on UP */ -} raw_rwlock_t; +} __raw_rwlock_t; #define __RAW_RW_LOCK_UNLOCKED { } Index: linux-2.6-tip/include/linux/spinlock_up.h =================================================================== --- linux-2.6-tip.orig/include/linux/spinlock_up.h +++ linux-2.6-tip/include/linux/spinlock_up.h @@ -20,19 +20,19 @@ #ifdef CONFIG_DEBUG_SPINLOCK #define __raw_spin_is_locked(x) ((x)->slock == 0) -static inline void __raw_spin_lock(raw_spinlock_t *lock) +static inline void __raw_spin_lock(__raw_spinlock_t *lock) { lock->slock = 0; } static inline void -__raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags) +__raw_spin_lock_flags(__raw_spinlock_t *lock, unsigned long flags) { local_irq_save(flags); lock->slock = 0; } -static inline int __raw_spin_trylock(raw_spinlock_t *lock) +static inline int __raw_spin_trylock(__raw_spinlock_t *lock) { char oldval = lock->slock; @@ -41,7 +41,7 @@ static inline int __raw_spin_trylock(raw return oldval > 0; } -static inline void __raw_spin_unlock(raw_spinlock_t *lock) +static inline void __raw_spin_unlock(__raw_spinlock_t *lock) { lock->slock = 1; } Index: linux-2.6-tip/kernel/futex.c =================================================================== --- linux-2.6-tip.orig/kernel/futex.c +++ linux-2.6-tip/kernel/futex.c @@ -910,8 +910,12 @@ retry: plist_add(&this->list, &hb2->chain); this->lock_ptr = &hb2->lock; #ifdef CONFIG_DEBUG_PI_LIST +#ifdef CONFIG_PREEMPT_RT + this->list.plist.lock = NULL; +#else this->list.plist.lock = &hb2->lock; #endif +#endif } this->key = key2; get_futex_key_refs(&key2); @@ -970,8 +974,12 @@ static inline void queue_me(struct futex plist_node_init(&q->list, prio); #ifdef CONFIG_DEBUG_PI_LIST +#ifdef CONFIG_PREEMPT_RT + q->list.plist.lock = NULL; +#else q->list.plist.lock = &hb->lock; #endif +#endif plist_add(&q->list, &hb->chain); q->task = current; spin_unlock(&hb->lock); @@ -1245,6 +1253,10 @@ retry: * q.lock_ptr != 0 is not safe, because of ordering against wakeup. */ if (likely(!plist_node_empty(&q.list))) { + unsigned long nosched_flag = current->flags & PF_NOSCHED; + + current->flags &= ~PF_NOSCHED; + if (!abs_time) schedule(); else { @@ -1278,6 +1290,8 @@ retry: destroy_hrtimer_on_stack(&t.timer); } + + current->flags |= nosched_flag; } __set_current_state(TASK_RUNNING); @@ -2032,7 +2046,11 @@ static int __init futex_init(void) futex_cmpxchg_enabled = 1; for (i = 0; i < ARRAY_SIZE(futex_queues); i++) { +#ifdef CONFIG_PREEMPT_RT + plist_head_init(&futex_queues[i].chain, NULL); +#else plist_head_init(&futex_queues[i].chain, &futex_queues[i].lock); +#endif spin_lock_init(&futex_queues[i].lock); } Index: linux-2.6-tip/kernel/rt.c =================================================================== --- /dev/null +++ linux-2.6-tip/kernel/rt.c @@ -0,0 +1,589 @@ +/* + * kernel/rt.c + * + * Real-Time Preemption Support + * + * started by Ingo Molnar: + * + * Copyright (C) 2004-2006 Red Hat, Inc., Ingo Molnar + * Copyright (C) 2006, Timesys Corp., Thomas Gleixner + * + * historic credit for proving that Linux spinlocks can be implemented via + * RT-aware mutexes goes to many people: The Pmutex project (Dirk Grambow + * and others) who prototyped it on 2.4 and did lots of comparative + * research and analysis; TimeSys, for proving that you can implement a + * fully preemptible kernel via the use of IRQ threading and mutexes; + * Bill Huey for persuasively arguing on lkml that the mutex model is the + * right one; and to MontaVista, who ported pmutexes to 2.6. + * + * This code is a from-scratch implementation and is not based on pmutexes, + * but the idea of converting spinlocks to mutexes is used here too. + * + * lock debugging, locking tree, deadlock detection: + * + * Copyright (C) 2004, LynuxWorks, Inc., Igor Manyilov, Bill Huey + * Released under the General Public License (GPL). + * + * Includes portions of the generic R/W semaphore implementation from: + * + * Copyright (c) 2001 David Howells (dhowells@redhat.com). + * - Derived partially from idea by Andrea Arcangeli + * - Derived also from comments by Linus + * + * Pending ownership of locks and ownership stealing: + * + * Copyright (C) 2005, Kihon Technologies Inc., Steven Rostedt + * + * (also by Steven Rostedt) + * - Converted single pi_lock to individual task locks. + * + * By Esben Nielsen: + * Doing priority inheritance with help of the scheduler. + * + * Copyright (C) 2006, Timesys Corp., Thomas Gleixner + * - major rework based on Esben Nielsens initial patch + * - replaced thread_info references by task_struct refs + * - removed task->pending_owner dependency + * - BKL drop/reacquire for semaphore style locks to avoid deadlocks + * in the scheduler return path as discussed with Steven Rostedt + * + * Copyright (C) 2006, Kihon Technologies Inc. + * Steven Rostedt + * - debugged and patched Thomas Gleixner's rework. + * - added back the cmpxchg to the rework. + * - turned atomic require back on for SMP. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "rtmutex_common.h" + +#ifdef CONFIG_PREEMPT_RT +/* + * Unlock these on crash: + */ +void zap_rt_locks(void) +{ + //trace_lock_init(); +} +#endif + +/* + * struct mutex functions + */ +void __mutex_init(struct mutex *lock, char *name, struct lock_class_key *key) +{ +#ifdef CONFIG_DEBUG_LOCK_ALLOC + /* + * Make sure we are not reinitializing a held lock: + */ + debug_check_no_locks_freed((void *)lock, sizeof(*lock)); + lockdep_init_map(&lock->dep_map, name, key, 0); +#endif + __rt_mutex_init(&lock->lock, name); +} +EXPORT_SYMBOL(__mutex_init); + +void __lockfunc _mutex_lock(struct mutex *lock) +{ + mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_); + rt_mutex_lock(&lock->lock); +} +EXPORT_SYMBOL(_mutex_lock); + +int __lockfunc _mutex_lock_interruptible(struct mutex *lock) +{ + int ret; + + mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_); + ret = rt_mutex_lock_interruptible(&lock->lock, 0); + if (ret) + mutex_release(&lock->dep_map, 1, _RET_IP_); + return ret; +} +EXPORT_SYMBOL(_mutex_lock_interruptible); + +int __lockfunc _mutex_lock_killable(struct mutex *lock) +{ + int ret; + + mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_); + ret = rt_mutex_lock_killable(&lock->lock, 0); + if (ret) + mutex_release(&lock->dep_map, 1, _RET_IP_); + return ret; +} +EXPORT_SYMBOL(_mutex_lock_killable); + +#ifdef CONFIG_DEBUG_LOCK_ALLOC +void __lockfunc _mutex_lock_nested(struct mutex *lock, int subclass) +{ + mutex_acquire(&lock->dep_map, subclass, 0, _RET_IP_); + rt_mutex_lock(&lock->lock); +} +EXPORT_SYMBOL(_mutex_lock_nested); + +int __lockfunc _mutex_lock_interruptible_nested(struct mutex *lock, int subclass) +{ + int ret; + + mutex_acquire(&lock->dep_map, subclass, 0, _RET_IP_); + ret = rt_mutex_lock_interruptible(&lock->lock, 0); + if (ret) + mutex_release(&lock->dep_map, 1, _RET_IP_); + return ret; +} +EXPORT_SYMBOL(_mutex_lock_interruptible_nested); + +int __lockfunc _mutex_lock_killable_nested(struct mutex *lock, int subclass) +{ + int ret; + + mutex_acquire(&lock->dep_map, subclass, 0, _RET_IP_); + ret = rt_mutex_lock_killable(&lock->lock, 0); + if (ret) + mutex_release(&lock->dep_map, 1, _RET_IP_); + return ret; +} +EXPORT_SYMBOL(_mutex_lock_killable_nested); +#endif + +int __lockfunc _mutex_trylock(struct mutex *lock) +{ + int ret = rt_mutex_trylock(&lock->lock); + + if (ret) + mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_); + + return ret; +} +EXPORT_SYMBOL(_mutex_trylock); + +void __lockfunc _mutex_unlock(struct mutex *lock) +{ + mutex_release(&lock->dep_map, 1, _RET_IP_); + rt_mutex_unlock(&lock->lock); +} +EXPORT_SYMBOL(_mutex_unlock); + +/* + * rwlock_t functions + */ +int __lockfunc rt_write_trylock(rwlock_t *rwlock) +{ + int ret = rt_mutex_trylock(&rwlock->lock); + + if (ret) + rwlock_acquire(&rwlock->dep_map, 0, 1, _RET_IP_); + + return ret; +} +EXPORT_SYMBOL(rt_write_trylock); + +int __lockfunc rt_write_trylock_irqsave(rwlock_t *rwlock, unsigned long *flags) +{ + *flags = 0; + return rt_write_trylock(rwlock); +} +EXPORT_SYMBOL(rt_write_trylock_irqsave); + +int __lockfunc rt_read_trylock(rwlock_t *rwlock) +{ + struct rt_mutex *lock = &rwlock->lock; + unsigned long flags; + int ret; + + /* + * Read locks within the self-held write lock succeed. + */ + spin_lock_irqsave(&lock->wait_lock, flags); + if (rt_mutex_real_owner(lock) == current) { + spin_unlock_irqrestore(&lock->wait_lock, flags); + rwlock->read_depth++; + rwlock_acquire_read(&rwlock->dep_map, 0, 1, _RET_IP_); + return 1; + } + spin_unlock_irqrestore(&lock->wait_lock, flags); + + ret = rt_mutex_trylock(lock); + if (ret) + rwlock_acquire_read(&rwlock->dep_map, 0, 1, _RET_IP_); + + return ret; +} +EXPORT_SYMBOL(rt_read_trylock); + +void __lockfunc rt_write_lock(rwlock_t *rwlock) +{ + rwlock_acquire(&rwlock->dep_map, 0, 0, _RET_IP_); + __rt_spin_lock(&rwlock->lock); +} +EXPORT_SYMBOL(rt_write_lock); + +void __lockfunc rt_read_lock(rwlock_t *rwlock) +{ + unsigned long flags; + struct rt_mutex *lock = &rwlock->lock; + + rwlock_acquire_read(&rwlock->dep_map, 0, 0, _RET_IP_); + /* + * Read locks within the write lock succeed. + */ + spin_lock_irqsave(&lock->wait_lock, flags); + if (rt_mutex_real_owner(lock) == current) { + spin_unlock_irqrestore(&lock->wait_lock, flags); + rwlock->read_depth++; + return; + } + spin_unlock_irqrestore(&lock->wait_lock, flags); + __rt_spin_lock(lock); +} + +EXPORT_SYMBOL(rt_read_lock); + +void __lockfunc rt_write_unlock(rwlock_t *rwlock) +{ + /* NOTE: we always pass in '1' for nested, for simplicity */ + rwlock_release(&rwlock->dep_map, 1, _RET_IP_); + __rt_spin_unlock(&rwlock->lock); +} +EXPORT_SYMBOL(rt_write_unlock); + +void __lockfunc rt_read_unlock(rwlock_t *rwlock) +{ + struct rt_mutex *lock = &rwlock->lock; + unsigned long flags; + + rwlock_release(&rwlock->dep_map, 1, _RET_IP_); + // TRACE_WARN_ON(lock->save_state != 1); + /* + * Read locks within the self-held write lock succeed. + */ + spin_lock_irqsave(&lock->wait_lock, flags); + if (rt_mutex_real_owner(lock) == current && rwlock->read_depth) { + spin_unlock_irqrestore(&lock->wait_lock, flags); + rwlock->read_depth--; + return; + } + spin_unlock_irqrestore(&lock->wait_lock, flags); + __rt_spin_unlock(&rwlock->lock); +} +EXPORT_SYMBOL(rt_read_unlock); + +unsigned long __lockfunc rt_write_lock_irqsave(rwlock_t *rwlock) +{ + rt_write_lock(rwlock); + + return 0; +} +EXPORT_SYMBOL(rt_write_lock_irqsave); + +unsigned long __lockfunc rt_read_lock_irqsave(rwlock_t *rwlock) +{ + rt_read_lock(rwlock); + + return 0; +} +EXPORT_SYMBOL(rt_read_lock_irqsave); + +void __rt_rwlock_init(rwlock_t *rwlock, char *name, struct lock_class_key *key) +{ +#ifdef CONFIG_DEBUG_LOCK_ALLOC + /* + * Make sure we are not reinitializing a held lock: + */ + debug_check_no_locks_freed((void *)rwlock, sizeof(*rwlock)); + lockdep_init_map(&rwlock->dep_map, name, key, 0); +#endif + __rt_mutex_init(&rwlock->lock, name); + rwlock->read_depth = 0; +} +EXPORT_SYMBOL(__rt_rwlock_init); + +/* + * rw_semaphores + */ + +void rt_up_write(struct rw_semaphore *rwsem) +{ + rwsem_release(&rwsem->dep_map, 1, _RET_IP_); + rt_mutex_unlock(&rwsem->lock); +} +EXPORT_SYMBOL(rt_up_write); + +void rt_up_read(struct rw_semaphore *rwsem) +{ + unsigned long flags; + + rwsem_release(&rwsem->dep_map, 1, _RET_IP_); + /* + * Read locks within the self-held write lock succeed. + */ + spin_lock_irqsave(&rwsem->lock.wait_lock, flags); + if (rt_mutex_real_owner(&rwsem->lock) == current && rwsem->read_depth) { + spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); + rwsem->read_depth--; + return; + } + spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); + rt_mutex_unlock(&rwsem->lock); +} +EXPORT_SYMBOL(rt_up_read); + +/* + * downgrade a write lock into a read lock + * - just wake up any readers at the front of the queue + */ +void rt_downgrade_write(struct rw_semaphore *rwsem) +{ + BUG(); +} +EXPORT_SYMBOL(rt_downgrade_write); + +int rt_down_write_trylock(struct rw_semaphore *rwsem) +{ + int ret = rt_mutex_trylock(&rwsem->lock); + + if (ret) + rwsem_acquire(&rwsem->dep_map, 0, 1, _RET_IP_); + return ret; +} +EXPORT_SYMBOL(rt_down_write_trylock); + +void rt_down_write(struct rw_semaphore *rwsem) +{ + rwsem_acquire(&rwsem->dep_map, 0, 0, _RET_IP_); + rt_mutex_lock(&rwsem->lock); +} +EXPORT_SYMBOL(rt_down_write); + +void rt_down_write_nested(struct rw_semaphore *rwsem, int subclass) +{ + rwsem_acquire(&rwsem->dep_map, subclass, 0, _RET_IP_); + rt_mutex_lock(&rwsem->lock); +} +EXPORT_SYMBOL(rt_down_write_nested); + +int rt_down_read_trylock(struct rw_semaphore *rwsem) +{ + unsigned long flags; + int ret; + + /* + * Read locks within the self-held write lock succeed. + */ + spin_lock_irqsave(&rwsem->lock.wait_lock, flags); + if (rt_mutex_real_owner(&rwsem->lock) == current) { + spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); + rwsem_acquire_read(&rwsem->dep_map, 0, 1, _RET_IP_); + rwsem->read_depth++; + return 1; + } + spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); + + ret = rt_mutex_trylock(&rwsem->lock); + if (ret) + rwsem_acquire(&rwsem->dep_map, 0, 1, _RET_IP_); + return ret; +} +EXPORT_SYMBOL(rt_down_read_trylock); + +static void __rt_down_read(struct rw_semaphore *rwsem, int subclass) +{ + unsigned long flags; + + rwsem_acquire_read(&rwsem->dep_map, subclass, 0, _RET_IP_); + + /* + * Read locks within the write lock succeed. + */ + spin_lock_irqsave(&rwsem->lock.wait_lock, flags); + + if (rt_mutex_real_owner(&rwsem->lock) == current) { + spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); + rwsem->read_depth++; + return; + } + spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); + rt_mutex_lock(&rwsem->lock); +} + +void rt_down_read(struct rw_semaphore *rwsem) +{ + __rt_down_read(rwsem, 0); +} +EXPORT_SYMBOL(rt_down_read); + +void rt_down_read_nested(struct rw_semaphore *rwsem, int subclass) +{ + __rt_down_read(rwsem, subclass); +} +EXPORT_SYMBOL(rt_down_read_nested); + +void __rt_rwsem_init(struct rw_semaphore *rwsem, char *name, + struct lock_class_key *key) +{ +#ifdef CONFIG_DEBUG_LOCK_ALLOC + /* + * Make sure we are not reinitializing a held lock: + */ + debug_check_no_locks_freed((void *)rwsem, sizeof(*rwsem)); + lockdep_init_map(&rwsem->dep_map, name, key, 0); +#endif + __rt_mutex_init(&rwsem->lock, name); + rwsem->read_depth = 0; +} +EXPORT_SYMBOL(__rt_rwsem_init); + +/* + * Semaphores + */ +/* + * Linux Semaphores implemented via RT-mutexes. + * + * In the down() variants we use the mutex as the semaphore blocking + * object: we always acquire it, decrease the counter and keep the lock + * locked if we did the 1->0 transition. The next down() will then block. + * + * In the up() path we atomically increase the counter and do the + * unlock if we were the one doing the 0->1 transition. + */ + +static inline void __down_complete(struct semaphore *sem) +{ + int count = atomic_dec_return(&sem->count); + + if (unlikely(count > 0)) + rt_mutex_unlock(&sem->lock); +} + +void rt_down(struct semaphore *sem) +{ + rt_mutex_lock(&sem->lock); + __down_complete(sem); +} +EXPORT_SYMBOL(rt_down); + +int rt_down_interruptible(struct semaphore *sem) +{ + int ret; + + ret = rt_mutex_lock_interruptible(&sem->lock, 0); + if (ret) + return ret; + __down_complete(sem); + return 0; +} +EXPORT_SYMBOL(rt_down_interruptible); + +int rt_down_timeout(struct semaphore *sem, long jiff) +{ + struct hrtimer_sleeper t; + struct timespec ts; + unsigned long expires = jiffies + jiff + 1; + int ret; + + /* + * rt_mutex_slowlock can use an interruptible, but this needs to + * be TASK_INTERRUPTIBLE. The down_timeout uses TASK_UNINTERRUPTIBLE. + * To handle this we loop if a signal caused the timeout and the + * we recalculate the new timeout. + * Yes Thomas, this is a hack! But we can fix it right later. + */ + do { + jiffies_to_timespec(jiff, &ts); + hrtimer_init_on_stack(&t.timer, HRTIMER_MODE_REL, CLOCK_MONOTONIC); + t.timer._expires = timespec_to_ktime(ts); + + ret = rt_mutex_timed_lock(&sem->lock, &t, 0); + if (ret != -EINTR) + break; + + /* signal occured, but the down_timeout doesn't handle them */ + jiff = expires - jiffies; + + } while (jiff > 0); + + if (!ret) + __down_complete(sem); + else + ret = -ETIME; + + return ret; +} +EXPORT_SYMBOL(rt_down_timeout); + +/* + * try to down the semaphore, 0 on success and 1 on failure. (inverted) + */ +int rt_down_trylock(struct semaphore *sem) +{ + /* + * Here we are a tiny bit different from ordinary Linux semaphores, + * because we can get 'transient' locking-failures when say a + * process decreases the count from 9 to 8 and locks/releases the + * embedded mutex internally. It would be quite complex to remove + * these transient failures so lets try it the simple way first: + */ + if (rt_mutex_trylock(&sem->lock)) { + __down_complete(sem); + return 0; + } + return 1; +} +EXPORT_SYMBOL(rt_down_trylock); + +void rt_up(struct semaphore *sem) +{ + int count; + + /* + * Disable preemption to make sure a highprio trylock-er cannot + * preempt us here and get into an infinite loop: + */ + preempt_disable(); + count = atomic_inc_return(&sem->count); + /* + * If we did the 0 -> 1 transition then we are the ones to unlock it: + */ + if (likely(count == 1)) + rt_mutex_unlock(&sem->lock); + preempt_enable(); +} +EXPORT_SYMBOL(rt_up); + +void __sema_init(struct semaphore *sem, int val, + char *name, char *file, int line) +{ + atomic_set(&sem->count, val); + switch (val) { + case 0: + __rt_mutex_init(&sem->lock, name); + rt_mutex_lock(&sem->lock); + break; + default: + __rt_mutex_init(&sem->lock, name); + break; + } +} +EXPORT_SYMBOL(__sema_init); + +void __init_MUTEX(struct semaphore *sem, char *name, char *file, + int line) +{ + __sema_init(sem, 1, name, file, line); +} +EXPORT_SYMBOL(__init_MUTEX); + Index: linux-2.6-tip/kernel/rtmutex-debug.c =================================================================== --- linux-2.6-tip.orig/kernel/rtmutex-debug.c +++ linux-2.6-tip/kernel/rtmutex-debug.c @@ -16,6 +16,7 @@ * * See rt.c in preempt-rt for proper credits and further information */ +#include #include #include #include @@ -29,61 +30,6 @@ #include "rtmutex_common.h" -# define TRACE_WARN_ON(x) WARN_ON(x) -# define TRACE_BUG_ON(x) BUG_ON(x) - -# define TRACE_OFF() \ -do { \ - if (rt_trace_on) { \ - rt_trace_on = 0; \ - console_verbose(); \ - if (spin_is_locked(¤t->pi_lock)) \ - spin_unlock(¤t->pi_lock); \ - } \ -} while (0) - -# define TRACE_OFF_NOLOCK() \ -do { \ - if (rt_trace_on) { \ - rt_trace_on = 0; \ - console_verbose(); \ - } \ -} while (0) - -# define TRACE_BUG_LOCKED() \ -do { \ - TRACE_OFF(); \ - BUG(); \ -} while (0) - -# define TRACE_WARN_ON_LOCKED(c) \ -do { \ - if (unlikely(c)) { \ - TRACE_OFF(); \ - WARN_ON(1); \ - } \ -} while (0) - -# define TRACE_BUG_ON_LOCKED(c) \ -do { \ - if (unlikely(c)) \ - TRACE_BUG_LOCKED(); \ -} while (0) - -#ifdef CONFIG_SMP -# define SMP_TRACE_BUG_ON_LOCKED(c) TRACE_BUG_ON_LOCKED(c) -#else -# define SMP_TRACE_BUG_ON_LOCKED(c) do { } while (0) -#endif - -/* - * deadlock detection flag. We turn it off when we detect - * the first problem because we dont want to recurse back - * into the tracing code when doing error printk or - * executing a BUG(): - */ -static int rt_trace_on = 1; - static void printk_task(struct task_struct *p) { if (p) @@ -111,8 +57,8 @@ static void printk_lock(struct rt_mutex void rt_mutex_debug_task_free(struct task_struct *task) { - WARN_ON(!plist_head_empty(&task->pi_waiters)); - WARN_ON(task->pi_blocked_on); + DEBUG_LOCKS_WARN_ON(!plist_head_empty(&task->pi_waiters)); + DEBUG_LOCKS_WARN_ON(task->pi_blocked_on); } /* @@ -125,7 +71,7 @@ void debug_rt_mutex_deadlock(int detect, { struct task_struct *task; - if (!rt_trace_on || detect || !act_waiter) + if (!debug_locks || detect || !act_waiter) return; task = rt_mutex_owner(act_waiter->lock); @@ -139,7 +85,7 @@ void debug_rt_mutex_print_deadlock(struc { struct task_struct *task; - if (!waiter->deadlock_lock || !rt_trace_on) + if (!waiter->deadlock_lock || !debug_locks) return; rcu_read_lock(); @@ -149,7 +95,8 @@ void debug_rt_mutex_print_deadlock(struc return; } - TRACE_OFF_NOLOCK(); + if (!debug_locks_off()) + return; printk("\n============================================\n"); printk( "[ BUG: circular locking deadlock detected! ]\n"); @@ -180,7 +127,6 @@ void debug_rt_mutex_print_deadlock(struc printk("[ turning off deadlock detection." "Please report this trace. ]\n\n"); - local_irq_disable(); } void debug_rt_mutex_lock(struct rt_mutex *lock) @@ -189,7 +135,8 @@ void debug_rt_mutex_lock(struct rt_mutex void debug_rt_mutex_unlock(struct rt_mutex *lock) { - TRACE_WARN_ON_LOCKED(rt_mutex_owner(lock) != current); + if (debug_locks) + DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current); } void @@ -199,7 +146,7 @@ debug_rt_mutex_proxy_lock(struct rt_mute void debug_rt_mutex_proxy_unlock(struct rt_mutex *lock) { - TRACE_WARN_ON_LOCKED(!rt_mutex_owner(lock)); + DEBUG_LOCKS_WARN_ON(!rt_mutex_owner(lock)); } void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter) @@ -213,9 +160,9 @@ void debug_rt_mutex_init_waiter(struct r void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter) { put_pid(waiter->deadlock_task_pid); - TRACE_WARN_ON(!plist_node_empty(&waiter->list_entry)); - TRACE_WARN_ON(!plist_node_empty(&waiter->pi_list_entry)); - TRACE_WARN_ON(waiter->task); + DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->list_entry)); + DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->pi_list_entry)); + DEBUG_LOCKS_WARN_ON(waiter->task); memset(waiter, 0x22, sizeof(*waiter)); } @@ -231,9 +178,36 @@ void debug_rt_mutex_init(struct rt_mutex void rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task) { +#ifdef CONFIG_DEBUG_PREEMPT + if (atomic_read(&task->lock_count) >= MAX_LOCK_STACK) { + if (!debug_locks_off()) + return; + printk("BUG: %s/%d: lock count overflow!\n", + task->comm, task->pid); + dump_stack(); + return; + } +#ifdef CONFIG_PREEMPT_RT + task->owned_lock[atomic_read(&task->lock_count)] = lock; +#endif + atomic_inc(&task->lock_count); +#endif } void rt_mutex_deadlock_account_unlock(struct task_struct *task) { +#ifdef CONFIG_DEBUG_PREEMPT + if (!atomic_read(&task->lock_count)) { + if (!debug_locks_off()) + return; + printk("BUG: %s/%d: lock count underflow!\n", + task->comm, task->pid); + dump_stack(); + return; + } + atomic_dec(&task->lock_count); +#ifdef CONFIG_PREEMPT_RT + task->owned_lock[atomic_read(&task->lock_count)] = NULL; +#endif +#endif } - Index: linux-2.6-tip/kernel/rtmutex.c =================================================================== --- linux-2.6-tip.orig/kernel/rtmutex.c +++ linux-2.6-tip/kernel/rtmutex.c @@ -8,12 +8,20 @@ * Copyright (C) 2005 Kihon Technologies Inc., Steven Rostedt * Copyright (C) 2006 Esben Nielsen * + * Adaptive Spinlocks: + * Copyright (C) 2008 Novell, Inc., Gregory Haskins, Sven Dietrich, + * and Peter Morreale, + * Adaptive Spinlocks simplification: + * Copyright (C) 2008 Red Hat, Inc., Steven Rostedt + * * See Documentation/rt-mutex-design.txt for details. */ #include #include #include #include +#include +#include #include "rtmutex_common.h" @@ -97,6 +105,22 @@ static inline void mark_rt_mutex_waiters } #endif +int pi_initialized; + +/* + * we initialize the wait_list runtime. (Could be done build-time and/or + * boot-time.) + */ +static inline void init_lists(struct rt_mutex *lock) +{ + if (unlikely(!lock->wait_list.prio_list.prev)) { + plist_head_init(&lock->wait_list, &lock->wait_lock); +#ifdef CONFIG_DEBUG_RT_MUTEXES + pi_initialized++; +#endif + } +} + /* * Calculate task priority from the waiter list priority * @@ -253,13 +277,13 @@ static int rt_mutex_adjust_prio_chain(st plist_add(&waiter->list_entry, &lock->wait_list); /* Release the task */ - spin_unlock_irqrestore(&task->pi_lock, flags); + spin_unlock(&task->pi_lock); put_task_struct(task); /* Grab the next task */ task = rt_mutex_owner(lock); get_task_struct(task); - spin_lock_irqsave(&task->pi_lock, flags); + spin_lock(&task->pi_lock); if (waiter == rt_mutex_top_waiter(lock)) { /* Boost the owner */ @@ -277,10 +301,10 @@ static int rt_mutex_adjust_prio_chain(st __rt_mutex_adjust_prio(task); } - spin_unlock_irqrestore(&task->pi_lock, flags); + spin_unlock(&task->pi_lock); top_waiter = rt_mutex_top_waiter(lock); - spin_unlock(&lock->wait_lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); if (!detect_deadlock && waiter != top_waiter) goto out_put_task; @@ -300,11 +324,10 @@ static int rt_mutex_adjust_prio_chain(st * assigned pending owner [which might not have taken the * lock yet]: */ -static inline int try_to_steal_lock(struct rt_mutex *lock) +static inline int try_to_steal_lock(struct rt_mutex *lock, int mode) { struct task_struct *pendowner = rt_mutex_owner(lock); struct rt_mutex_waiter *next; - unsigned long flags; if (!rt_mutex_owner_pending(lock)) return 0; @@ -312,9 +335,9 @@ static inline int try_to_steal_lock(stru if (pendowner == current) return 1; - spin_lock_irqsave(&pendowner->pi_lock, flags); - if (current->prio >= pendowner->prio) { - spin_unlock_irqrestore(&pendowner->pi_lock, flags); + spin_lock(&pendowner->pi_lock); + if (!lock_is_stealable(pendowner, mode)) { + spin_unlock(&pendowner->pi_lock); return 0; } @@ -324,7 +347,7 @@ static inline int try_to_steal_lock(stru * priority. */ if (likely(!rt_mutex_has_waiters(lock))) { - spin_unlock_irqrestore(&pendowner->pi_lock, flags); + spin_unlock(&pendowner->pi_lock); return 1; } @@ -332,7 +355,7 @@ static inline int try_to_steal_lock(stru next = rt_mutex_top_waiter(lock); plist_del(&next->pi_list_entry, &pendowner->pi_waiters); __rt_mutex_adjust_prio(pendowner); - spin_unlock_irqrestore(&pendowner->pi_lock, flags); + spin_unlock(&pendowner->pi_lock); /* * We are going to steal the lock and a waiter was @@ -349,10 +372,10 @@ static inline int try_to_steal_lock(stru * might be current: */ if (likely(next->task != current)) { - spin_lock_irqsave(¤t->pi_lock, flags); + spin_lock(¤t->pi_lock); plist_add(&next->pi_list_entry, ¤t->pi_waiters); __rt_mutex_adjust_prio(current); - spin_unlock_irqrestore(¤t->pi_lock, flags); + spin_unlock(¤t->pi_lock); } return 1; } @@ -366,7 +389,7 @@ static inline int try_to_steal_lock(stru * * Must be called with lock->wait_lock held. */ -static int try_to_take_rt_mutex(struct rt_mutex *lock) +static int do_try_to_take_rt_mutex(struct rt_mutex *lock, int mode) { /* * We have to be careful here if the atomic speedups are @@ -389,7 +412,7 @@ static int try_to_take_rt_mutex(struct r */ mark_rt_mutex_waiters(lock); - if (rt_mutex_owner(lock) && !try_to_steal_lock(lock)) + if (rt_mutex_owner(lock) && !try_to_steal_lock(lock, mode)) return 0; /* We got the lock. */ @@ -402,6 +425,11 @@ static int try_to_take_rt_mutex(struct r return 1; } +static inline int try_to_take_rt_mutex(struct rt_mutex *lock) +{ + return do_try_to_take_rt_mutex(lock, STEAL_NORMAL); +} + /* * Task blocks on lock. * @@ -411,14 +439,13 @@ static int try_to_take_rt_mutex(struct r */ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, - int detect_deadlock) + int detect_deadlock, unsigned long flags) { struct task_struct *owner = rt_mutex_owner(lock); struct rt_mutex_waiter *top_waiter = waiter; - unsigned long flags; int chain_walk = 0, res; - spin_lock_irqsave(¤t->pi_lock, flags); + spin_lock(¤t->pi_lock); __rt_mutex_adjust_prio(current); waiter->task = current; waiter->lock = lock; @@ -432,17 +459,17 @@ static int task_blocks_on_rt_mutex(struc current->pi_blocked_on = waiter; - spin_unlock_irqrestore(¤t->pi_lock, flags); + spin_unlock(¤t->pi_lock); if (waiter == rt_mutex_top_waiter(lock)) { - spin_lock_irqsave(&owner->pi_lock, flags); + spin_lock(&owner->pi_lock); plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters); plist_add(&waiter->pi_list_entry, &owner->pi_waiters); __rt_mutex_adjust_prio(owner); if (owner->pi_blocked_on) chain_walk = 1; - spin_unlock_irqrestore(&owner->pi_lock, flags); + spin_unlock(&owner->pi_lock); } else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock)) chain_walk = 1; @@ -457,12 +484,12 @@ static int task_blocks_on_rt_mutex(struc */ get_task_struct(owner); - spin_unlock(&lock->wait_lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter, current); - spin_lock(&lock->wait_lock); + spin_lock_irq(&lock->wait_lock); return res; } @@ -475,13 +502,13 @@ static int task_blocks_on_rt_mutex(struc * * Called with lock->wait_lock held. */ -static void wakeup_next_waiter(struct rt_mutex *lock) +static void wakeup_next_waiter(struct rt_mutex *lock, int savestate) { struct rt_mutex_waiter *waiter; struct task_struct *pendowner; - unsigned long flags; + struct rt_mutex_waiter *next; - spin_lock_irqsave(¤t->pi_lock, flags); + spin_lock(¤t->pi_lock); waiter = rt_mutex_top_waiter(lock); plist_del(&waiter->list_entry, &lock->wait_list); @@ -496,9 +523,44 @@ static void wakeup_next_waiter(struct rt pendowner = waiter->task; waiter->task = NULL; + /* + * Do the wakeup before the ownership change to give any spinning + * waiter grantees a headstart over the other threads that will + * trigger once owner changes. + */ + if (!savestate) + wake_up_process(pendowner); + else { + /* + * We can skip the actual (expensive) wakeup if the + * waiter is already running, but we have to be careful + * of race conditions because they may be about to sleep. + * + * The waiter-side protocol has the following pattern: + * 1: Set state != RUNNING + * 2: Conditionally sleep if waiter->task != NULL; + * + * And the owner-side has the following: + * A: Set waiter->task = NULL + * B: Conditionally wake if the state != RUNNING + * + * As long as we ensure 1->2 order, and A->B order, we + * will never miss a wakeup. + * + * Therefore, this barrier ensures that waiter->task = NULL + * is visible before we test the pendowner->state. The + * corresponding barrier is in the sleep logic. + */ + smp_mb(); + + /* If !RUNNING && !RUNNING_MUTEX */ + if (pendowner->state & ~TASK_RUNNING_MUTEX) + wake_up_process_mutex(pendowner); + } + rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING); - spin_unlock_irqrestore(¤t->pi_lock, flags); + spin_unlock(¤t->pi_lock); /* * Clear the pi_blocked_on variable and enqueue a possible @@ -507,7 +569,13 @@ static void wakeup_next_waiter(struct rt * waiter with higher priority than pending-owner->normal_prio * is blocked on the unboosted (pending) owner. */ - spin_lock_irqsave(&pendowner->pi_lock, flags); + + if (rt_mutex_has_waiters(lock)) + next = rt_mutex_top_waiter(lock); + else + next = NULL; + + spin_lock(&pendowner->pi_lock); WARN_ON(!pendowner->pi_blocked_on); WARN_ON(pendowner->pi_blocked_on != waiter); @@ -515,15 +583,10 @@ static void wakeup_next_waiter(struct rt pendowner->pi_blocked_on = NULL; - if (rt_mutex_has_waiters(lock)) { - struct rt_mutex_waiter *next; - - next = rt_mutex_top_waiter(lock); + if (next) plist_add(&next->pi_list_entry, &pendowner->pi_waiters); - } - spin_unlock_irqrestore(&pendowner->pi_lock, flags); - wake_up_process(pendowner); + spin_unlock(&pendowner->pi_lock); } /* @@ -532,22 +595,22 @@ static void wakeup_next_waiter(struct rt * Must be called with lock->wait_lock held */ static void remove_waiter(struct rt_mutex *lock, - struct rt_mutex_waiter *waiter) + struct rt_mutex_waiter *waiter, + unsigned long flags) { int first = (waiter == rt_mutex_top_waiter(lock)); struct task_struct *owner = rt_mutex_owner(lock); - unsigned long flags; int chain_walk = 0; - spin_lock_irqsave(¤t->pi_lock, flags); + spin_lock(¤t->pi_lock); plist_del(&waiter->list_entry, &lock->wait_list); waiter->task = NULL; current->pi_blocked_on = NULL; - spin_unlock_irqrestore(¤t->pi_lock, flags); + spin_unlock(¤t->pi_lock); if (first && owner != current) { - spin_lock_irqsave(&owner->pi_lock, flags); + spin_lock(&owner->pi_lock); plist_del(&waiter->pi_list_entry, &owner->pi_waiters); @@ -562,7 +625,7 @@ static void remove_waiter(struct rt_mute if (owner->pi_blocked_on) chain_walk = 1; - spin_unlock_irqrestore(&owner->pi_lock, flags); + spin_unlock(&owner->pi_lock); } WARN_ON(!plist_node_empty(&waiter->pi_list_entry)); @@ -573,11 +636,11 @@ static void remove_waiter(struct rt_mute /* gets dropped in rt_mutex_adjust_prio_chain()! */ get_task_struct(owner); - spin_unlock(&lock->wait_lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); rt_mutex_adjust_prio_chain(owner, 0, lock, NULL, current); - spin_lock(&lock->wait_lock); + spin_lock_irq(&lock->wait_lock); } /* @@ -598,14 +661,392 @@ void rt_mutex_adjust_pi(struct task_stru return; } - spin_unlock_irqrestore(&task->pi_lock, flags); - /* gets dropped in rt_mutex_adjust_prio_chain()! */ get_task_struct(task); + spin_unlock_irqrestore(&task->pi_lock, flags); + rt_mutex_adjust_prio_chain(task, 0, NULL, NULL, task); } /* + * preemptible spin_lock functions: + */ + +#ifdef CONFIG_PREEMPT_RT + +static inline void +rt_spin_lock_fastlock(struct rt_mutex *lock, + void (*slowfn)(struct rt_mutex *lock)) +{ + /* Temporary HACK! */ + if (likely(!current->in_printk)) + might_sleep(); + else if (in_atomic() || irqs_disabled()) + /* don't grab locks for printk in atomic */ + return; + + if (likely(rt_mutex_cmpxchg(lock, NULL, current))) + rt_mutex_deadlock_account_lock(lock, current); + else + slowfn(lock); +} + +static inline void +rt_spin_lock_fastunlock(struct rt_mutex *lock, + void (*slowfn)(struct rt_mutex *lock)) +{ + /* Temporary HACK! */ + if (unlikely(rt_mutex_owner(lock) != current) && current->in_printk) + /* don't grab locks for printk in atomic */ + return; + + if (likely(rt_mutex_cmpxchg(lock, current, NULL))) + rt_mutex_deadlock_account_unlock(current); + else + slowfn(lock); +} + + +#ifdef CONFIG_SMP +static int adaptive_wait(struct rt_mutex_waiter *waiter, + struct task_struct *orig_owner) +{ + for (;;) { + + /* we are the owner? */ + if (!waiter->task) + return 0; + + /* Owner changed? Then lets update the original */ + if (orig_owner != rt_mutex_owner(waiter->lock)) + return 0; + + /* Owner went to bed, so should we */ + if (!task_is_current(orig_owner)) + return 1; + + cpu_relax(); + } +} +#else +static int adaptive_wait(struct rt_mutex_waiter *waiter, + struct task_struct *orig_owner) +{ + return 1; +} +#endif + +/* + * The state setting needs to preserve the original state and needs to + * take care of non rtmutex wakeups. + * + * Called with rtmutex->wait_lock held to serialize against rtmutex + * wakeups(). + */ +static inline unsigned long +rt_set_current_blocked_state(unsigned long saved_state) +{ + unsigned long state, block_state; + + /* + * If state is TASK_INTERRUPTIBLE, then we set the state for + * blocking to TASK_INTERRUPTIBLE as well, otherwise we would + * miss real wakeups via wake_up_interruptible(). If such a + * wakeup happens we see the running state and preserve it in + * saved_state. Now we can ignore further wakeups as we will + * return in state running from our "spin" sleep. + */ + if (saved_state == TASK_INTERRUPTIBLE) + block_state = TASK_INTERRUPTIBLE; + else + block_state = TASK_UNINTERRUPTIBLE; + + state = xchg(¤t->state, block_state); + /* + * Take care of non rtmutex wakeups. rtmutex wakeups + * or TASK_RUNNING_MUTEX to (UN)INTERRUPTIBLE. + */ + if (state == TASK_RUNNING) + saved_state = TASK_RUNNING; + + return saved_state; +} + +static inline void rt_restore_current_state(unsigned long saved_state) +{ + unsigned long state = xchg(¤t->state, saved_state); + + if (state == TASK_RUNNING) + current->state = TASK_RUNNING; +} + +/* + * Slow path lock function spin_lock style: this variant is very + * careful not to miss any non-lock wakeups. + * + * The wakeup side uses wake_up_process_mutex, which, combined with + * the xchg code of this function is a transparent sleep/wakeup + * mechanism nested within any existing sleep/wakeup mechanism. This + * enables the seemless use of arbitrary (blocking) spinlocks within + * sleep/wakeup event loops. + */ +static void noinline __sched +rt_spin_lock_slowlock(struct rt_mutex *lock) +{ + struct rt_mutex_waiter waiter; + unsigned long saved_state, flags; + struct task_struct *orig_owner; + + debug_rt_mutex_init_waiter(&waiter); + waiter.task = NULL; + + spin_lock_irqsave(&lock->wait_lock, flags); + init_lists(lock); + + BUG_ON(rt_mutex_owner(lock) == current); + + /* + * Here we save whatever state the task was in originally, + * we'll restore it at the end of the function and we'll take + * any intermediate wakeup into account as well, independently + * of the lock sleep/wakeup mechanism. When we get a real + * wakeup the task->state is TASK_RUNNING and we change + * saved_state accordingly. If we did not get a real wakeup + * then we return with the saved state. We need to be careful + * about original state TASK_INTERRUPTIBLE as well, as we + * could miss a wakeup_interruptible() + */ + saved_state = rt_set_current_blocked_state(current->state); + + for (;;) { + unsigned long saved_flags; + int saved_lock_depth = current->lock_depth; + + /* Try to acquire the lock */ + if (do_try_to_take_rt_mutex(lock, STEAL_LATERAL)) + break; + + /* + * waiter.task is NULL the first time we come here and + * when we have been woken up by the previous owner + * but the lock got stolen by an higher prio task. + */ + if (!waiter.task) { + task_blocks_on_rt_mutex(lock, &waiter, 0, flags); + /* Wakeup during boost ? */ + if (unlikely(!waiter.task)) + continue; + } + + /* + * Prevent schedule() to drop BKL, while waiting for + * the lock ! We restore lock_depth when we come back. + */ + saved_flags = current->flags & PF_NOSCHED; + current->lock_depth = -1; + current->flags &= ~PF_NOSCHED; + orig_owner = rt_mutex_owner(lock); + get_task_struct(orig_owner); + spin_unlock_irqrestore(&lock->wait_lock, flags); + + debug_rt_mutex_print_deadlock(&waiter); + + if (adaptive_wait(&waiter, orig_owner)) { + put_task_struct(orig_owner); + + if (waiter.task) + schedule_rt_mutex(lock); + } else + put_task_struct(orig_owner); + + spin_lock_irqsave(&lock->wait_lock, flags); + current->flags |= saved_flags; + current->lock_depth = saved_lock_depth; + saved_state = rt_set_current_blocked_state(saved_state); + } + + rt_restore_current_state(saved_state); + + /* + * Extremely rare case, if we got woken up by a non-mutex wakeup, + * and we managed to steal the lock despite us not being the + * highest-prio waiter (due to SCHED_OTHER changing prio), then we + * can end up with a non-NULL waiter.task: + */ + if (unlikely(waiter.task)) + remove_waiter(lock, &waiter, flags); + /* + * try_to_take_rt_mutex() sets the waiter bit + * unconditionally. We might have to fix that up: + */ + fixup_rt_mutex_waiters(lock); + + spin_unlock_irqrestore(&lock->wait_lock, flags); + + debug_rt_mutex_free_waiter(&waiter); +} + +/* + * Slow path to release a rt_mutex spin_lock style + */ +static void noinline __sched +rt_spin_lock_slowunlock(struct rt_mutex *lock) +{ + unsigned long flags; + + spin_lock_irqsave(&lock->wait_lock, flags); + + debug_rt_mutex_unlock(lock); + + rt_mutex_deadlock_account_unlock(current); + + if (!rt_mutex_has_waiters(lock)) { + lock->owner = NULL; + spin_unlock_irqrestore(&lock->wait_lock, flags); + return; + } + + wakeup_next_waiter(lock, 1); + + spin_unlock_irqrestore(&lock->wait_lock, flags); + + /* Undo pi boosting.when necessary */ + rt_mutex_adjust_prio(current); +} + +void __lockfunc rt_spin_lock(spinlock_t *lock) +{ + rt_spin_lock_fastlock(&lock->lock, rt_spin_lock_slowlock); + spin_acquire(&lock->dep_map, 0, 0, _RET_IP_); +} +EXPORT_SYMBOL(rt_spin_lock); + +void __lockfunc __rt_spin_lock(struct rt_mutex *lock) +{ + rt_spin_lock_fastlock(lock, rt_spin_lock_slowlock); +} +EXPORT_SYMBOL(__rt_spin_lock); + +#ifdef CONFIG_DEBUG_LOCK_ALLOC + +void __lockfunc rt_spin_lock_nested(spinlock_t *lock, int subclass) +{ + rt_spin_lock_fastlock(&lock->lock, rt_spin_lock_slowlock); + spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_); +} +EXPORT_SYMBOL(rt_spin_lock_nested); + +#endif + +void __lockfunc rt_spin_unlock(spinlock_t *lock) +{ + /* NOTE: we always pass in '1' for nested, for simplicity */ + spin_release(&lock->dep_map, 1, _RET_IP_); + rt_spin_lock_fastunlock(&lock->lock, rt_spin_lock_slowunlock); +} +EXPORT_SYMBOL(rt_spin_unlock); + +void __lockfunc __rt_spin_unlock(struct rt_mutex *lock) +{ + rt_spin_lock_fastunlock(lock, rt_spin_lock_slowunlock); +} +EXPORT_SYMBOL(__rt_spin_unlock); + +/* + * Wait for the lock to get unlocked: instead of polling for an unlock + * (like raw spinlocks do), we lock and unlock, to force the kernel to + * schedule if there's contention: + */ +void __lockfunc rt_spin_unlock_wait(spinlock_t *lock) +{ + spin_lock(lock); + spin_unlock(lock); +} +EXPORT_SYMBOL(rt_spin_unlock_wait); + +int __lockfunc rt_spin_trylock(spinlock_t *lock) +{ + int ret = rt_mutex_trylock(&lock->lock); + + if (ret) + spin_acquire(&lock->dep_map, 0, 1, _RET_IP_); + + return ret; +} +EXPORT_SYMBOL(rt_spin_trylock); + +int __lockfunc rt_spin_trylock_irqsave(spinlock_t *lock, unsigned long *flags) +{ + int ret; + + *flags = 0; + ret = rt_mutex_trylock(&lock->lock); + if (ret) + spin_acquire(&lock->dep_map, 0, 1, _RET_IP_); + + return ret; +} +EXPORT_SYMBOL(rt_spin_trylock_irqsave); + +int _atomic_dec_and_spin_lock(spinlock_t *lock, atomic_t *atomic) +{ + /* Subtract 1 from counter unless that drops it to 0 (ie. it was 1) */ + if (atomic_add_unless(atomic, -1, 1)) + return 0; + rt_spin_lock(lock); + if (atomic_dec_and_test(atomic)) + return 1; + rt_spin_unlock(lock); + return 0; +} +EXPORT_SYMBOL(_atomic_dec_and_spin_lock); + +void +__rt_spin_lock_init(spinlock_t *lock, char *name, struct lock_class_key *key) +{ +#ifdef CONFIG_DEBUG_LOCK_ALLOC + /* + * Make sure we are not reinitializing a held lock: + */ + debug_check_no_locks_freed((void *)lock, sizeof(*lock)); + lockdep_init_map(&lock->dep_map, name, key, 0); +#endif + __rt_mutex_init(&lock->lock, name); +} +EXPORT_SYMBOL(__rt_spin_lock_init); + +#endif + +static inline int rt_release_bkl(struct rt_mutex *lock, unsigned long flags) +{ + int saved_lock_depth = current->lock_depth; + +#ifdef CONFIG_LOCK_KERNEL + current->lock_depth = -1; + /* + * try_to_take_lock set the waiters, make sure it's + * still correct. + */ + fixup_rt_mutex_waiters(lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); + + up(&kernel_sem); + + spin_lock_irq(&lock->wait_lock); +#endif + + return saved_lock_depth; +} + +static inline void rt_reacquire_bkl(int saved_lock_depth) +{ +#ifdef CONFIG_LOCK_KERNEL + down(&kernel_sem); + current->lock_depth = saved_lock_depth; +#endif +} + +/* * Slow path lock function: */ static int __sched @@ -613,20 +1054,29 @@ rt_mutex_slowlock(struct rt_mutex *lock, struct hrtimer_sleeper *timeout, int detect_deadlock) { + int ret = 0, saved_lock_depth = -1; struct rt_mutex_waiter waiter; - int ret = 0; + unsigned long flags; debug_rt_mutex_init_waiter(&waiter); waiter.task = NULL; - spin_lock(&lock->wait_lock); + spin_lock_irqsave(&lock->wait_lock, flags); + init_lists(lock); /* Try to acquire the lock again: */ if (try_to_take_rt_mutex(lock)) { - spin_unlock(&lock->wait_lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); return 0; } + /* + * We drop the BKL here before we go into the wait loop to avoid a + * possible deadlock in the scheduler. + */ + if (unlikely(current->lock_depth >= 0)) + saved_lock_depth = rt_release_bkl(lock, flags); + set_current_state(state); /* Setup the timer, when timeout != NULL */ @@ -637,6 +1087,8 @@ rt_mutex_slowlock(struct rt_mutex *lock, } for (;;) { + unsigned long saved_flags; + /* Try to acquire the lock: */ if (try_to_take_rt_mutex(lock)) break; @@ -662,7 +1114,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, */ if (!waiter.task) { ret = task_blocks_on_rt_mutex(lock, &waiter, - detect_deadlock); + detect_deadlock, flags); /* * If we got woken up by the owner then start loop * all over without going into schedule to try @@ -681,22 +1133,26 @@ rt_mutex_slowlock(struct rt_mutex *lock, if (unlikely(ret)) break; } + saved_flags = current->flags & PF_NOSCHED; + current->flags &= ~PF_NOSCHED; - spin_unlock(&lock->wait_lock); + spin_unlock_irq(&lock->wait_lock); debug_rt_mutex_print_deadlock(&waiter); if (waiter.task) schedule_rt_mutex(lock); - spin_lock(&lock->wait_lock); + spin_lock_irq(&lock->wait_lock); + + current->flags |= saved_flags; set_current_state(state); } set_current_state(TASK_RUNNING); if (unlikely(waiter.task)) - remove_waiter(lock, &waiter); + remove_waiter(lock, &waiter, flags); /* * try_to_take_rt_mutex() sets the waiter bit @@ -704,7 +1160,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, */ fixup_rt_mutex_waiters(lock); - spin_unlock(&lock->wait_lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); /* Remove pending timer: */ if (unlikely(timeout)) @@ -718,6 +1174,10 @@ rt_mutex_slowlock(struct rt_mutex *lock, if (unlikely(ret)) rt_mutex_adjust_prio(current); + /* Must we reaquire the BKL? */ + if (unlikely(saved_lock_depth >= 0)) + rt_reacquire_bkl(saved_lock_depth); + debug_rt_mutex_free_waiter(&waiter); return ret; @@ -729,12 +1189,15 @@ rt_mutex_slowlock(struct rt_mutex *lock, static inline int rt_mutex_slowtrylock(struct rt_mutex *lock) { + unsigned long flags; int ret = 0; - spin_lock(&lock->wait_lock); + spin_lock_irqsave(&lock->wait_lock, flags); if (likely(rt_mutex_owner(lock) != current)) { + init_lists(lock); + ret = try_to_take_rt_mutex(lock); /* * try_to_take_rt_mutex() sets the lock waiters @@ -743,7 +1206,7 @@ rt_mutex_slowtrylock(struct rt_mutex *lo fixup_rt_mutex_waiters(lock); } - spin_unlock(&lock->wait_lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); return ret; } @@ -754,7 +1217,9 @@ rt_mutex_slowtrylock(struct rt_mutex *lo static void __sched rt_mutex_slowunlock(struct rt_mutex *lock) { - spin_lock(&lock->wait_lock); + unsigned long flags; + + spin_lock_irqsave(&lock->wait_lock, flags); debug_rt_mutex_unlock(lock); @@ -762,13 +1227,13 @@ rt_mutex_slowunlock(struct rt_mutex *loc if (!rt_mutex_has_waiters(lock)) { lock->owner = NULL; - spin_unlock(&lock->wait_lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); return; } - wakeup_next_waiter(lock); + wakeup_next_waiter(lock, 0); - spin_unlock(&lock->wait_lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); /* Undo pi boosting if necessary: */ rt_mutex_adjust_prio(current); @@ -830,6 +1295,27 @@ rt_mutex_fastunlock(struct rt_mutex *loc } /** + * rt_mutex_lock_killable - lock a rt_mutex killable + * + * @lock: the rt_mutex to be locked + * @detect_deadlock: deadlock detection on/off + * + * Returns: + * 0 on success + * -EINTR when interrupted by a signal + * -EDEADLK when the lock would deadlock (when deadlock detection is on) + */ +int __sched rt_mutex_lock_killable(struct rt_mutex *lock, + int detect_deadlock) +{ + might_sleep(); + + return rt_mutex_fastlock(lock, TASK_KILLABLE, + detect_deadlock, rt_mutex_slowlock); +} +EXPORT_SYMBOL_GPL(rt_mutex_lock_killable); + +/** * rt_mutex_lock - lock a rt_mutex * * @lock: the rt_mutex to be locked Index: linux-2.6-tip/kernel/rwsem.c =================================================================== --- linux-2.6-tip.orig/kernel/rwsem.c +++ linux-2.6-tip/kernel/rwsem.c @@ -16,7 +16,7 @@ /* * lock for reading */ -void __sched down_read(struct rw_semaphore *sem) +void __sched compat_down_read(struct compat_rw_semaphore *sem) { might_sleep(); rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_); @@ -24,12 +24,12 @@ void __sched down_read(struct rw_semapho LOCK_CONTENDED(sem, __down_read_trylock, __down_read); } -EXPORT_SYMBOL(down_read); +EXPORT_SYMBOL(compat_down_read); /* * trylock for reading -- returns 1 if successful, 0 if contention */ -int down_read_trylock(struct rw_semaphore *sem) +int compat_down_read_trylock(struct compat_rw_semaphore *sem) { int ret = __down_read_trylock(sem); @@ -38,12 +38,12 @@ int down_read_trylock(struct rw_semaphor return ret; } -EXPORT_SYMBOL(down_read_trylock); +EXPORT_SYMBOL(compat_down_read_trylock); /* * lock for writing */ -void __sched down_write(struct rw_semaphore *sem) +void __sched compat_down_write(struct compat_rw_semaphore *sem) { might_sleep(); rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_); @@ -51,12 +51,12 @@ void __sched down_write(struct rw_semaph LOCK_CONTENDED(sem, __down_write_trylock, __down_write); } -EXPORT_SYMBOL(down_write); +EXPORT_SYMBOL(compat_down_write); /* * trylock for writing -- returns 1 if successful, 0 if contention */ -int down_write_trylock(struct rw_semaphore *sem) +int compat_down_write_trylock(struct compat_rw_semaphore *sem) { int ret = __down_write_trylock(sem); @@ -65,36 +65,36 @@ int down_write_trylock(struct rw_semapho return ret; } -EXPORT_SYMBOL(down_write_trylock); +EXPORT_SYMBOL(compat_down_write_trylock); /* * release a read lock */ -void up_read(struct rw_semaphore *sem) +void compat_up_read(struct compat_rw_semaphore *sem) { rwsem_release(&sem->dep_map, 1, _RET_IP_); __up_read(sem); } -EXPORT_SYMBOL(up_read); +EXPORT_SYMBOL(compat_up_read); /* * release a write lock */ -void up_write(struct rw_semaphore *sem) +void compat_up_write(struct compat_rw_semaphore *sem) { rwsem_release(&sem->dep_map, 1, _RET_IP_); __up_write(sem); } -EXPORT_SYMBOL(up_write); +EXPORT_SYMBOL(compat_up_write); /* * downgrade write lock to read lock */ -void downgrade_write(struct rw_semaphore *sem) +void compat_downgrade_write(struct compat_rw_semaphore *sem) { /* * lockdep: a downgraded write will live on as a write @@ -103,11 +103,11 @@ void downgrade_write(struct rw_semaphore __downgrade_write(sem); } -EXPORT_SYMBOL(downgrade_write); +EXPORT_SYMBOL(compat_downgrade_write); #ifdef CONFIG_DEBUG_LOCK_ALLOC -void down_read_nested(struct rw_semaphore *sem, int subclass) +void compat_down_read_nested(struct compat_rw_semaphore *sem, int subclass) { might_sleep(); rwsem_acquire_read(&sem->dep_map, subclass, 0, _RET_IP_); @@ -115,18 +115,18 @@ void down_read_nested(struct rw_semaphor LOCK_CONTENDED(sem, __down_read_trylock, __down_read); } -EXPORT_SYMBOL(down_read_nested); +EXPORT_SYMBOL(compat_down_read_nested); -void down_read_non_owner(struct rw_semaphore *sem) +void compat_down_read_non_owner(struct compat_rw_semaphore *sem) { might_sleep(); __down_read(sem); } -EXPORT_SYMBOL(down_read_non_owner); +EXPORT_SYMBOL(compat_down_read_non_owner); -void down_write_nested(struct rw_semaphore *sem, int subclass) +void compat_down_write_nested(struct compat_rw_semaphore *sem, int subclass) { might_sleep(); rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_); @@ -134,14 +134,14 @@ void down_write_nested(struct rw_semapho LOCK_CONTENDED(sem, __down_write_trylock, __down_write); } -EXPORT_SYMBOL(down_write_nested); +EXPORT_SYMBOL(compat_down_write_nested); -void up_read_non_owner(struct rw_semaphore *sem) +void compat_up_read_non_owner(struct compat_rw_semaphore *sem) { __up_read(sem); } -EXPORT_SYMBOL(up_read_non_owner); +EXPORT_SYMBOL(compat_up_read_non_owner); #endif Index: linux-2.6-tip/kernel/sched_clock.c =================================================================== --- linux-2.6-tip.orig/kernel/sched_clock.c +++ linux-2.6-tip/kernel/sched_clock.c @@ -50,7 +50,7 @@ struct sched_clock_data { * from within instrumentation code so we dont want to do any * instrumentation ourselves. */ - raw_spinlock_t lock; + __raw_spinlock_t lock; u64 tick_raw; u64 tick_gtod; @@ -77,7 +77,7 @@ void sched_clock_init(void) for_each_possible_cpu(cpu) { struct sched_clock_data *scd = cpu_sdc(cpu); - scd->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; + scd->lock = (__raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; scd->tick_raw = 0; scd->tick_gtod = ktime_now; scd->clock = ktime_now; Index: linux-2.6-tip/kernel/semaphore.c =================================================================== --- linux-2.6-tip.orig/kernel/semaphore.c +++ linux-2.6-tip/kernel/semaphore.c @@ -33,11 +33,11 @@ #include #include -static noinline void __down(struct semaphore *sem); -static noinline int __down_interruptible(struct semaphore *sem); -static noinline int __down_killable(struct semaphore *sem); -static noinline int __down_timeout(struct semaphore *sem, long jiffies); -static noinline void __up(struct semaphore *sem); +static noinline void __down(struct compat_semaphore *sem); +static noinline int __down_interruptible(struct compat_semaphore *sem); +static noinline int __down_killable(struct compat_semaphore *sem); +static noinline int __down_timeout(struct compat_semaphore *sem, long jiffies); +static noinline void __up(struct compat_semaphore *sem); /** * down - acquire the semaphore @@ -50,7 +50,7 @@ static noinline void __up(struct semapho * Use of this function is deprecated, please use down_interruptible() or * down_killable() instead. */ -void down(struct semaphore *sem) +void compat_down(struct compat_semaphore *sem) { unsigned long flags; @@ -61,7 +61,7 @@ void down(struct semaphore *sem) __down(sem); spin_unlock_irqrestore(&sem->lock, flags); } -EXPORT_SYMBOL(down); +EXPORT_SYMBOL(compat_down); /** * down_interruptible - acquire the semaphore unless interrupted @@ -72,7 +72,7 @@ EXPORT_SYMBOL(down); * If the sleep is interrupted by a signal, this function will return -EINTR. * If the semaphore is successfully acquired, this function returns 0. */ -int down_interruptible(struct semaphore *sem) +int compat_down_interruptible(struct compat_semaphore *sem) { unsigned long flags; int result = 0; @@ -86,7 +86,7 @@ int down_interruptible(struct semaphore return result; } -EXPORT_SYMBOL(down_interruptible); +EXPORT_SYMBOL(compat_down_interruptible); /** * down_killable - acquire the semaphore unless killed @@ -98,7 +98,7 @@ EXPORT_SYMBOL(down_interruptible); * -EINTR. If the semaphore is successfully acquired, this function returns * 0. */ -int down_killable(struct semaphore *sem) +int compat_down_killable(struct compat_semaphore *sem) { unsigned long flags; int result = 0; @@ -112,7 +112,7 @@ int down_killable(struct semaphore *sem) return result; } -EXPORT_SYMBOL(down_killable); +EXPORT_SYMBOL(compat_down_killable); /** * down_trylock - try to acquire the semaphore, without waiting @@ -127,7 +127,7 @@ EXPORT_SYMBOL(down_killable); * Unlike mutex_trylock, this function can be used from interrupt context, * and the semaphore can be released by any task or interrupt. */ -int down_trylock(struct semaphore *sem) +int compat_down_trylock(struct compat_semaphore *sem) { unsigned long flags; int count; @@ -140,7 +140,7 @@ int down_trylock(struct semaphore *sem) return (count < 0); } -EXPORT_SYMBOL(down_trylock); +EXPORT_SYMBOL(compat_down_trylock); /** * down_timeout - acquire the semaphore within a specified time @@ -152,7 +152,7 @@ EXPORT_SYMBOL(down_trylock); * If the semaphore is not released within the specified number of jiffies, * this function returns -ETIME. It returns 0 if the semaphore was acquired. */ -int down_timeout(struct semaphore *sem, long jiffies) +int compat_down_timeout(struct compat_semaphore *sem, long jiffies) { unsigned long flags; int result = 0; @@ -166,7 +166,7 @@ int down_timeout(struct semaphore *sem, return result; } -EXPORT_SYMBOL(down_timeout); +EXPORT_SYMBOL(compat_down_timeout); /** * up - release the semaphore @@ -175,7 +175,7 @@ EXPORT_SYMBOL(down_timeout); * Release the semaphore. Unlike mutexes, up() may be called from any * context and even by tasks which have never called down(). */ -void up(struct semaphore *sem) +void compat_up(struct compat_semaphore *sem) { unsigned long flags; @@ -186,7 +186,7 @@ void up(struct semaphore *sem) __up(sem); spin_unlock_irqrestore(&sem->lock, flags); } -EXPORT_SYMBOL(up); +EXPORT_SYMBOL(compat_up); /* Functions for the contended case */ @@ -201,7 +201,7 @@ struct semaphore_waiter { * constant, and thus optimised away by the compiler. Likewise the * 'timeout' parameter for the cases without timeouts. */ -static inline int __sched __down_common(struct semaphore *sem, long state, +static inline int __sched __down_common(struct compat_semaphore *sem, long state, long timeout) { struct task_struct *task = current; @@ -233,27 +233,27 @@ static inline int __sched __down_common( return -EINTR; } -static noinline void __sched __down(struct semaphore *sem) +static noinline void __sched __down(struct compat_semaphore *sem) { __down_common(sem, TASK_UNINTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT); } -static noinline int __sched __down_interruptible(struct semaphore *sem) +static noinline int __sched __down_interruptible(struct compat_semaphore *sem) { return __down_common(sem, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT); } -static noinline int __sched __down_killable(struct semaphore *sem) +static noinline int __sched __down_killable(struct compat_semaphore *sem) { return __down_common(sem, TASK_KILLABLE, MAX_SCHEDULE_TIMEOUT); } -static noinline int __sched __down_timeout(struct semaphore *sem, long jiffies) +static noinline int __sched __down_timeout(struct compat_semaphore *sem, long jiffies) { return __down_common(sem, TASK_UNINTERRUPTIBLE, jiffies); } -static noinline void __sched __up(struct semaphore *sem) +static noinline void __sched __up(struct compat_semaphore *sem) { struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list, struct semaphore_waiter, list); Index: linux-2.6-tip/kernel/spinlock.c =================================================================== --- linux-2.6-tip.orig/kernel/spinlock.c +++ linux-2.6-tip/kernel/spinlock.c @@ -21,7 +21,7 @@ #include #include -int __lockfunc _spin_trylock(spinlock_t *lock) +int __lockfunc __spin_trylock(raw_spinlock_t *lock) { preempt_disable(); if (_raw_spin_trylock(lock)) { @@ -32,9 +32,46 @@ int __lockfunc _spin_trylock(spinlock_t preempt_enable(); return 0; } -EXPORT_SYMBOL(_spin_trylock); +EXPORT_SYMBOL(__spin_trylock); -int __lockfunc _read_trylock(rwlock_t *lock) +int __lockfunc __spin_trylock_irq(raw_spinlock_t *lock) +{ + local_irq_disable(); + preempt_disable(); + + if (_raw_spin_trylock(lock)) { + spin_acquire(&lock->dep_map, 0, 1, _RET_IP_); + return 1; + } + + __preempt_enable_no_resched(); + local_irq_enable(); + preempt_check_resched(); + + return 0; +} +EXPORT_SYMBOL(__spin_trylock_irq); + +int __lockfunc __spin_trylock_irqsave(raw_spinlock_t *lock, + unsigned long *flags) +{ + local_irq_save(*flags); + preempt_disable(); + + if (_raw_spin_trylock(lock)) { + spin_acquire(&lock->dep_map, 0, 1, _RET_IP_); + return 1; + } + + __preempt_enable_no_resched(); + local_irq_restore(*flags); + preempt_check_resched(); + + return 0; +} +EXPORT_SYMBOL(__spin_trylock_irqsave); + +int __lockfunc __read_trylock(raw_rwlock_t *lock) { preempt_disable(); if (_raw_read_trylock(lock)) { @@ -45,9 +82,9 @@ int __lockfunc _read_trylock(rwlock_t *l preempt_enable(); return 0; } -EXPORT_SYMBOL(_read_trylock); +EXPORT_SYMBOL(__read_trylock); -int __lockfunc _write_trylock(rwlock_t *lock) +int __lockfunc __write_trylock(raw_rwlock_t *lock) { preempt_disable(); if (_raw_write_trylock(lock)) { @@ -58,7 +95,21 @@ int __lockfunc _write_trylock(rwlock_t * preempt_enable(); return 0; } -EXPORT_SYMBOL(_write_trylock); +EXPORT_SYMBOL(__write_trylock); + +int __lockfunc __write_trylock_irqsave(raw_rwlock_t *lock, unsigned long *flags) +{ + int ret; + + local_irq_save(*flags); + ret = __write_trylock(lock); + if (ret) + return ret; + + local_irq_restore(*flags); + return 0; +} +EXPORT_SYMBOL(__write_trylock_irqsave); /* * If lockdep is enabled then we use the non-preemption spin-ops @@ -67,15 +118,15 @@ EXPORT_SYMBOL(_write_trylock); */ #if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC) -void __lockfunc _read_lock(rwlock_t *lock) +void __lockfunc __read_lock(raw_rwlock_t *lock) { preempt_disable(); rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock); } -EXPORT_SYMBOL(_read_lock); +EXPORT_SYMBOL(__read_lock); -unsigned long __lockfunc _spin_lock_irqsave(spinlock_t *lock) +unsigned long __lockfunc __spin_lock_irqsave(raw_spinlock_t *lock) { unsigned long flags; @@ -94,27 +145,27 @@ unsigned long __lockfunc _spin_lock_irqs #endif return flags; } -EXPORT_SYMBOL(_spin_lock_irqsave); +EXPORT_SYMBOL(__spin_lock_irqsave); -void __lockfunc _spin_lock_irq(spinlock_t *lock) +void __lockfunc __spin_lock_irq(raw_spinlock_t *lock) { local_irq_disable(); preempt_disable(); spin_acquire(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock); } -EXPORT_SYMBOL(_spin_lock_irq); +EXPORT_SYMBOL(__spin_lock_irq); -void __lockfunc _spin_lock_bh(spinlock_t *lock) +void __lockfunc __spin_lock_bh(raw_spinlock_t *lock) { local_bh_disable(); preempt_disable(); spin_acquire(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock); } -EXPORT_SYMBOL(_spin_lock_bh); +EXPORT_SYMBOL(__spin_lock_bh); -unsigned long __lockfunc _read_lock_irqsave(rwlock_t *lock) +unsigned long __lockfunc __read_lock_irqsave(raw_rwlock_t *lock) { unsigned long flags; @@ -124,27 +175,27 @@ unsigned long __lockfunc _read_lock_irqs LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock); return flags; } -EXPORT_SYMBOL(_read_lock_irqsave); +EXPORT_SYMBOL(__read_lock_irqsave); -void __lockfunc _read_lock_irq(rwlock_t *lock) +void __lockfunc __read_lock_irq(raw_rwlock_t *lock) { local_irq_disable(); preempt_disable(); rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock); } -EXPORT_SYMBOL(_read_lock_irq); +EXPORT_SYMBOL(__read_lock_irq); -void __lockfunc _read_lock_bh(rwlock_t *lock) +void __lockfunc __read_lock_bh(raw_rwlock_t *lock) { local_bh_disable(); preempt_disable(); rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock); } -EXPORT_SYMBOL(_read_lock_bh); +EXPORT_SYMBOL(__read_lock_bh); -unsigned long __lockfunc _write_lock_irqsave(rwlock_t *lock) +unsigned long __lockfunc __write_lock_irqsave(raw_rwlock_t *lock) { unsigned long flags; @@ -154,43 +205,43 @@ unsigned long __lockfunc _write_lock_irq LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock); return flags; } -EXPORT_SYMBOL(_write_lock_irqsave); +EXPORT_SYMBOL(__write_lock_irqsave); -void __lockfunc _write_lock_irq(rwlock_t *lock) +void __lockfunc __write_lock_irq(raw_rwlock_t *lock) { local_irq_disable(); preempt_disable(); rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock); } -EXPORT_SYMBOL(_write_lock_irq); +EXPORT_SYMBOL(__write_lock_irq); -void __lockfunc _write_lock_bh(rwlock_t *lock) +void __lockfunc __write_lock_bh(raw_rwlock_t *lock) { local_bh_disable(); preempt_disable(); rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock); } -EXPORT_SYMBOL(_write_lock_bh); +EXPORT_SYMBOL(__write_lock_bh); -void __lockfunc _spin_lock(spinlock_t *lock) +void __lockfunc __spin_lock(raw_spinlock_t *lock) { preempt_disable(); spin_acquire(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock); } -EXPORT_SYMBOL(_spin_lock); +EXPORT_SYMBOL(__spin_lock); -void __lockfunc _write_lock(rwlock_t *lock) +void __lockfunc __write_lock(raw_rwlock_t *lock) { preempt_disable(); rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock); } -EXPORT_SYMBOL(_write_lock); +EXPORT_SYMBOL(__write_lock); #else /* CONFIG_PREEMPT: */ @@ -203,7 +254,7 @@ EXPORT_SYMBOL(_write_lock); */ #define BUILD_LOCK_OPS(op, locktype) \ -void __lockfunc _##op##_lock(locktype##_t *lock) \ +void __lockfunc __##op##_lock(locktype##_t *lock) \ { \ for (;;) { \ preempt_disable(); \ @@ -213,15 +264,16 @@ void __lockfunc _##op##_lock(locktype##_ \ if (!(lock)->break_lock) \ (lock)->break_lock = 1; \ - while (!op##_can_lock(lock) && (lock)->break_lock) \ - _raw_##op##_relax(&lock->raw_lock); \ + while (!__raw_##op##_can_lock(&(lock)->raw_lock) && \ + (lock)->break_lock) \ + __raw_##op##_relax(&lock->raw_lock); \ } \ (lock)->break_lock = 0; \ } \ \ -EXPORT_SYMBOL(_##op##_lock); \ +EXPORT_SYMBOL(__##op##_lock); \ \ -unsigned long __lockfunc _##op##_lock_irqsave(locktype##_t *lock) \ +unsigned long __lockfunc __##op##_lock_irqsave(locktype##_t *lock) \ { \ unsigned long flags; \ \ @@ -235,23 +287,24 @@ unsigned long __lockfunc _##op##_lock_ir \ if (!(lock)->break_lock) \ (lock)->break_lock = 1; \ - while (!op##_can_lock(lock) && (lock)->break_lock) \ - _raw_##op##_relax(&lock->raw_lock); \ + while (!__raw_##op##_can_lock(&(lock)->raw_lock) && \ + (lock)->break_lock) \ + __raw_##op##_relax(&lock->raw_lock); \ } \ (lock)->break_lock = 0; \ return flags; \ } \ \ -EXPORT_SYMBOL(_##op##_lock_irqsave); \ +EXPORT_SYMBOL(__##op##_lock_irqsave); \ \ -void __lockfunc _##op##_lock_irq(locktype##_t *lock) \ +void __lockfunc __##op##_lock_irq(locktype##_t *lock) \ { \ - _##op##_lock_irqsave(lock); \ + __##op##_lock_irqsave(lock); \ } \ \ -EXPORT_SYMBOL(_##op##_lock_irq); \ +EXPORT_SYMBOL(__##op##_lock_irq); \ \ -void __lockfunc _##op##_lock_bh(locktype##_t *lock) \ +void __lockfunc __##op##_lock_bh(locktype##_t *lock) \ { \ unsigned long flags; \ \ @@ -260,39 +313,48 @@ void __lockfunc _##op##_lock_bh(locktype /* irq-disabling. We use the generic preemption-aware */ \ /* function: */ \ /**/ \ - flags = _##op##_lock_irqsave(lock); \ + flags = __##op##_lock_irqsave(lock); \ local_bh_disable(); \ local_irq_restore(flags); \ } \ \ -EXPORT_SYMBOL(_##op##_lock_bh) +EXPORT_SYMBOL(__##op##_lock_bh) /* * Build preemption-friendly versions of the following * lock-spinning functions: * - * _[spin|read|write]_lock() - * _[spin|read|write]_lock_irq() - * _[spin|read|write]_lock_irqsave() - * _[spin|read|write]_lock_bh() + * __[spin|read|write]_lock() + * __[spin|read|write]_lock_irq() + * __[spin|read|write]_lock_irqsave() + * __[spin|read|write]_lock_bh() */ -BUILD_LOCK_OPS(spin, spinlock); -BUILD_LOCK_OPS(read, rwlock); -BUILD_LOCK_OPS(write, rwlock); +BUILD_LOCK_OPS(spin, raw_spinlock); +BUILD_LOCK_OPS(read, raw_rwlock); +BUILD_LOCK_OPS(write, raw_rwlock); #endif /* CONFIG_PREEMPT */ #ifdef CONFIG_DEBUG_LOCK_ALLOC -void __lockfunc _spin_lock_nested(spinlock_t *lock, int subclass) +void __lockfunc __spin_lock_nested(raw_spinlock_t *lock, int subclass) { preempt_disable(); spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock); } -EXPORT_SYMBOL(_spin_lock_nested); +EXPORT_SYMBOL(__spin_lock_nested); + +void __lockfunc __spin_lock_nest_lock(raw_spinlock_t *lock, + struct lockdep_map *nest_lock) +{ + preempt_disable(); + spin_acquire_nest(&lock->dep_map, 0, 0, nest_lock, _RET_IP_); + LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock); +} +EXPORT_SYMBOL(__spin_lock_nest_lock); -unsigned long __lockfunc _spin_lock_irqsave_nested(spinlock_t *lock, int subclass) +unsigned long __lockfunc __spin_lock_irqsave_nested(raw_spinlock_t *lock, int subclass) { unsigned long flags; @@ -311,125 +373,130 @@ unsigned long __lockfunc _spin_lock_irqs #endif return flags; } -EXPORT_SYMBOL(_spin_lock_irqsave_nested); - -void __lockfunc _spin_lock_nest_lock(spinlock_t *lock, - struct lockdep_map *nest_lock) -{ - preempt_disable(); - spin_acquire_nest(&lock->dep_map, 0, 0, nest_lock, _RET_IP_); - LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock); -} -EXPORT_SYMBOL(_spin_lock_nest_lock); +EXPORT_SYMBOL(__spin_lock_irqsave_nested); #endif -void __lockfunc _spin_unlock(spinlock_t *lock) +void __lockfunc __spin_unlock(raw_spinlock_t *lock) { spin_release(&lock->dep_map, 1, _RET_IP_); _raw_spin_unlock(lock); preempt_enable(); } -EXPORT_SYMBOL(_spin_unlock); +EXPORT_SYMBOL(__spin_unlock); -void __lockfunc _write_unlock(rwlock_t *lock) +void __lockfunc __spin_unlock_no_resched(raw_spinlock_t *lock) +{ + spin_release(&lock->dep_map, 1, _RET_IP_); + _raw_spin_unlock(lock); + __preempt_enable_no_resched(); +} +/* not exported */ + +void __lockfunc __write_unlock(raw_rwlock_t *lock) { rwlock_release(&lock->dep_map, 1, _RET_IP_); _raw_write_unlock(lock); preempt_enable(); } -EXPORT_SYMBOL(_write_unlock); +EXPORT_SYMBOL(__write_unlock); -void __lockfunc _read_unlock(rwlock_t *lock) +void __lockfunc __read_unlock(raw_rwlock_t *lock) { rwlock_release(&lock->dep_map, 1, _RET_IP_); _raw_read_unlock(lock); preempt_enable(); } -EXPORT_SYMBOL(_read_unlock); +EXPORT_SYMBOL(__read_unlock); -void __lockfunc _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags) +void __lockfunc __spin_unlock_irqrestore(raw_spinlock_t *lock, unsigned long flags) { spin_release(&lock->dep_map, 1, _RET_IP_); _raw_spin_unlock(lock); + __preempt_enable_no_resched(); local_irq_restore(flags); - preempt_enable(); + preempt_check_resched(); } -EXPORT_SYMBOL(_spin_unlock_irqrestore); +EXPORT_SYMBOL(__spin_unlock_irqrestore); -void __lockfunc _spin_unlock_irq(spinlock_t *lock) +void __lockfunc __spin_unlock_irq(raw_spinlock_t *lock) { spin_release(&lock->dep_map, 1, _RET_IP_); _raw_spin_unlock(lock); + __preempt_enable_no_resched(); local_irq_enable(); - preempt_enable(); + preempt_check_resched(); } -EXPORT_SYMBOL(_spin_unlock_irq); +EXPORT_SYMBOL(__spin_unlock_irq); -void __lockfunc _spin_unlock_bh(spinlock_t *lock) +void __lockfunc __spin_unlock_bh(raw_spinlock_t *lock) { spin_release(&lock->dep_map, 1, _RET_IP_); _raw_spin_unlock(lock); - preempt_enable_no_resched(); + __preempt_enable_no_resched(); local_bh_enable_ip((unsigned long)__builtin_return_address(0)); } -EXPORT_SYMBOL(_spin_unlock_bh); +EXPORT_SYMBOL(__spin_unlock_bh); -void __lockfunc _read_unlock_irqrestore(rwlock_t *lock, unsigned long flags) +void __lockfunc __read_unlock_irqrestore(raw_rwlock_t *lock, unsigned long flags) { rwlock_release(&lock->dep_map, 1, _RET_IP_); _raw_read_unlock(lock); + __preempt_enable_no_resched(); local_irq_restore(flags); - preempt_enable(); + preempt_check_resched(); } -EXPORT_SYMBOL(_read_unlock_irqrestore); +EXPORT_SYMBOL(__read_unlock_irqrestore); -void __lockfunc _read_unlock_irq(rwlock_t *lock) +void __lockfunc __read_unlock_irq(raw_rwlock_t *lock) { rwlock_release(&lock->dep_map, 1, _RET_IP_); _raw_read_unlock(lock); + __preempt_enable_no_resched(); local_irq_enable(); - preempt_enable(); + preempt_check_resched(); } -EXPORT_SYMBOL(_read_unlock_irq); +EXPORT_SYMBOL(__read_unlock_irq); -void __lockfunc _read_unlock_bh(rwlock_t *lock) +void __lockfunc __read_unlock_bh(raw_rwlock_t *lock) { rwlock_release(&lock->dep_map, 1, _RET_IP_); _raw_read_unlock(lock); - preempt_enable_no_resched(); + __preempt_enable_no_resched(); local_bh_enable_ip((unsigned long)__builtin_return_address(0)); } -EXPORT_SYMBOL(_read_unlock_bh); +EXPORT_SYMBOL(__read_unlock_bh); -void __lockfunc _write_unlock_irqrestore(rwlock_t *lock, unsigned long flags) +void __lockfunc __write_unlock_irqrestore(raw_rwlock_t *lock, unsigned long flags) { rwlock_release(&lock->dep_map, 1, _RET_IP_); _raw_write_unlock(lock); + __preempt_enable_no_resched(); local_irq_restore(flags); - preempt_enable(); + preempt_check_resched(); } -EXPORT_SYMBOL(_write_unlock_irqrestore); +EXPORT_SYMBOL(__write_unlock_irqrestore); -void __lockfunc _write_unlock_irq(rwlock_t *lock) +void __lockfunc __write_unlock_irq(raw_rwlock_t *lock) { rwlock_release(&lock->dep_map, 1, _RET_IP_); _raw_write_unlock(lock); + __preempt_enable_no_resched(); local_irq_enable(); - preempt_enable(); + preempt_check_resched(); } -EXPORT_SYMBOL(_write_unlock_irq); +EXPORT_SYMBOL(__write_unlock_irq); -void __lockfunc _write_unlock_bh(rwlock_t *lock) +void __lockfunc __write_unlock_bh(raw_rwlock_t *lock) { rwlock_release(&lock->dep_map, 1, _RET_IP_); _raw_write_unlock(lock); - preempt_enable_no_resched(); + __preempt_enable_no_resched(); local_bh_enable_ip((unsigned long)__builtin_return_address(0)); } -EXPORT_SYMBOL(_write_unlock_bh); +EXPORT_SYMBOL(__write_unlock_bh); -int __lockfunc _spin_trylock_bh(spinlock_t *lock) +int __lockfunc __spin_trylock_bh(raw_spinlock_t *lock) { local_bh_disable(); preempt_disable(); @@ -438,11 +505,11 @@ int __lockfunc _spin_trylock_bh(spinlock return 1; } - preempt_enable_no_resched(); + __preempt_enable_no_resched(); local_bh_enable_ip((unsigned long)__builtin_return_address(0)); return 0; } -EXPORT_SYMBOL(_spin_trylock_bh); +EXPORT_SYMBOL(__spin_trylock_bh); notrace int in_lock_functions(unsigned long addr) { @@ -450,6 +517,17 @@ notrace int in_lock_functions(unsigned l extern char __lock_text_start[], __lock_text_end[]; return addr >= (unsigned long)__lock_text_start - && addr < (unsigned long)__lock_text_end; + && addr < (unsigned long)__lock_text_end; } EXPORT_SYMBOL(in_lock_functions); + +void notrace __debug_atomic_dec_and_test(atomic_t *v) +{ + static int warn_once = 1; + + if (!atomic_read(v) && warn_once) { + warn_once = 0; + printk("BUG: atomic counter underflow!\n"); + WARN_ON(1); + } +} Index: linux-2.6-tip/lib/dec_and_lock.c =================================================================== --- linux-2.6-tip.orig/lib/dec_and_lock.c +++ linux-2.6-tip/lib/dec_and_lock.c @@ -17,7 +17,7 @@ * because the spin-lock and the decrement must be * "atomic". */ -int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) +int __atomic_dec_and_spin_lock(raw_spinlock_t *lock, atomic_t *atomic) { #ifdef CONFIG_SMP /* Subtract 1 from counter unless that drops it to 0 (ie. it was 1) */ @@ -32,4 +32,4 @@ int _atomic_dec_and_lock(atomic_t *atomi return 0; } -EXPORT_SYMBOL(_atomic_dec_and_lock); +EXPORT_SYMBOL(__atomic_dec_and_spin_lock); Index: linux-2.6-tip/lib/locking-selftest.c =================================================================== --- linux-2.6-tip.orig/lib/locking-selftest.c +++ linux-2.6-tip/lib/locking-selftest.c @@ -158,7 +158,7 @@ static void init_shared_classes(void) local_bh_disable(); \ local_irq_disable(); \ trace_softirq_enter(); \ - WARN_ON(!in_softirq()); + /* FIXME: preemptible softirqs. WARN_ON(!in_softirq()); */ #define SOFTIRQ_EXIT() \ trace_softirq_exit(); \ @@ -550,6 +550,11 @@ GENERATE_TESTCASE(init_held_rsem) #undef E /* + * FIXME: turns these into raw-spinlock tests on -rt + */ +#ifndef CONFIG_PREEMPT_RT + +/* * locking an irq-safe lock with irqs enabled: */ #define E1() \ @@ -890,6 +895,8 @@ GENERATE_PERMUTATIONS_3_EVENTS(irq_read_ #include "locking-selftest-softirq.h" // GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion2_soft) +#endif /* !CONFIG_PREEMPT_RT */ + #ifdef CONFIG_DEBUG_LOCK_ALLOC # define I_SPINLOCK(x) lockdep_reset_lock(&lock_##x.dep_map) # define I_RWLOCK(x) lockdep_reset_lock(&rwlock_##x.dep_map) @@ -940,6 +947,9 @@ static void dotest(void (*testcase_fn)(v { unsigned long saved_preempt_count = preempt_count(); int expected_failure = 0; +#if defined(CONFIG_DEBUG_PREEMPT) && defined(CONFIG_DEBUG_RT_MUTEXES) + atomic_t saved_lock_count = current->lock_count; +#endif WARN_ON(irqs_disabled()); @@ -989,6 +999,9 @@ static void dotest(void (*testcase_fn)(v #endif reset_locks(); +#if defined(CONFIG_DEBUG_PREEMPT) && defined(CONFIG_DEBUG_RT_MUTEXES) + current->lock_count = saved_lock_count; +#endif } static inline void print_testname(const char *testname) @@ -998,7 +1011,7 @@ static inline void print_testname(const #define DO_TESTCASE_1(desc, name, nr) \ print_testname(desc"/"#nr); \ - dotest(name##_##nr, SUCCESS, LOCKTYPE_RWLOCK); \ + dotest(name##_##nr, SUCCESS, LOCKTYPE_RWLOCK); \ printk("\n"); #define DO_TESTCASE_1B(desc, name, nr) \ @@ -1006,17 +1019,17 @@ static inline void print_testname(const dotest(name##_##nr, FAILURE, LOCKTYPE_RWLOCK); \ printk("\n"); -#define DO_TESTCASE_3(desc, name, nr) \ - print_testname(desc"/"#nr); \ - dotest(name##_spin_##nr, FAILURE, LOCKTYPE_SPIN); \ - dotest(name##_wlock_##nr, FAILURE, LOCKTYPE_RWLOCK); \ +#define DO_TESTCASE_3(desc, name, nr) \ + print_testname(desc"/"#nr); \ + dotest(name##_spin_##nr, FAILURE, LOCKTYPE_SPIN); \ + dotest(name##_wlock_##nr, FAILURE, LOCKTYPE_RWLOCK); \ dotest(name##_rlock_##nr, SUCCESS, LOCKTYPE_RWLOCK); \ printk("\n"); -#define DO_TESTCASE_3RW(desc, name, nr) \ - print_testname(desc"/"#nr); \ +#define DO_TESTCASE_3RW(desc, name, nr) \ + print_testname(desc"/"#nr); \ dotest(name##_spin_##nr, FAILURE, LOCKTYPE_SPIN|LOCKTYPE_RWLOCK);\ - dotest(name##_wlock_##nr, FAILURE, LOCKTYPE_RWLOCK); \ + dotest(name##_wlock_##nr, FAILURE, LOCKTYPE_RWLOCK); \ dotest(name##_rlock_##nr, SUCCESS, LOCKTYPE_RWLOCK); \ printk("\n"); @@ -1047,7 +1060,7 @@ static inline void print_testname(const print_testname(desc); \ dotest(name##_spin, FAILURE, LOCKTYPE_SPIN); \ dotest(name##_wlock, FAILURE, LOCKTYPE_RWLOCK); \ - dotest(name##_rlock, SUCCESS, LOCKTYPE_RWLOCK); \ + dotest(name##_rlock, SUCCESS, LOCKTYPE_RWLOCK); \ dotest(name##_mutex, FAILURE, LOCKTYPE_MUTEX); \ dotest(name##_wsem, FAILURE, LOCKTYPE_RWSEM); \ dotest(name##_rsem, FAILURE, LOCKTYPE_RWSEM); \ @@ -1179,6 +1192,7 @@ void locking_selftest(void) /* * irq-context testcases: */ +#ifndef CONFIG_PREEMPT_RT DO_TESTCASE_2x6("irqs-on + irq-safe-A", irqsafe1); DO_TESTCASE_2x3("sirq-safe-A => hirqs-on", irqsafe2A); DO_TESTCASE_2x6("safe-A + irqs-on", irqsafe2B); @@ -1188,6 +1202,7 @@ void locking_selftest(void) DO_TESTCASE_6x2("irq read-recursion", irq_read_recursion); // DO_TESTCASE_6x2B("irq read-recursion #2", irq_read_recursion2); +#endif if (unexpected_testcase_failures) { printk("-----------------------------------------------------------------\n"); Index: linux-2.6-tip/lib/plist.c =================================================================== --- linux-2.6-tip.orig/lib/plist.c +++ linux-2.6-tip/lib/plist.c @@ -54,7 +54,9 @@ static void plist_check_list(struct list static void plist_check_head(struct plist_head *head) { +#ifndef CONFIG_PREEMPT_RT WARN_ON(!head->lock); +#endif if (head->lock) WARN_ON_SMP(!spin_is_locked(head->lock)); plist_check_list(&head->prio_list); Index: linux-2.6-tip/lib/rwsem-spinlock.c =================================================================== --- linux-2.6-tip.orig/lib/rwsem-spinlock.c +++ linux-2.6-tip/lib/rwsem-spinlock.c @@ -20,7 +20,7 @@ struct rwsem_waiter { /* * initialise the semaphore */ -void __init_rwsem(struct rw_semaphore *sem, const char *name, +void __compat_init_rwsem(struct compat_rw_semaphore *sem, const char *name, struct lock_class_key *key) { #ifdef CONFIG_DEBUG_LOCK_ALLOC @@ -44,8 +44,8 @@ void __init_rwsem(struct rw_semaphore *s * - woken process blocks are discarded from the list after having task zeroed * - writers are only woken if wakewrite is non-zero */ -static inline struct rw_semaphore * -__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite) +static inline struct compat_rw_semaphore * +__rwsem_do_wake(struct compat_rw_semaphore *sem, int wakewrite) { struct rwsem_waiter *waiter; struct task_struct *tsk; @@ -103,8 +103,8 @@ __rwsem_do_wake(struct rw_semaphore *sem /* * wake a single writer */ -static inline struct rw_semaphore * -__rwsem_wake_one_writer(struct rw_semaphore *sem) +static inline struct compat_rw_semaphore * +__rwsem_wake_one_writer(struct compat_rw_semaphore *sem) { struct rwsem_waiter *waiter; struct task_struct *tsk; @@ -125,7 +125,7 @@ __rwsem_wake_one_writer(struct rw_semaph /* * get a read lock on the semaphore */ -void __sched __down_read(struct rw_semaphore *sem) +void __sched __down_read(struct compat_rw_semaphore *sem) { struct rwsem_waiter waiter; struct task_struct *tsk; @@ -168,7 +168,7 @@ void __sched __down_read(struct rw_semap /* * trylock for reading -- returns 1 if successful, 0 if contention */ -int __down_read_trylock(struct rw_semaphore *sem) +int __down_read_trylock(struct compat_rw_semaphore *sem) { unsigned long flags; int ret = 0; @@ -191,7 +191,8 @@ int __down_read_trylock(struct rw_semaph * get a write lock on the semaphore * - we increment the waiting count anyway to indicate an exclusive lock */ -void __sched __down_write_nested(struct rw_semaphore *sem, int subclass) +void __sched +__down_write_nested(struct compat_rw_semaphore *sem, int subclass) { struct rwsem_waiter waiter; struct task_struct *tsk; @@ -231,7 +232,7 @@ void __sched __down_write_nested(struct ; } -void __sched __down_write(struct rw_semaphore *sem) +void __sched __down_write(struct compat_rw_semaphore *sem) { __down_write_nested(sem, 0); } @@ -239,7 +240,7 @@ void __sched __down_write(struct rw_sema /* * trylock for writing -- returns 1 if successful, 0 if contention */ -int __down_write_trylock(struct rw_semaphore *sem) +int __down_write_trylock(struct compat_rw_semaphore *sem) { unsigned long flags; int ret = 0; @@ -260,7 +261,7 @@ int __down_write_trylock(struct rw_semap /* * release a read lock on the semaphore */ -void __up_read(struct rw_semaphore *sem) +void __up_read(struct compat_rw_semaphore *sem) { unsigned long flags; @@ -275,7 +276,7 @@ void __up_read(struct rw_semaphore *sem) /* * release a write lock on the semaphore */ -void __up_write(struct rw_semaphore *sem) +void __up_write(struct compat_rw_semaphore *sem) { unsigned long flags; @@ -292,7 +293,7 @@ void __up_write(struct rw_semaphore *sem * downgrade a write lock into a read lock * - just wake up any readers at the front of the queue */ -void __downgrade_write(struct rw_semaphore *sem) +void __downgrade_write(struct compat_rw_semaphore *sem) { unsigned long flags; @@ -305,7 +306,7 @@ void __downgrade_write(struct rw_semapho spin_unlock_irqrestore(&sem->wait_lock, flags); } -EXPORT_SYMBOL(__init_rwsem); +EXPORT_SYMBOL(__compat_init_rwsem); EXPORT_SYMBOL(__down_read); EXPORT_SYMBOL(__down_read_trylock); EXPORT_SYMBOL(__down_write_nested); Index: linux-2.6-tip/lib/rwsem.c =================================================================== --- linux-2.6-tip.orig/lib/rwsem.c +++ linux-2.6-tip/lib/rwsem.c @@ -11,8 +11,8 @@ /* * Initialize an rwsem: */ -void __init_rwsem(struct rw_semaphore *sem, const char *name, - struct lock_class_key *key) +void __compat_init_rwsem(struct rw_semaphore *sem, const char *name, + struct lock_class_key *key) { #ifdef CONFIG_DEBUG_LOCK_ALLOC /* @@ -26,7 +26,7 @@ void __init_rwsem(struct rw_semaphore *s INIT_LIST_HEAD(&sem->wait_list); } -EXPORT_SYMBOL(__init_rwsem); +EXPORT_SYMBOL(__compat_init_rwsem); struct rwsem_waiter { struct list_head list; Index: linux-2.6-tip/lib/spinlock_debug.c =================================================================== --- linux-2.6-tip.orig/lib/spinlock_debug.c +++ linux-2.6-tip/lib/spinlock_debug.c @@ -13,8 +13,8 @@ #include #include -void __spin_lock_init(spinlock_t *lock, const char *name, - struct lock_class_key *key) +void __raw_spin_lock_init(raw_spinlock_t *lock, const char *name, + struct lock_class_key *key) { #ifdef CONFIG_DEBUG_LOCK_ALLOC /* @@ -23,16 +23,16 @@ void __spin_lock_init(spinlock_t *lock, debug_check_no_locks_freed((void *)lock, sizeof(*lock)); lockdep_init_map(&lock->dep_map, name, key, 0); #endif - lock->raw_lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; + lock->raw_lock = (__raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; lock->magic = SPINLOCK_MAGIC; lock->owner = SPINLOCK_OWNER_INIT; lock->owner_cpu = -1; } -EXPORT_SYMBOL(__spin_lock_init); +EXPORT_SYMBOL(__raw_spin_lock_init); -void __rwlock_init(rwlock_t *lock, const char *name, - struct lock_class_key *key) +void __raw_rwlock_init(raw_rwlock_t *lock, const char *name, + struct lock_class_key *key) { #ifdef CONFIG_DEBUG_LOCK_ALLOC /* @@ -41,15 +41,15 @@ void __rwlock_init(rwlock_t *lock, const debug_check_no_locks_freed((void *)lock, sizeof(*lock)); lockdep_init_map(&lock->dep_map, name, key, 0); #endif - lock->raw_lock = (raw_rwlock_t) __RAW_RW_LOCK_UNLOCKED; + lock->raw_lock = (__raw_rwlock_t) __RAW_RW_LOCK_UNLOCKED; lock->magic = RWLOCK_MAGIC; lock->owner = SPINLOCK_OWNER_INIT; lock->owner_cpu = -1; } -EXPORT_SYMBOL(__rwlock_init); +EXPORT_SYMBOL(__raw_rwlock_init); -static void spin_bug(spinlock_t *lock, const char *msg) +static void spin_bug(raw_spinlock_t *lock, const char *msg) { struct task_struct *owner = NULL; @@ -73,7 +73,7 @@ static void spin_bug(spinlock_t *lock, c #define SPIN_BUG_ON(cond, lock, msg) if (unlikely(cond)) spin_bug(lock, msg) static inline void -debug_spin_lock_before(spinlock_t *lock) +debug_spin_lock_before(raw_spinlock_t *lock) { SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic"); SPIN_BUG_ON(lock->owner == current, lock, "recursion"); @@ -81,13 +81,13 @@ debug_spin_lock_before(spinlock_t *lock) lock, "cpu recursion"); } -static inline void debug_spin_lock_after(spinlock_t *lock) +static inline void debug_spin_lock_after(raw_spinlock_t *lock) { lock->owner_cpu = raw_smp_processor_id(); lock->owner = current; } -static inline void debug_spin_unlock(spinlock_t *lock) +static inline void debug_spin_unlock(raw_spinlock_t *lock) { SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic"); SPIN_BUG_ON(!spin_is_locked(lock), lock, "already unlocked"); @@ -98,7 +98,7 @@ static inline void debug_spin_unlock(spi lock->owner_cpu = -1; } -static void __spin_lock_debug(spinlock_t *lock) +static void __spin_lock_debug(raw_spinlock_t *lock) { u64 i; u64 loops = loops_per_jiffy * HZ; @@ -125,7 +125,7 @@ static void __spin_lock_debug(spinlock_t } } -void _raw_spin_lock(spinlock_t *lock) +void __lockfunc _raw_spin_lock(raw_spinlock_t *lock) { debug_spin_lock_before(lock); if (unlikely(!__raw_spin_trylock(&lock->raw_lock))) @@ -133,7 +133,7 @@ void _raw_spin_lock(spinlock_t *lock) debug_spin_lock_after(lock); } -int _raw_spin_trylock(spinlock_t *lock) +int __lockfunc _raw_spin_trylock(raw_spinlock_t *lock) { int ret = __raw_spin_trylock(&lock->raw_lock); @@ -148,13 +148,13 @@ int _raw_spin_trylock(spinlock_t *lock) return ret; } -void _raw_spin_unlock(spinlock_t *lock) +void __lockfunc _raw_spin_unlock(raw_spinlock_t *lock) { debug_spin_unlock(lock); __raw_spin_unlock(&lock->raw_lock); } -static void rwlock_bug(rwlock_t *lock, const char *msg) +static void rwlock_bug(raw_rwlock_t *lock, const char *msg) { if (!debug_locks_off()) return; @@ -167,8 +167,8 @@ static void rwlock_bug(rwlock_t *lock, c #define RWLOCK_BUG_ON(cond, lock, msg) if (unlikely(cond)) rwlock_bug(lock, msg) -#if 0 /* __write_lock_debug() can lock up - maybe this can too? */ -static void __read_lock_debug(rwlock_t *lock) +#if 1 /* __write_lock_debug() can lock up - maybe this can too? */ +static void __raw_read_lock_debug(raw_rwlock_t *lock) { u64 i; u64 loops = loops_per_jiffy * HZ; @@ -193,13 +193,13 @@ static void __read_lock_debug(rwlock_t * } #endif -void _raw_read_lock(rwlock_t *lock) +void __lockfunc _raw_read_lock(raw_rwlock_t *lock) { RWLOCK_BUG_ON(lock->magic != RWLOCK_MAGIC, lock, "bad magic"); - __raw_read_lock(&lock->raw_lock); + __raw_read_lock_debug(lock); } -int _raw_read_trylock(rwlock_t *lock) +int __lockfunc _raw_read_trylock(raw_rwlock_t *lock) { int ret = __raw_read_trylock(&lock->raw_lock); @@ -212,13 +212,13 @@ int _raw_read_trylock(rwlock_t *lock) return ret; } -void _raw_read_unlock(rwlock_t *lock) +void __lockfunc _raw_read_unlock(raw_rwlock_t *lock) { RWLOCK_BUG_ON(lock->magic != RWLOCK_MAGIC, lock, "bad magic"); __raw_read_unlock(&lock->raw_lock); } -static inline void debug_write_lock_before(rwlock_t *lock) +static inline void debug_write_lock_before(raw_rwlock_t *lock) { RWLOCK_BUG_ON(lock->magic != RWLOCK_MAGIC, lock, "bad magic"); RWLOCK_BUG_ON(lock->owner == current, lock, "recursion"); @@ -226,13 +226,13 @@ static inline void debug_write_lock_befo lock, "cpu recursion"); } -static inline void debug_write_lock_after(rwlock_t *lock) +static inline void debug_write_lock_after(raw_rwlock_t *lock) { lock->owner_cpu = raw_smp_processor_id(); lock->owner = current; } -static inline void debug_write_unlock(rwlock_t *lock) +static inline void debug_write_unlock(raw_rwlock_t *lock) { RWLOCK_BUG_ON(lock->magic != RWLOCK_MAGIC, lock, "bad magic"); RWLOCK_BUG_ON(lock->owner != current, lock, "wrong owner"); @@ -242,8 +242,8 @@ static inline void debug_write_unlock(rw lock->owner_cpu = -1; } -#if 0 /* This can cause lockups */ -static void __write_lock_debug(rwlock_t *lock) +#if 1 /* This can cause lockups */ +static void __raw_write_lock_debug(raw_rwlock_t *lock) { u64 i; u64 loops = loops_per_jiffy * HZ; @@ -268,14 +268,14 @@ static void __write_lock_debug(rwlock_t } #endif -void _raw_write_lock(rwlock_t *lock) +void __lockfunc _raw_write_lock(raw_rwlock_t *lock) { debug_write_lock_before(lock); - __raw_write_lock(&lock->raw_lock); + __raw_write_lock_debug(lock); debug_write_lock_after(lock); } -int _raw_write_trylock(rwlock_t *lock) +int __lockfunc _raw_write_trylock(raw_rwlock_t *lock) { int ret = __raw_write_trylock(&lock->raw_lock); @@ -290,7 +290,7 @@ int _raw_write_trylock(rwlock_t *lock) return ret; } -void _raw_write_unlock(rwlock_t *lock) +void __lockfunc _raw_write_unlock(raw_rwlock_t *lock) { debug_write_unlock(lock); __raw_write_unlock(&lock->raw_lock); Index: linux-2.6-tip/include/linux/irqflags.h =================================================================== --- linux-2.6-tip.orig/include/linux/irqflags.h +++ linux-2.6-tip/include/linux/irqflags.h @@ -13,6 +13,9 @@ #include +/* dummy wrapper for now: */ +#define BUILD_CHECK_IRQ_FLAGS(flags) + #ifdef CONFIG_TRACE_IRQFLAGS extern void trace_softirqs_on(unsigned long ip); extern void trace_softirqs_off(unsigned long ip); Index: linux-2.6-tip/drivers/media/dvb/dvb-core/dvb_frontend.c =================================================================== --- linux-2.6-tip.orig/drivers/media/dvb/dvb-core/dvb_frontend.c +++ linux-2.6-tip/drivers/media/dvb/dvb-core/dvb_frontend.c @@ -101,7 +101,7 @@ struct dvb_frontend_private { struct dvb_device *dvbdev; struct dvb_frontend_parameters parameters; struct dvb_fe_events events; - struct semaphore sem; + struct compat_semaphore sem; struct list_head list_head; wait_queue_head_t wait_queue; struct task_struct *thread; Index: linux-2.6-tip/drivers/net/3c527.c =================================================================== --- linux-2.6-tip.orig/drivers/net/3c527.c +++ linux-2.6-tip/drivers/net/3c527.c @@ -181,7 +181,7 @@ struct mc32_local u16 rx_ring_tail; /* index to rx de-queue end */ - struct semaphore cmd_mutex; /* Serialises issuing of execute commands */ + struct compat_semaphore cmd_mutex; /* Serialises issuing of execute commands */ struct completion execution_cmd; /* Card has completed an execute command */ struct completion xceiver_cmd; /* Card has completed a tx or rx command */ }; Index: linux-2.6-tip/drivers/net/hamradio/6pack.c =================================================================== --- linux-2.6-tip.orig/drivers/net/hamradio/6pack.c +++ linux-2.6-tip/drivers/net/hamradio/6pack.c @@ -120,7 +120,7 @@ struct sixpack { struct timer_list tx_t; struct timer_list resync_t; atomic_t refcnt; - struct semaphore dead_sem; + struct compat_semaphore dead_sem; spinlock_t lock; }; Index: linux-2.6-tip/drivers/net/hamradio/mkiss.c =================================================================== --- linux-2.6-tip.orig/drivers/net/hamradio/mkiss.c +++ linux-2.6-tip/drivers/net/hamradio/mkiss.c @@ -84,7 +84,7 @@ struct mkiss { #define CRC_MODE_SMACK_TEST 4 atomic_t refcnt; - struct semaphore dead_sem; + struct compat_semaphore dead_sem; }; /*---------------------------------------------------------------------------*/ Index: linux-2.6-tip/drivers/net/ppp_async.c =================================================================== --- linux-2.6-tip.orig/drivers/net/ppp_async.c +++ linux-2.6-tip/drivers/net/ppp_async.c @@ -67,7 +67,7 @@ struct asyncppp { struct tasklet_struct tsk; atomic_t refcnt; - struct semaphore dead_sem; + struct compat_semaphore dead_sem; struct ppp_channel chan; /* interface to generic ppp layer */ unsigned char obuf[OBUFSIZE]; }; Index: linux-2.6-tip/drivers/pci/hotplug/ibmphp_hpc.c =================================================================== --- linux-2.6-tip.orig/drivers/pci/hotplug/ibmphp_hpc.c +++ linux-2.6-tip/drivers/pci/hotplug/ibmphp_hpc.c @@ -104,7 +104,7 @@ static int to_debug = 0; static struct mutex sem_hpcaccess; // lock access to HPC static struct semaphore semOperations; // lock all operations and // access to data structures -static struct semaphore sem_exit; // make sure polling thread goes away +static struct compat_semaphore sem_exit; // make sure polling thread goes away static struct task_struct *ibmphp_poll_thread; //---------------------------------------------------------------------------- // local function prototypes Index: linux-2.6-tip/drivers/scsi/aacraid/aacraid.h =================================================================== --- linux-2.6-tip.orig/drivers/scsi/aacraid/aacraid.h +++ linux-2.6-tip/drivers/scsi/aacraid/aacraid.h @@ -719,7 +719,7 @@ struct aac_fib_context { u32 unique; // unique value representing this context ulong jiffies; // used for cleanup - dmb changed to ulong struct list_head next; // used to link context's into a linked list - struct semaphore wait_sem; // this is used to wait for the next fib to arrive. + struct compat_semaphore wait_sem; // this is used to wait for the next fib to arrive. int wait; // Set to true when thread is in WaitForSingleObject unsigned long count; // total number of FIBs on FibList struct list_head fib_list; // this holds fibs and their attachd hw_fibs @@ -789,7 +789,7 @@ struct fib { * This is the event the sendfib routine will wait on if the * caller did not pass one and this is synch io. */ - struct semaphore event_wait; + struct compat_semaphore event_wait; spinlock_t event_lock; u32 done; /* gets set to 1 when fib is complete */ Index: linux-2.6-tip/include/linux/parport.h =================================================================== --- linux-2.6-tip.orig/include/linux/parport.h +++ linux-2.6-tip/include/linux/parport.h @@ -264,7 +264,7 @@ enum ieee1284_phase { struct ieee1284_info { int mode; volatile enum ieee1284_phase phase; - struct semaphore irq; + struct compat_semaphore irq; }; /* A parallel port */ Index: linux-2.6-tip/include/asm-generic/tlb.h =================================================================== --- linux-2.6-tip.orig/include/asm-generic/tlb.h +++ linux-2.6-tip/include/asm-generic/tlb.h @@ -41,11 +41,12 @@ struct mmu_gather { unsigned int nr; /* set to ~0U means fast mode */ unsigned int need_flush;/* Really unmapped some ptes? */ unsigned int fullmm; /* non-zero means full mm flush */ + int cpu; struct page * pages[FREE_PTE_NR]; }; /* Users of the generic TLB shootdown code must declare this storage space. */ -DECLARE_PER_CPU(struct mmu_gather, mmu_gathers); +DECLARE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); /* tlb_gather_mmu * Return a pointer to an initialized struct mmu_gather. @@ -53,8 +54,10 @@ DECLARE_PER_CPU(struct mmu_gather, mmu_g static inline struct mmu_gather * tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush) { - struct mmu_gather *tlb = &get_cpu_var(mmu_gathers); + int cpu; + struct mmu_gather *tlb = &get_cpu_var_locked(mmu_gathers, &cpu); + tlb->cpu = cpu; tlb->mm = mm; /* Use fast mode if only one CPU is online */ @@ -90,7 +93,7 @@ tlb_finish_mmu(struct mmu_gather *tlb, u /* keep the page table cache within bounds */ check_pgt_cache(); - put_cpu_var(mmu_gathers); + put_cpu_var_locked(mmu_gathers, tlb->cpu); } /* tlb_remove_page Index: linux-2.6-tip/mm/swap.c =================================================================== --- linux-2.6-tip.orig/mm/swap.c +++ linux-2.6-tip/mm/swap.c @@ -30,14 +30,49 @@ #include #include #include +#include #include "internal.h" /* How many pages do we try to swap or page in/out together? */ int page_cluster; -static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs); -static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs); +/* + * On PREEMPT_RT we don't want to disable preemption for cpu variables. + * We grab a cpu and then use that cpu to lock the variables accordingly. + * + * (On !PREEMPT_RT this turns into normal preempt-off sections, as before.) + */ +static DEFINE_PER_CPU_LOCKED(struct pagevec[NR_LRU_LISTS], lru_add_pvecs); +static DEFINE_PER_CPU_LOCKED(struct pagevec, lru_rotate_pvecs); + +#define swap_get_cpu_var_irq_save(var, flags, cpu) \ + ({ \ + (void)flags; \ + &get_cpu_var_locked(var, &cpu); \ + }) + +#define swap_put_cpu_var_irq_restore(var, flags, cpu) \ + put_cpu_var_locked(var, cpu) + +#define swap_get_cpu_var(var, cpu) \ + &get_cpu_var_locked(var, &cpu) + +#define swap_put_cpu_var(var, cpu) \ + put_cpu_var_locked(var, cpu) + +#define swap_per_cpu_lock(var, cpu) \ + ({ \ + spin_lock(&__get_cpu_lock(var, cpu)); \ + &__get_cpu_var_locked(var, cpu); \ + }) + +#define swap_per_cpu_unlock(var, cpu) \ + spin_unlock(&__get_cpu_lock(var, cpu)); + +#define swap_get_cpu() raw_smp_processor_id() + +#define swap_put_cpu() /* * This path almost never happens for VM activity - pages are normally @@ -141,13 +176,13 @@ void rotate_reclaimable_page(struct pag !PageUnevictable(page) && PageLRU(page)) { struct pagevec *pvec; unsigned long flags; + int cpu; page_cache_get(page); - local_irq_save(flags); - pvec = &__get_cpu_var(lru_rotate_pvecs); + pvec = swap_get_cpu_var_irq_save(lru_rotate_pvecs, flags, cpu); if (!pagevec_add(pvec, page)) pagevec_move_tail(pvec); - local_irq_restore(flags); + swap_put_cpu_var_irq_restore(lru_rotate_pvecs, flags, cpu); } } @@ -216,12 +251,14 @@ EXPORT_SYMBOL(mark_page_accessed); void __lru_cache_add(struct page *page, enum lru_list lru) { - struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru]; + struct pagevec *pvec; + int cpu; + pvec = swap_get_cpu_var(lru_add_pvecs, cpu)[lru]; page_cache_get(page); if (!pagevec_add(pvec, page)) ____pagevec_lru_add(pvec, lru); - put_cpu_var(lru_add_pvecs); + swap_put_cpu_var(lru_add_pvecs, cpu); } /** @@ -271,31 +308,36 @@ void add_page_to_unevictable_list(struct */ static void drain_cpu_pagevecs(int cpu) { - struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu); - struct pagevec *pvec; + struct pagevec *pvecs, *pvec; int lru; + pvecs = swap_per_cpu_lock(lru_add_pvecs, cpu)[0]; for_each_lru(lru) { pvec = &pvecs[lru - LRU_BASE]; if (pagevec_count(pvec)) ____pagevec_lru_add(pvec, lru); } + swap_per_cpu_unlock(lru_add_pvecs, cpu); - pvec = &per_cpu(lru_rotate_pvecs, cpu); + pvec = swap_per_cpu_lock(lru_rotate_pvecs, cpu); if (pagevec_count(pvec)) { unsigned long flags; /* No harm done if a racing interrupt already did this */ - local_irq_save(flags); + local_irq_save_nort(flags); pagevec_move_tail(pvec); - local_irq_restore(flags); + local_irq_restore_nort(flags); } + swap_per_cpu_unlock(lru_rotate_pvecs, cpu); } void lru_add_drain(void) { - drain_cpu_pagevecs(get_cpu()); - put_cpu(); + int cpu; + + cpu = swap_get_cpu(); + drain_cpu_pagevecs(cpu); + swap_put_cpu(); } static void lru_add_drain_per_cpu(struct work_struct *dummy) @@ -369,7 +411,7 @@ void release_pages(struct page **pages, } __pagevec_free(&pages_to_free); pagevec_reinit(&pages_to_free); - } + } } if (zone) spin_unlock_irqrestore(&zone->lru_lock, flags); Index: linux-2.6-tip/net/ipv4/netfilter/arp_tables.c =================================================================== --- linux-2.6-tip.orig/net/ipv4/netfilter/arp_tables.c +++ linux-2.6-tip/net/ipv4/netfilter/arp_tables.c @@ -239,7 +239,7 @@ unsigned int arpt_do_table(struct sk_buf read_lock_bh(&table->lock); private = table->private; - table_base = (void *)private->entries[smp_processor_id()]; + table_base = (void *)private->entries[raw_smp_processor_id()]; e = get_entry(table_base, private->hook_entry[hook]); back = get_entry(table_base, private->underflow[hook]); @@ -1157,7 +1157,7 @@ static int do_add_counters(struct net *n i = 0; /* Choose the copy that is on our node */ - loc_cpu_entry = private->entries[smp_processor_id()]; + loc_cpu_entry = private->entries[raw_smp_processor_id()]; ARPT_ENTRY_ITERATE(loc_cpu_entry, private->size, add_counter_to_entry, Index: linux-2.6-tip/net/ipv4/netfilter/ip_tables.c =================================================================== --- linux-2.6-tip.orig/net/ipv4/netfilter/ip_tables.c +++ linux-2.6-tip/net/ipv4/netfilter/ip_tables.c @@ -350,7 +350,7 @@ ipt_do_table(struct sk_buff *skb, read_lock_bh(&table->lock); IP_NF_ASSERT(table->valid_hooks & (1 << hook)); private = table->private; - table_base = (void *)private->entries[smp_processor_id()]; + table_base = (void *)private->entries[raw_smp_processor_id()]; e = get_entry(table_base, private->hook_entry[hook]); /* For return from builtin chain */ Index: linux-2.6-tip/net/core/dev.c =================================================================== --- linux-2.6-tip.orig/net/core/dev.c +++ linux-2.6-tip/net/core/dev.c @@ -1878,42 +1878,52 @@ gso: Check this and shot the lock. It is not prone from deadlocks. Either shot noqueue qdisc, it is even simpler 8) */ - if (dev->flags & IFF_UP) { - int cpu = smp_processor_id(); /* ok because BHs are off */ + if (!(dev->flags & IFF_UP)) + goto err; - if (txq->xmit_lock_owner != cpu) { + /* Recursion is detected! It is possible, unfortunately: */ + if (netif_tx_lock_recursion(txq)) + goto err_recursion; - HARD_TX_LOCK(dev, txq, cpu); + HARD_TX_LOCK(dev, txq); - if (!netif_tx_queue_stopped(txq)) { - rc = 0; - if (!dev_hard_start_xmit(skb, dev, txq)) { - HARD_TX_UNLOCK(dev, txq); - goto out; - } - } - HARD_TX_UNLOCK(dev, txq); - if (net_ratelimit()) - printk(KERN_CRIT "Virtual device %s asks to " - "queue packet!\n", dev->name); - } else { - /* Recursion is detected! It is possible, - * unfortunately */ - if (net_ratelimit()) - printk(KERN_CRIT "Dead loop on virtual device " - "%s, fix it urgently!\n", dev->name); - } + if (netif_tx_queue_stopped(txq)) + goto err_tx_unlock; + + if (dev_hard_start_xmit(skb, dev, txq)) + goto err_tx_unlock; + + rc = 0; + HARD_TX_UNLOCK(dev, txq); + +out: + rcu_read_unlock_bh(); + return rc; + +err_recursion: + if (net_ratelimit()) { + printk(KERN_CRIT + "Dead loop on virtual device %s, fix it urgently!\n", + dev->name); + } + goto err; + +err_tx_unlock: + HARD_TX_UNLOCK(dev, txq); + + if (net_ratelimit()) { + printk(KERN_CRIT "Virtual device %s asks to queue packet!\n", + dev->name); } + /* Fall through: */ +err: rc = -ENETDOWN; rcu_read_unlock_bh(); out_kfree_skb: kfree_skb(skb); return rc; -out: - rcu_read_unlock_bh(); - return rc; } @@ -1986,8 +1996,8 @@ int netif_rx_ni(struct sk_buff *skb) { int err; - preempt_disable(); err = netif_rx(skb); + preempt_disable(); if (local_softirq_pending()) do_softirq(); preempt_enable(); @@ -1999,7 +2009,8 @@ EXPORT_SYMBOL(netif_rx_ni); static void net_tx_action(struct softirq_action *h) { - struct softnet_data *sd = &__get_cpu_var(softnet_data); + struct softnet_data *sd = &per_cpu(softnet_data, + raw_smp_processor_id()); if (sd->completion_queue) { struct sk_buff *clist; @@ -2015,6 +2026,11 @@ static void net_tx_action(struct softirq WARN_ON(atomic_read(&skb->users)); __kfree_skb(skb); + /* + * Safe to reschedule - the list is private + * at this point. + */ + cond_resched_softirq_context(); } } @@ -2033,6 +2049,22 @@ static void net_tx_action(struct softirq head = head->next_sched; root_lock = qdisc_lock(q); + /* + * We are executing in softirq context here, and + * if softirqs are preemptible, we must avoid + * infinite reactivation of the softirq by + * either the tx handler, or by netif_schedule(). + * (it would result in an infinitely looping + * softirq context) + * So we take the spinlock unconditionally. + */ +#ifdef CONFIG_PREEMPT_SOFTIRQS + spin_lock(root_lock); + smp_mb__before_clear_bit(); + clear_bit(__QDISC_STATE_SCHED, &q->state); + qdisc_run(q); + spin_unlock(root_lock); +#else if (spin_trylock(root_lock)) { smp_mb__before_clear_bit(); clear_bit(__QDISC_STATE_SCHED, @@ -2049,6 +2081,7 @@ static void net_tx_action(struct softirq &q->state); } } +#endif } } } @@ -2257,7 +2290,7 @@ int netif_receive_skb(struct sk_buff *sk skb->dev = orig_dev->master; } - __get_cpu_var(netdev_rx_stat).total++; + per_cpu(netdev_rx_stat, raw_smp_processor_id()).total++; skb_reset_network_header(skb); skb_reset_transport_header(skb); @@ -2578,9 +2611,10 @@ EXPORT_SYMBOL(napi_gro_frags); static int process_backlog(struct napi_struct *napi, int quota) { int work = 0; - struct softnet_data *queue = &__get_cpu_var(softnet_data); + struct softnet_data *queue; unsigned long start_time = jiffies; + queue = &per_cpu(softnet_data, raw_smp_processor_id()); napi->weight = weight_p; do { struct sk_buff *skb; @@ -2614,7 +2648,7 @@ void __napi_schedule(struct napi_struct local_irq_save(flags); list_add_tail(&n->poll_list, &__get_cpu_var(softnet_data).poll_list); - __raise_softirq_irqoff(NET_RX_SOFTIRQ); + raise_softirq_irqoff(NET_RX_SOFTIRQ); local_irq_restore(flags); } EXPORT_SYMBOL(__napi_schedule); @@ -2762,7 +2796,7 @@ out: softnet_break: __get_cpu_var(netdev_rx_stat).time_squeeze++; - __raise_softirq_irqoff(NET_RX_SOFTIRQ); + raise_softirq_irqoff(NET_RX_SOFTIRQ); goto out; } @@ -4233,7 +4267,7 @@ static void __netdev_init_queue_locks_on { spin_lock_init(&dev_queue->_xmit_lock); netdev_set_xmit_lockdep_class(&dev_queue->_xmit_lock, dev->type); - dev_queue->xmit_lock_owner = -1; + dev_queue->xmit_lock_owner = (void *)-1; } static void netdev_init_queue_locks(struct net_device *dev) Index: linux-2.6-tip/fs/buffer.c =================================================================== --- linux-2.6-tip.orig/fs/buffer.c +++ linux-2.6-tip/fs/buffer.c @@ -40,7 +40,6 @@ #include #include #include -#include static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); @@ -469,8 +468,7 @@ static void end_buffer_async_read(struct * decide that the page is now completely done. */ first = page_buffers(page); - local_irq_save(flags); - bit_spin_lock(BH_Uptodate_Lock, &first->b_state); + spin_lock_irqsave(&first->b_uptodate_lock, flags); clear_buffer_async_read(bh); unlock_buffer(bh); tmp = bh; @@ -483,8 +481,7 @@ static void end_buffer_async_read(struct } tmp = tmp->b_this_page; } while (tmp != bh); - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); /* * If none of the buffers had errors and they are all @@ -496,8 +493,7 @@ static void end_buffer_async_read(struct return; still_busy: - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); return; } @@ -532,8 +528,7 @@ static void end_buffer_async_write(struc } first = page_buffers(page); - local_irq_save(flags); - bit_spin_lock(BH_Uptodate_Lock, &first->b_state); + spin_lock_irqsave(&first->b_uptodate_lock, flags); clear_buffer_async_write(bh); unlock_buffer(bh); @@ -545,14 +540,12 @@ static void end_buffer_async_write(struc } tmp = tmp->b_this_page; } - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); end_page_writeback(page); return; still_busy: - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); return; } @@ -3302,6 +3295,8 @@ struct buffer_head *alloc_buffer_head(gf struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags); if (ret) { INIT_LIST_HEAD(&ret->b_assoc_buffers); + spin_lock_init(&ret->b_uptodate_lock); + spin_lock_init(&ret->b_state_lock); get_cpu_var(bh_accounting).nr++; recalc_bh_state(); put_cpu_var(bh_accounting); @@ -3313,6 +3308,8 @@ EXPORT_SYMBOL(alloc_buffer_head); void free_buffer_head(struct buffer_head *bh) { BUG_ON(!list_empty(&bh->b_assoc_buffers)); + BUG_ON(spin_is_locked(&bh->b_uptodate_lock)); + BUG_ON(spin_is_locked(&bh->b_state_lock)); kmem_cache_free(bh_cachep, bh); get_cpu_var(bh_accounting).nr--; recalc_bh_state(); Index: linux-2.6-tip/fs/ntfs/aops.c =================================================================== --- linux-2.6-tip.orig/fs/ntfs/aops.c +++ linux-2.6-tip/fs/ntfs/aops.c @@ -29,6 +29,7 @@ #include #include #include +#include #include "aops.h" #include "attrib.h" @@ -107,8 +108,7 @@ static void ntfs_end_buffer_async_read(s "0x%llx.", (unsigned long long)bh->b_blocknr); } first = page_buffers(page); - local_irq_save(flags); - bit_spin_lock(BH_Uptodate_Lock, &first->b_state); + spin_lock_irqsave(&first->b_uptodate_lock, flags); clear_buffer_async_read(bh); unlock_buffer(bh); tmp = bh; @@ -123,8 +123,7 @@ static void ntfs_end_buffer_async_read(s } tmp = tmp->b_this_page; } while (tmp != bh); - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); /* * If none of the buffers had errors then we can set the page uptodate, * but we first have to perform the post read mst fixups, if the @@ -145,13 +144,13 @@ static void ntfs_end_buffer_async_read(s recs = PAGE_CACHE_SIZE / rec_size; /* Should have been verified before we got here... */ BUG_ON(!recs); - local_irq_save(flags); + local_irq_save_nort(flags); kaddr = kmap_atomic(page, KM_BIO_SRC_IRQ); for (i = 0; i < recs; i++) post_read_mst_fixup((NTFS_RECORD*)(kaddr + i * rec_size), rec_size); kunmap_atomic(kaddr, KM_BIO_SRC_IRQ); - local_irq_restore(flags); + local_irq_restore_nort(flags); flush_dcache_page(page); if (likely(page_uptodate && !PageError(page))) SetPageUptodate(page); @@ -159,8 +158,7 @@ static void ntfs_end_buffer_async_read(s unlock_page(page); return; still_busy: - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); return; } Index: linux-2.6-tip/include/linux/buffer_head.h =================================================================== --- linux-2.6-tip.orig/include/linux/buffer_head.h +++ linux-2.6-tip/include/linux/buffer_head.h @@ -21,10 +21,6 @@ enum bh_state_bits { BH_Dirty, /* Is dirty */ BH_Lock, /* Is locked */ BH_Req, /* Has been submitted for I/O */ - BH_Uptodate_Lock,/* Used by the first bh in a page, to serialise - * IO completion of other buffers in the page - */ - BH_Mapped, /* Has a disk mapping */ BH_New, /* Disk mapping was newly created by get_block */ BH_Async_Read, /* Is under end_buffer_async_read I/O */ @@ -74,6 +70,8 @@ struct buffer_head { struct address_space *b_assoc_map; /* mapping this buffer is associated with */ atomic_t b_count; /* users using this buffer_head */ + spinlock_t b_uptodate_lock; + spinlock_t b_state_lock; }; /* Index: linux-2.6-tip/include/linux/jbd.h =================================================================== --- linux-2.6-tip.orig/include/linux/jbd.h +++ linux-2.6-tip/include/linux/jbd.h @@ -260,6 +260,15 @@ void buffer_assertion_failure(struct buf #define J_ASSERT_JH(jh, expr) J_ASSERT(expr) #endif +/* + * For assertions that are only valid on SMP (e.g. spin_is_locked()): + */ +#ifdef CONFIG_SMP +# define J_ASSERT_JH_SMP(jh, expr) J_ASSERT_JH(jh, expr) +#else +# define J_ASSERT_JH_SMP(jh, assert) do { } while (0) +#endif + #if defined(JBD_PARANOID_IOFAIL) #define J_EXPECT(expr, why...) J_ASSERT(expr) #define J_EXPECT_BH(bh, expr, why...) J_ASSERT_BH(bh, expr) @@ -315,32 +324,32 @@ static inline struct journal_head *bh2jh static inline void jbd_lock_bh_state(struct buffer_head *bh) { - bit_spin_lock(BH_State, &bh->b_state); + spin_lock(&bh->b_state_lock); } static inline int jbd_trylock_bh_state(struct buffer_head *bh) { - return bit_spin_trylock(BH_State, &bh->b_state); + return spin_trylock(&bh->b_state_lock); } static inline int jbd_is_locked_bh_state(struct buffer_head *bh) { - return bit_spin_is_locked(BH_State, &bh->b_state); + return spin_is_locked(&bh->b_state_lock); } static inline void jbd_unlock_bh_state(struct buffer_head *bh) { - bit_spin_unlock(BH_State, &bh->b_state); + spin_unlock(&bh->b_state_lock); } static inline void jbd_lock_bh_journal_head(struct buffer_head *bh) { - bit_spin_lock(BH_JournalHead, &bh->b_state); + spin_lock_irq(&bh->b_uptodate_lock); } static inline void jbd_unlock_bh_journal_head(struct buffer_head *bh) { - bit_spin_unlock(BH_JournalHead, &bh->b_state); + spin_unlock_irq(&bh->b_uptodate_lock); } struct jbd_revoke_table_s; Index: linux-2.6-tip/fs/jbd/transaction.c =================================================================== --- linux-2.6-tip.orig/fs/jbd/transaction.c +++ linux-2.6-tip/fs/jbd/transaction.c @@ -1582,7 +1582,7 @@ static void __journal_temp_unlink_buffer transaction_t *transaction; struct buffer_head *bh = jh2bh(jh); - J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh)); + J_ASSERT_JH_SMP(jh, jbd_is_locked_bh_state(bh)); transaction = jh->b_transaction; if (transaction) assert_spin_locked(&transaction->t_journal->j_list_lock); @@ -2077,7 +2077,7 @@ void __journal_file_buffer(struct journa int was_dirty = 0; struct buffer_head *bh = jh2bh(jh); - J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh)); + J_ASSERT_JH_SMP(jh, jbd_is_locked_bh_state(bh)); assert_spin_locked(&transaction->t_journal->j_list_lock); J_ASSERT_JH(jh, jh->b_jlist < BJ_Types); @@ -2166,7 +2166,7 @@ void __journal_refile_buffer(struct jour int was_dirty; struct buffer_head *bh = jh2bh(jh); - J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh)); + J_ASSERT_JH_SMP(jh, jbd_is_locked_bh_state(bh)); if (jh->b_transaction) assert_spin_locked(&jh->b_transaction->t_journal->j_list_lock); Index: linux-2.6-tip/fs/proc/stat.c =================================================================== --- linux-2.6-tip.orig/fs/proc/stat.c +++ linux-2.6-tip/fs/proc/stat.c @@ -23,13 +23,14 @@ static int show_stat(struct seq_file *p, { int i, j; unsigned long jif; - cputime64_t user, nice, system, idle, iowait, irq, softirq, steal; + cputime64_t user_rt, user, nice, system_rt, system, idle, + iowait, irq, softirq, steal; cputime64_t guest; u64 sum = 0; struct timespec boottime; unsigned int per_irq_sum; - user = nice = system = idle = iowait = + user_rt = user = nice = system_rt = system = idle = iowait = irq = softirq = steal = cputime64_zero; guest = cputime64_zero; getboottime(&boottime); @@ -44,6 +45,8 @@ static int show_stat(struct seq_file *p, irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq); softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq); steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal); + user_rt = cputime64_add(user_rt, kstat_cpu(i).cpustat.user_rt); + system_rt = cputime64_add(system_rt, kstat_cpu(i).cpustat.system_rt); guest = cputime64_add(guest, kstat_cpu(i).cpustat.guest); for_each_irq_nr(j) { sum += kstat_irqs_cpu(j, i); @@ -52,7 +55,10 @@ static int show_stat(struct seq_file *p, } sum += arch_irq_stat(); - seq_printf(p, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu\n", + user = cputime64_add(user_rt, user); + system = cputime64_add(system_rt, system); + + seq_printf(p, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu\n", (unsigned long long)cputime64_to_clock_t(user), (unsigned long long)cputime64_to_clock_t(nice), (unsigned long long)cputime64_to_clock_t(system), @@ -61,13 +67,17 @@ static int show_stat(struct seq_file *p, (unsigned long long)cputime64_to_clock_t(irq), (unsigned long long)cputime64_to_clock_t(softirq), (unsigned long long)cputime64_to_clock_t(steal), + (unsigned long long)cputime64_to_clock_t(user_rt), + (unsigned long long)cputime64_to_clock_t(system_rt), (unsigned long long)cputime64_to_clock_t(guest)); for_each_online_cpu(i) { /* Copy values here to work around gcc-2.95.3, gcc-2.96 */ - user = kstat_cpu(i).cpustat.user; + user_rt = kstat_cpu(i).cpustat.user_rt; + system_rt = kstat_cpu(i).cpustat.system_rt; + user = cputime64_add(user_rt, kstat_cpu(i).cpustat.user); nice = kstat_cpu(i).cpustat.nice; - system = kstat_cpu(i).cpustat.system; + system = cputime64_add(system_rt, kstat_cpu(i).cpustat.system); idle = kstat_cpu(i).cpustat.idle; iowait = kstat_cpu(i).cpustat.iowait; irq = kstat_cpu(i).cpustat.irq; @@ -75,7 +85,7 @@ static int show_stat(struct seq_file *p, steal = kstat_cpu(i).cpustat.steal; guest = kstat_cpu(i).cpustat.guest; seq_printf(p, - "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu\n", + "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu\n", i, (unsigned long long)cputime64_to_clock_t(user), (unsigned long long)cputime64_to_clock_t(nice), @@ -85,6 +95,8 @@ static int show_stat(struct seq_file *p, (unsigned long long)cputime64_to_clock_t(irq), (unsigned long long)cputime64_to_clock_t(softirq), (unsigned long long)cputime64_to_clock_t(steal), + (unsigned long long)cputime64_to_clock_t(user_rt), + (unsigned long long)cputime64_to_clock_t(system_rt), (unsigned long long)cputime64_to_clock_t(guest)); } seq_printf(p, "intr %llu", (unsigned long long)sum); Index: linux-2.6-tip/kernel/posix-cpu-timers.c =================================================================== --- linux-2.6-tip.orig/kernel/posix-cpu-timers.c +++ linux-2.6-tip/kernel/posix-cpu-timers.c @@ -557,7 +557,7 @@ static void arm_timer(struct k_itimer *t p->cpu_timers : p->signal->cpu_timers); head += CPUCLOCK_WHICH(timer->it_clock); - BUG_ON(!irqs_disabled()); + BUG_ON_NONRT(!irqs_disabled()); spin_lock(&p->sighand->siglock); listpos = head; @@ -745,7 +745,7 @@ int posix_cpu_timer_set(struct k_itimer /* * Disarm any old timer after extracting its expiry time. */ - BUG_ON(!irqs_disabled()); + BUG_ON_NONRT(!irqs_disabled()); ret = 0; spin_lock(&p->sighand->siglock); @@ -1378,12 +1378,11 @@ static inline int fastpath_timer_check(s * already updated our counts. We need to check if any timers fire now. * Interrupts are disabled. */ -void run_posix_cpu_timers(struct task_struct *tsk) +void __run_posix_cpu_timers(struct task_struct *tsk) { LIST_HEAD(firing); struct k_itimer *timer, *next; - BUG_ON(!irqs_disabled()); /* * The fast path checks that there are no expired thread or thread @@ -1435,6 +1434,162 @@ void run_posix_cpu_timers(struct task_st } } +#include +#include +DEFINE_PER_CPU(struct task_struct *, posix_timer_task); +DEFINE_PER_CPU(struct task_struct *, posix_timer_tasklist); + +static int posix_cpu_timers_thread(void *data) +{ + int cpu = (long)data; + + BUG_ON(per_cpu(posix_timer_task,cpu) != current); + + while (!kthread_should_stop()) { + struct task_struct *tsk = NULL; + struct task_struct *next = NULL; + + if (cpu_is_offline(cpu)) + goto wait_to_die; + + /* grab task list */ + raw_local_irq_disable(); + tsk = per_cpu(posix_timer_tasklist, cpu); + per_cpu(posix_timer_tasklist, cpu) = NULL; + raw_local_irq_enable(); + + /* its possible the list is empty, just return */ + if (!tsk) { + set_current_state(TASK_INTERRUPTIBLE); + schedule(); + __set_current_state(TASK_RUNNING); + continue; + } + + /* Process task list */ + while (1) { + /* save next */ + next = tsk->posix_timer_list; + + /* run the task timers, clear its ptr and + * unreference it + */ + __run_posix_cpu_timers(tsk); + tsk->posix_timer_list = NULL; + put_task_struct(tsk); + + /* check if this is the last on the list */ + if (next == tsk) + break; + tsk = next; + } + } + return 0; + +wait_to_die: + /* Wait for kthread_stop */ + set_current_state(TASK_INTERRUPTIBLE); + while (!kthread_should_stop()) { + schedule(); + set_current_state(TASK_INTERRUPTIBLE); + } + __set_current_state(TASK_RUNNING); + return 0; +} + +void run_posix_cpu_timers(struct task_struct *tsk) +{ + unsigned long cpu = smp_processor_id(); + struct task_struct *tasklist; + + BUG_ON(!irqs_disabled()); + if(!per_cpu(posix_timer_task, cpu)) + return; + /* get per-cpu references */ + tasklist = per_cpu(posix_timer_tasklist, cpu); + + /* check to see if we're already queued */ + if (!tsk->posix_timer_list) { + get_task_struct(tsk); + if (tasklist) { + tsk->posix_timer_list = tasklist; + } else { + /* + * The list is terminated by a self-pointing + * task_struct + */ + tsk->posix_timer_list = tsk; + } + per_cpu(posix_timer_tasklist, cpu) = tsk; + } + /* XXX signal the thread somehow */ + wake_up_process(per_cpu(posix_timer_task, cpu)); +} + +/* + * posix_cpu_thread_call - callback that gets triggered when a CPU is added. + * Here we can start up the necessary migration thread for the new CPU. + */ +static int posix_cpu_thread_call(struct notifier_block *nfb, + unsigned long action, void *hcpu) +{ + int cpu = (long)hcpu; + struct task_struct *p; + struct sched_param param; + + switch (action) { + case CPU_UP_PREPARE: + p = kthread_create(posix_cpu_timers_thread, hcpu, + "posixcputmr/%d",cpu); + if (IS_ERR(p)) + return NOTIFY_BAD; + p->flags |= PF_NOFREEZE; + kthread_bind(p, cpu); + /* Must be high prio to avoid getting starved */ + param.sched_priority = MAX_RT_PRIO-1; + sched_setscheduler(p, SCHED_FIFO, ¶m); + per_cpu(posix_timer_task,cpu) = p; + break; + case CPU_ONLINE: + /* Strictly unneccessary, as first user will wake it. */ + wake_up_process(per_cpu(posix_timer_task,cpu)); + break; +#ifdef CONFIG_HOTPLUG_CPU + case CPU_UP_CANCELED: + /* Unbind it from offline cpu so it can run. Fall thru. */ + kthread_bind(per_cpu(posix_timer_task,cpu), + any_online_cpu(cpu_online_map)); + kthread_stop(per_cpu(posix_timer_task,cpu)); + per_cpu(posix_timer_task,cpu) = NULL; + break; + case CPU_DEAD: + kthread_stop(per_cpu(posix_timer_task,cpu)); + per_cpu(posix_timer_task,cpu) = NULL; + break; +#endif + } + return NOTIFY_OK; +} + +/* Register at highest priority so that task migration (migrate_all_tasks) + * happens before everything else. + */ +static struct notifier_block __devinitdata posix_cpu_thread_notifier = { + .notifier_call = posix_cpu_thread_call, + .priority = 10 +}; + +static int __init posix_cpu_thread_init(void) +{ + void *cpu = (void *)(long)smp_processor_id(); + /* Start one for boot CPU. */ + posix_cpu_thread_call(&posix_cpu_thread_notifier, CPU_UP_PREPARE, cpu); + posix_cpu_thread_call(&posix_cpu_thread_notifier, CPU_ONLINE, cpu); + register_cpu_notifier(&posix_cpu_thread_notifier); + return 0; +} +early_initcall(posix_cpu_thread_init); + /* * Set one of the process-wide special case CPU timers. * The tsk->sighand->siglock must be held by the caller. @@ -1700,6 +1855,12 @@ static __init int init_posix_cpu_timers( .nsleep = thread_cpu_nsleep, .nsleep_restart = thread_cpu_nsleep_restart, }; + unsigned long cpu; + + /* init the per-cpu posix_timer_tasklets */ + for_each_cpu_mask(cpu, cpu_possible_map) { + per_cpu(posix_timer_tasklist, cpu) = NULL; + } register_posix_clock(CLOCK_PROCESS_CPUTIME_ID, &process); register_posix_clock(CLOCK_THREAD_CPUTIME_ID, &thread); Index: linux-2.6-tip/drivers/net/3c59x.c =================================================================== --- linux-2.6-tip.orig/drivers/net/3c59x.c +++ linux-2.6-tip/drivers/net/3c59x.c @@ -791,9 +791,9 @@ static void poll_vortex(struct net_devic { struct vortex_private *vp = netdev_priv(dev); unsigned long flags; - local_irq_save(flags); + local_irq_save_nort(flags); (vp->full_bus_master_rx ? boomerang_interrupt:vortex_interrupt)(dev->irq,dev); - local_irq_restore(flags); + local_irq_restore_nort(flags); } #endif @@ -1739,6 +1739,7 @@ vortex_timer(unsigned long data) int next_tick = 60*HZ; int ok = 0; int media_status, old_window; + unsigned long flags; if (vortex_debug > 2) { printk(KERN_DEBUG "%s: Media selection timer tick happened, %s.\n", @@ -1746,7 +1747,7 @@ vortex_timer(unsigned long data) printk(KERN_DEBUG "dev->watchdog_timeo=%d\n", dev->watchdog_timeo); } - disable_irq_lockdep(dev->irq); + spin_lock_irqsave(&vp->lock, flags); old_window = ioread16(ioaddr + EL3_CMD) >> 13; EL3WINDOW(4); media_status = ioread16(ioaddr + Wn4_Media); @@ -1769,10 +1770,7 @@ vortex_timer(unsigned long data) case XCVR_MII: case XCVR_NWAY: { ok = 1; - /* Interrupts are already disabled */ - spin_lock(&vp->lock); vortex_check_media(dev, 0); - spin_unlock(&vp->lock); } break; default: /* Other media types handled by Tx timeouts. */ @@ -1828,7 +1826,7 @@ leave_media_alone: dev->name, media_tbl[dev->if_port].name); EL3WINDOW(old_window); - enable_irq_lockdep(dev->irq); + spin_unlock_irqrestore(&vp->lock, flags); mod_timer(&vp->timer, RUN_AT(next_tick)); if (vp->deferred) iowrite16(FakeIntr, ioaddr + EL3_CMD); @@ -1862,12 +1860,12 @@ static void vortex_tx_timeout(struct net * Block interrupts because vortex_interrupt does a bare spin_lock() */ unsigned long flags; - local_irq_save(flags); + local_irq_save_nort(flags); if (vp->full_bus_master_tx) boomerang_interrupt(dev->irq, dev); else vortex_interrupt(dev->irq, dev); - local_irq_restore(flags); + local_irq_restore_nort(flags); } } Index: linux-2.6-tip/drivers/serial/8250.c =================================================================== --- linux-2.6-tip.orig/drivers/serial/8250.c +++ linux-2.6-tip/drivers/serial/8250.c @@ -30,6 +30,7 @@ #include #include #include +#include #include #include #include @@ -1546,7 +1547,10 @@ static irqreturn_t serial8250_interrupt( { struct irq_info *i = dev_id; struct list_head *l, *end = NULL; - int pass_counter = 0, handled = 0; +#ifndef CONFIG_PREEMPT_RT + int pass_counter = 0; +#endif + int handled = 0; DEBUG_INTR("serial8250_interrupt(%d)...", irq); @@ -1584,12 +1588,18 @@ static irqreturn_t serial8250_interrupt( l = l->next; + /* + * On preempt-rt we can be preempted and run in our + * own thread. + */ +#ifndef CONFIG_PREEMPT_RT if (l == i->head && pass_counter++ > PASS_LIMIT) { /* If we hit this, we're dead. */ printk(KERN_ERR "serial8250: too much work for " "irq%d\n", irq); break; } +#endif } while (l != end); spin_unlock(&i->lock); @@ -2707,14 +2717,10 @@ serial8250_console_write(struct console touch_nmi_watchdog(); - local_irq_save(flags); - if (up->port.sysrq) { - /* serial8250_handle_port() already took the lock */ - locked = 0; - } else if (oops_in_progress) { - locked = spin_trylock(&up->port.lock); - } else - spin_lock(&up->port.lock); + if (up->port.sysrq || oops_in_progress || preempt_rt) + locked = spin_trylock_irqsave(&up->port.lock, flags); + else + spin_lock_irqsave(&up->port.lock, flags); /* * First save the IER then disable the interrupts @@ -2746,8 +2752,7 @@ serial8250_console_write(struct console check_modem_status(up); if (locked) - spin_unlock(&up->port.lock); - local_irq_restore(flags); + spin_unlock_irqrestore(&up->port.lock, flags); } static int __init serial8250_console_setup(struct console *co, char *options) Index: linux-2.6-tip/drivers/char/tty_buffer.c =================================================================== --- linux-2.6-tip.orig/drivers/char/tty_buffer.c +++ linux-2.6-tip/drivers/char/tty_buffer.c @@ -482,10 +482,14 @@ void tty_flip_buffer_push(struct tty_str tty->buf.tail->commit = tty->buf.tail->used; spin_unlock_irqrestore(&tty->buf.lock, flags); +#ifndef CONFIG_PREEMPT_RT if (tty->low_latency) flush_to_ldisc(&tty->buf.work.work); else schedule_delayed_work(&tty->buf.work, 1); +#else + flush_to_ldisc(&tty->buf.work.work); +#endif } EXPORT_SYMBOL(tty_flip_buffer_push); Index: linux-2.6-tip/arch/x86/include/asm/vgtod.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/vgtod.h +++ linux-2.6-tip/arch/x86/include/asm/vgtod.h @@ -5,7 +5,7 @@ #include struct vsyscall_gtod_data { - seqlock_t lock; + raw_seqlock_t lock; /* open coded 'struct timespec' */ time_t wall_time_sec; Index: linux-2.6-tip/arch/Kconfig =================================================================== --- linux-2.6-tip.orig/arch/Kconfig +++ linux-2.6-tip/arch/Kconfig @@ -32,6 +32,11 @@ config OPROFILE_IBS config HAVE_OPROFILE bool +config PROFILE_NMI + bool + depends on OPROFILE + default y + config KPROBES bool "Kprobes" depends on KALLSYMS && MODULES Index: linux-2.6-tip/arch/x86/include/asm/highmem.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/highmem.h +++ linux-2.6-tip/arch/x86/include/asm/highmem.h @@ -58,6 +58,17 @@ extern void *kmap_high(struct page *page extern void kunmap_high(struct page *page); void *kmap(struct page *page); +extern void kunmap_virt(void *ptr); +extern struct page *kmap_to_page(void *ptr); +void kunmap(struct page *page); + +void *__kmap_atomic_prot(struct page *page, enum km_type type, pgprot_t prot); +void *__kmap_atomic(struct page *page, enum km_type type); +void *__kmap_atomic_direct(struct page *page, enum km_type type); +void __kunmap_atomic(void *kvaddr, enum km_type type); +void *__kmap_atomic_pfn(unsigned long pfn, enum km_type type); +struct page *__kmap_atomic_to_page(void *ptr); + void kunmap(struct page *page); void *kmap_atomic_prot(struct page *page, enum km_type type, pgprot_t prot); void *kmap_atomic(struct page *page, enum km_type type); @@ -66,7 +77,8 @@ void *kmap_atomic_pfn(unsigned long pfn, struct page *kmap_atomic_to_page(void *ptr); #ifndef CONFIG_PARAVIRT -#define kmap_atomic_pte(page, type) kmap_atomic(page, type) +#define kmap_atomic_pte(page, type) kmap_atomic(page, type) +#define kmap_atomic_pte_direct(page, type) kmap_atomic_direct(page, type) #endif #define flush_cache_kmaps() do { } while (0) @@ -74,6 +86,27 @@ struct page *kmap_atomic_to_page(void *p extern void add_highpages_with_active_regions(int nid, unsigned long start_pfn, unsigned long end_pfn); +/* + * on PREEMPT_RT kmap_atomic() is a wrapper that uses kmap(): + */ +#ifdef CONFIG_PREEMPT_RT +# define kmap_atomic_prot(page, type, prot) ({ pagefault_disable(); kmap(page); }) +# define kmap_atomic(page, type) ({ pagefault_disable(); kmap(page); }) +# define kmap_atomic_pfn(pfn, type) kmap(pfn_to_page(pfn)) +# define kunmap_atomic(kvaddr, type) do { pagefault_enable(); kunmap_virt(kvaddr); } while(0) +# define kmap_atomic_to_page(kvaddr) kmap_to_page(kvaddr) +# define kmap_atomic_direct(page, type) __kmap_atomic_direct(page, type) +# define kunmap_atomic_direct(kvaddr, type) __kunmap_atomic(kvaddr, type) +#else +# define kmap_atomic_prot(page, type, prot) __kmap_atomic_prot(page, type, prot) +# define kmap_atomic(page, type) __kmap_atomic(page, type) +# define kmap_atomic_pfn(pfn, type) __kmap_atomic_pfn(pfn, type) +# define kunmap_atomic(kvaddr, type) __kunmap_atomic(kvaddr, type) +# define kmap_atomic_to_page(kvaddr) __kmap_atomic_to_page(kvaddr) +# define kmap_atomic_direct(page, type) __kmap_atomic(page, type) +# define kunmap_atomic_direct(kvaddr, type) __kunmap_atomic(kvaddr, type) +#endif + #endif /* __KERNEL__ */ #endif /* _ASM_X86_HIGHMEM_H */ Index: linux-2.6-tip/arch/x86/include/asm/i8253.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/i8253.h +++ linux-2.6-tip/arch/x86/include/asm/i8253.h @@ -6,7 +6,7 @@ #define PIT_CH0 0x40 #define PIT_CH2 0x42 -extern spinlock_t i8253_lock; +extern raw_spinlock_t i8253_lock; extern struct clock_event_device *global_clock_event; Index: linux-2.6-tip/arch/x86/include/asm/pci_x86.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/pci_x86.h +++ linux-2.6-tip/arch/x86/include/asm/pci_x86.h @@ -83,7 +83,7 @@ struct irq_routing_table { extern unsigned int pcibios_irq_mask; extern int pcibios_scanned; -extern spinlock_t pci_config_lock; +extern raw_spinlock_t pci_config_lock; extern int (*pcibios_enable_irq)(struct pci_dev *dev); extern void (*pcibios_disable_irq)(struct pci_dev *dev); Index: linux-2.6-tip/arch/x86/include/asm/xor_32.h =================================================================== --- linux-2.6-tip.orig/arch/x86/include/asm/xor_32.h +++ linux-2.6-tip/arch/x86/include/asm/xor_32.h @@ -865,7 +865,21 @@ static struct xor_block_template xor_blo #include #undef XOR_TRY_TEMPLATES -#define XOR_TRY_TEMPLATES \ +/* + * MMX/SSE ops disable preemption for long periods of time, + * so on PREEMPT_RT use the register-based ops only: + */ +#ifdef CONFIG_PREEMPT_RT +# define XOR_TRY_TEMPLATES \ + do { \ + xor_speed(&xor_block_8regs); \ + xor_speed(&xor_block_8regs_p); \ + xor_speed(&xor_block_32regs); \ + xor_speed(&xor_block_32regs_p); \ + } while (0) +# define XOR_SELECT_TEMPLATE(FASTEST) (FASTEST) +#else +# define XOR_TRY_TEMPLATES \ do { \ xor_speed(&xor_block_8regs); \ xor_speed(&xor_block_8regs_p); \ @@ -882,7 +896,8 @@ do { \ /* We force the use of the SSE xor block because it can write around L2. We may also be able to load into the L1 only depending on how the cpu deals with a load to a line that is being prefetched. */ -#define XOR_SELECT_TEMPLATE(FASTEST) \ +# define XOR_SELECT_TEMPLATE(FASTEST) \ (cpu_has_xmm ? &xor_block_pIII_sse : FASTEST) +#endif /* CONFIG_PREEMPT_RT */ #endif /* _ASM_X86_XOR_32_H */ Index: linux-2.6-tip/arch/x86/kernel/cpu/mtrr/generic.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/cpu/mtrr/generic.c +++ linux-2.6-tip/arch/x86/kernel/cpu/mtrr/generic.c @@ -486,7 +486,7 @@ static unsigned long set_mtrr_state(void static unsigned long cr4 = 0; -static DEFINE_SPINLOCK(set_atomicity_lock); +static DEFINE_RAW_SPINLOCK(set_atomicity_lock); /* * Since we are disabling the cache don't allow any interrupts - they Index: linux-2.6-tip/arch/x86/kernel/dumpstack_32.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/dumpstack_32.c +++ linux-2.6-tip/arch/x86/kernel/dumpstack_32.c @@ -93,6 +93,12 @@ show_stack_log_lvl(struct task_struct *t } +#if defined(CONFIG_DEBUG_STACKOVERFLOW) && defined(CONFIG_EVENT_TRACE) +extern unsigned long worst_stack_left; +#else +# define worst_stack_left -1L +#endif + void show_registers(struct pt_regs *regs) { int i; Index: linux-2.6-tip/arch/x86/kernel/i8253.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/i8253.c +++ linux-2.6-tip/arch/x86/kernel/i8253.c @@ -15,7 +15,7 @@ #include #include -DEFINE_SPINLOCK(i8253_lock); +DEFINE_RAW_SPINLOCK(i8253_lock); EXPORT_SYMBOL(i8253_lock); #ifdef CONFIG_X86_32 Index: linux-2.6-tip/arch/x86/kernel/microcode_amd.c =================================================================== --- linux-2.6-tip.orig/arch/x86/kernel/microcode_amd.c +++ linux-2.6-tip/arch/x86/kernel/microcode_amd.c @@ -81,7 +81,7 @@ struct microcode_amd { #define UCODE_CONTAINER_HEADER_SIZE 12 /* serialize access to the physical write */ -static DEFINE_SPINLOCK(microcode_update_lock); +static DEFINE_RAW_SPINLOCK(microcode_update_lock); static struct equiv_cpu_entry *equiv_cpu_table; Index: linux-2.6-tip/arch/x86/pci/common.c =================================================================== --- linux-2.6-tip.orig/arch/x86/pci/common.c +++ linux-2.6-tip/arch/x86/pci/common.c @@ -81,7 +81,7 @@ int pcibios_scanned; * This interrupt-safe spinlock protects all accesses to PCI * configuration space. */ -DEFINE_SPINLOCK(pci_config_lock); +DEFINE_RAW_SPINLOCK(pci_config_lock); static int __devinit can_skip_ioresource_align(const struct dmi_system_id *d) { Index: linux-2.6-tip/arch/x86/pci/direct.c =================================================================== --- linux-2.6-tip.orig/arch/x86/pci/direct.c +++ linux-2.6-tip/arch/x86/pci/direct.c @@ -223,16 +223,23 @@ static int __init pci_check_type1(void) unsigned int tmp; int works = 0; - local_irq_save(flags); + spin_lock_irqsave(&pci_config_lock, flags); outb(0x01, 0xCFB); tmp = inl(0xCF8); outl(0x80000000, 0xCF8); - if (inl(0xCF8) == 0x80000000 && pci_sanity_check(&pci_direct_conf1)) { - works = 1; + + if (inl(0xCF8) == 0x80000000) { + spin_unlock_irqrestore(&pci_config_lock, flags); + + if (pci_sanity_check(&pci_direct_conf1)) + works = 1; + + spin_lock_irqsave(&pci_config_lock, flags); } outl(tmp, 0xCF8); - local_irq_restore(flags); + + spin_unlock_irqrestore(&pci_config_lock, flags); return works; } @@ -242,17 +249,19 @@ static int __init pci_check_type2(void) unsigned long flags; int works = 0; - local_irq_save(flags); + spin_lock_irqsave(&pci_config_lock, flags); outb(0x00, 0xCFB); outb(0x00, 0xCF8); outb(0x00, 0xCFA); - if (inb(0xCF8) == 0x00 && inb(0xCFA) == 0x00 && - pci_sanity_check(&pci_direct_conf2)) { - works = 1; - } - local_irq_restore(flags); + if (inb(0xCF8) == 0x00 && inb(0xCFA) == 0x00) { + spin_unlock_irqrestore(&pci_config_lock, flags); + + if (pci_sanity_check(&pci_direct_conf2)) + works = 1; + } else + spin_unlock_irqrestore(&pci_config_lock, flags); return works; } Index: linux-2.6-tip/kernel/sched_cpupri.h =================================================================== --- linux-2.6-tip.orig/kernel/sched_cpupri.h +++ linux-2.6-tip/kernel/sched_cpupri.h @@ -12,7 +12,7 @@ /* values 2-101 are RT priorities 0-99 */ struct cpupri_vec { - spinlock_t lock; + raw_spinlock_t lock; int count; cpumask_var_t mask; }; Index: linux-2.6-tip/include/linux/profile.h =================================================================== --- linux-2.6-tip.orig/include/linux/profile.h +++ linux-2.6-tip/include/linux/profile.h @@ -4,14 +4,16 @@ #include #include #include +#include #include #include -#define CPU_PROFILING 1 -#define SCHED_PROFILING 2 -#define SLEEP_PROFILING 3 -#define KVM_PROFILING 4 +#define CPU_PROFILING 1 +#define SCHED_PROFILING 2 +#define SLEEP_PROFILING 3 +#define KVM_PROFILING 4 +#define PREEMPT_PROFILING 5 struct proc_dir_entry; struct pt_regs; @@ -36,6 +38,8 @@ enum profile_type { PROFILE_MUNMAP }; +extern int prof_pid; + #ifdef CONFIG_PROFILING extern int prof_on __read_mostly; Index: linux-2.6-tip/include/linux/radix-tree.h =================================================================== --- linux-2.6-tip.orig/include/linux/radix-tree.h +++ linux-2.6-tip/include/linux/radix-tree.h @@ -167,7 +167,18 @@ radix_tree_gang_lookup_slot(struct radix unsigned long first_index, unsigned int max_items); unsigned long radix_tree_next_hole(struct radix_tree_root *root, unsigned long index, unsigned long max_scan); +/* + * On a mutex based kernel we can freely schedule within the radix code: + */ +#ifdef CONFIG_PREEMPT_RT +static inline int radix_tree_preload(gfp_t gfp_mask) +{ + return 0; +} +#else int radix_tree_preload(gfp_t gfp_mask); +#endif + void radix_tree_init(void); void *radix_tree_tag_set(struct radix_tree_root *root, unsigned long index, unsigned int tag); @@ -187,7 +198,9 @@ int radix_tree_tagged(struct radix_tree_ static inline void radix_tree_preload_end(void) { +#ifndef CONFIG_PREEMPT_RT preempt_enable(); +#endif } #endif /* _LINUX_RADIX_TREE_H */ Index: linux-2.6-tip/include/linux/smp_lock.h =================================================================== --- linux-2.6-tip.orig/include/linux/smp_lock.h +++ linux-2.6-tip/include/linux/smp_lock.h @@ -45,7 +45,7 @@ static inline void cycle_kernel_lock(voi #define unlock_kernel() do { } while(0) #define release_kernel_lock(task) do { } while(0) #define cycle_kernel_lock() do { } while(0) -#define reacquire_kernel_lock(task) 0 +#define reacquire_kernel_lock(task) do { } while(0) #define kernel_locked() 1 #endif /* CONFIG_LOCK_KERNEL */ Index: linux-2.6-tip/include/linux/workqueue.h =================================================================== --- linux-2.6-tip.orig/include/linux/workqueue.h +++ linux-2.6-tip/include/linux/workqueue.h @@ -190,6 +190,9 @@ __create_workqueue_key(const char *name, #define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1, 0) #define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0, 0) +extern void set_workqueue_prio(struct workqueue_struct *wq, int policy, + int rt_priority, int nice); + extern void destroy_workqueue(struct workqueue_struct *wq); extern int queue_work(struct workqueue_struct *wq, struct work_struct *work); Index: linux-2.6-tip/kernel/notifier.c =================================================================== --- linux-2.6-tip.orig/kernel/notifier.c +++ linux-2.6-tip/kernel/notifier.c @@ -71,7 +71,7 @@ static int notifier_chain_unregister(str * @returns: notifier_call_chain returns the value returned by the * last notifier function called. */ -static int __kprobes notifier_call_chain(struct notifier_block **nl, +static int __kprobes notrace notifier_call_chain(struct notifier_block **nl, unsigned long val, void *v, int nr_to_call, int *nr_calls) { @@ -217,7 +217,7 @@ int blocking_notifier_chain_register(str * not yet working and interrupts must remain disabled. At * such times we must not call down_write(). */ - if (unlikely(system_state == SYSTEM_BOOTING)) + if (unlikely(system_state < SYSTEM_RUNNING)) return notifier_chain_register(&nh->head, n); down_write(&nh->rwsem); Index: linux-2.6-tip/kernel/signal.c =================================================================== --- linux-2.6-tip.orig/kernel/signal.c +++ linux-2.6-tip/kernel/signal.c @@ -821,7 +821,9 @@ static int send_signal(int sig, struct s trace_sched_signal_send(sig, t); +#ifdef CONFIG_SMP assert_spin_locked(&t->sighand->siglock); +#endif if (!prepare_signal(sig, t)) return 0; @@ -1576,6 +1578,7 @@ static void ptrace_stop(int exit_code, i if (may_ptrace_stop()) { do_notify_parent_cldstop(current, CLD_TRAPPED); read_unlock(&tasklist_lock); + current->flags &= ~PF_NOSCHED; schedule(); } else { /* @@ -1644,6 +1647,7 @@ finish_stop(int stop_count) } do { + current->flags &= ~PF_NOSCHED; schedule(); } while (try_to_freeze()); /* Index: linux-2.6-tip/kernel/user.c =================================================================== --- linux-2.6-tip.orig/kernel/user.c +++ linux-2.6-tip/kernel/user.c @@ -299,14 +299,14 @@ static void remove_user_sysfs_dir(struct */ uids_mutex_lock(); - local_irq_save(flags); + local_irq_save_nort(flags); if (atomic_dec_and_lock(&up->__count, &uidhash_lock)) { uid_hash_remove(up); remove_user = 1; spin_unlock_irqrestore(&uidhash_lock, flags); } else { - local_irq_restore(flags); + local_irq_restore_nort(flags); } if (!remove_user) @@ -387,11 +387,11 @@ void free_uid(struct user_struct *up) if (!up) return; - local_irq_save(flags); + local_irq_save_nort(flags); if (atomic_dec_and_lock(&up->__count, &uidhash_lock)) free_user(up, flags); else - local_irq_restore(flags); + local_irq_restore_nort(flags); } struct user_struct *alloc_uid(struct user_namespace *ns, uid_t uid) Index: linux-2.6-tip/lib/radix-tree.c =================================================================== --- linux-2.6-tip.orig/lib/radix-tree.c +++ linux-2.6-tip/lib/radix-tree.c @@ -157,12 +157,14 @@ radix_tree_node_alloc(struct radix_tree_ * succeed in getting a node here (and never reach * kmem_cache_alloc) */ + rtp = &get_cpu_var(radix_tree_preloads); rtp = &__get_cpu_var(radix_tree_preloads); if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; rtp->nodes[rtp->nr - 1] = NULL; rtp->nr--; } + put_cpu_var(radix_tree_preloads); } if (ret == NULL) ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask); @@ -195,6 +197,8 @@ radix_tree_node_free(struct radix_tree_n call_rcu(&node->rcu_head, radix_tree_node_rcu_free); } +#ifndef CONFIG_PREEMPT_RT + /* * Load up this CPU's radix_tree_node buffer with sufficient objects to * ensure that the addition of a single element in the tree cannot fail. On @@ -227,6 +231,8 @@ out: } EXPORT_SYMBOL(radix_tree_preload); +#endif + /* * Return the maximum key which can be store into a * radix tree with height HEIGHT. Index: linux-2.6-tip/net/core/sock.c =================================================================== --- linux-2.6-tip.orig/net/core/sock.c +++ linux-2.6-tip/net/core/sock.c @@ -1948,8 +1948,9 @@ static DECLARE_BITMAP(proto_inuse_idx, P #ifdef CONFIG_NET_NS void sock_prot_inuse_add(struct net *net, struct proto *prot, int val) { - int cpu = smp_processor_id(); + int cpu = get_cpu(); per_cpu_ptr(net->core.inuse, cpu)->val[prot->inuse_idx] += val; + put_cpu(); } EXPORT_SYMBOL_GPL(sock_prot_inuse_add); @@ -1995,7 +1996,9 @@ static DEFINE_PER_CPU(struct prot_inuse, void sock_prot_inuse_add(struct net *net, struct proto *prot, int val) { - __get_cpu_var(prot_inuse).val[prot->inuse_idx] += val; + int cpu = get_cpu(); + per_cpu(prot_inuse, cpu).val[prot->inuse_idx] += val; + put_cpu(); } EXPORT_SYMBOL_GPL(sock_prot_inuse_add); Index: linux-2.6-tip/net/ipv4/proc.c =================================================================== --- linux-2.6-tip.orig/net/ipv4/proc.c +++ linux-2.6-tip/net/ipv4/proc.c @@ -54,8 +54,8 @@ static int sockstat_seq_show(struct seq_ int orphans, sockets; local_bh_disable(); - orphans = percpu_counter_sum_positive(&tcp_orphan_count), - sockets = percpu_counter_sum_positive(&tcp_sockets_allocated), + orphans = percpu_counter_sum_positive(&tcp_orphan_count); + sockets = percpu_counter_sum_positive(&tcp_sockets_allocated); local_bh_enable(); socket_seq_show(seq); Index: linux-2.6-tip/include/linux/netdevice.h =================================================================== --- linux-2.6-tip.orig/include/linux/netdevice.h +++ linux-2.6-tip/include/linux/netdevice.h @@ -439,7 +439,7 @@ struct netdev_queue { struct Qdisc *qdisc; unsigned long state; spinlock_t _xmit_lock; - int xmit_lock_owner; + void *xmit_lock_owner; struct Qdisc *qdisc_sleeping; } ____cacheline_aligned_in_smp; @@ -1624,35 +1624,43 @@ static inline void netif_rx_complete(str napi_complete(napi); } -static inline void __netif_tx_lock(struct netdev_queue *txq, int cpu) +static inline void __netif_tx_lock(struct netdev_queue *txq) { spin_lock(&txq->_xmit_lock); - txq->xmit_lock_owner = cpu; + txq->xmit_lock_owner = (void *)current; +} + +/* + * Do we hold the xmit_lock already? + */ +static inline int netif_tx_lock_recursion(struct netdev_queue *txq) +{ + return txq->xmit_lock_owner == (void *)current; } static inline void __netif_tx_lock_bh(struct netdev_queue *txq) { spin_lock_bh(&txq->_xmit_lock); - txq->xmit_lock_owner = smp_processor_id(); + txq->xmit_lock_owner = (void *)current; } static inline int __netif_tx_trylock(struct netdev_queue *txq) { int ok = spin_trylock(&txq->_xmit_lock); if (likely(ok)) - txq->xmit_lock_owner = smp_processor_id(); + txq->xmit_lock_owner = (void *)current; return ok; } static inline void __netif_tx_unlock(struct netdev_queue *txq) { - txq->xmit_lock_owner = -1; + txq->xmit_lock_owner = (void *)-1; spin_unlock(&txq->_xmit_lock); } static inline void __netif_tx_unlock_bh(struct netdev_queue *txq) { - txq->xmit_lock_owner = -1; + txq->xmit_lock_owner = (void *)-1; spin_unlock_bh(&txq->_xmit_lock); } @@ -1665,10 +1673,8 @@ static inline void __netif_tx_unlock_bh( static inline void netif_tx_lock(struct net_device *dev) { unsigned int i; - int cpu; spin_lock(&dev->tx_global_lock); - cpu = smp_processor_id(); for (i = 0; i < dev->num_tx_queues; i++) { struct netdev_queue *txq = netdev_get_tx_queue(dev, i); @@ -1678,7 +1684,7 @@ static inline void netif_tx_lock(struct * the ->hard_start_xmit() handler and already * checked the frozen bit. */ - __netif_tx_lock(txq, cpu); + __netif_tx_lock(txq); set_bit(__QUEUE_STATE_FROZEN, &txq->state); __netif_tx_unlock(txq); } @@ -1714,9 +1720,9 @@ static inline void netif_tx_unlock_bh(st local_bh_enable(); } -#define HARD_TX_LOCK(dev, txq, cpu) { \ +#define HARD_TX_LOCK(dev, txq) { \ if ((dev->features & NETIF_F_LLTX) == 0) { \ - __netif_tx_lock(txq, cpu); \ + __netif_tx_lock(txq); \ } \ } @@ -1729,14 +1735,12 @@ static inline void netif_tx_unlock_bh(st static inline void netif_tx_disable(struct net_device *dev) { unsigned int i; - int cpu; local_bh_disable(); - cpu = smp_processor_id(); for (i = 0; i < dev->num_tx_queues; i++) { struct netdev_queue *txq = netdev_get_tx_queue(dev, i); - __netif_tx_lock(txq, cpu); + __netif_tx_lock(txq); netif_tx_stop_queue(txq); __netif_tx_unlock(txq); } Index: linux-2.6-tip/include/net/dn_dev.h =================================================================== --- linux-2.6-tip.orig/include/net/dn_dev.h +++ linux-2.6-tip/include/net/dn_dev.h @@ -76,9 +76,9 @@ struct dn_dev_parms { int priority; /* Priority to be a router */ char *name; /* Name for sysctl */ int ctl_name; /* Index for sysctl */ - int (*up)(struct net_device *); - void (*down)(struct net_device *); - void (*timer3)(struct net_device *, struct dn_ifaddr *ifa); + int (*dn_up)(struct net_device *); + void (*dn_down)(struct net_device *); + void (*dn_timer3)(struct net_device *, struct dn_ifaddr *ifa); void *sysctl; }; Index: linux-2.6-tip/net/core/netpoll.c =================================================================== --- linux-2.6-tip.orig/net/core/netpoll.c +++ linux-2.6-tip/net/core/netpoll.c @@ -68,20 +68,20 @@ static void queue_process(struct work_st txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb)); - local_irq_save(flags); - __netif_tx_lock(txq, smp_processor_id()); + local_irq_save_nort(flags); + __netif_tx_lock(txq); if (netif_tx_queue_stopped(txq) || netif_tx_queue_frozen(txq) || ops->ndo_start_xmit(skb, dev) != NETDEV_TX_OK) { skb_queue_head(&npinfo->txq, skb); __netif_tx_unlock(txq); - local_irq_restore(flags); + local_irq_restore_nort(flags); schedule_delayed_work(&npinfo->tx_work, HZ/10); return; } __netif_tx_unlock(txq); - local_irq_restore(flags); + local_irq_restore_nort(flags); } } @@ -151,7 +151,7 @@ static void poll_napi(struct net_device int budget = 16; list_for_each_entry(napi, &dev->napi_list, dev_list) { - if (napi->poll_owner != smp_processor_id() && + if (napi->poll_owner != raw_smp_processor_id() && spin_trylock(&napi->poll_lock)) { budget = poll_one_napi(dev->npinfo, napi, budget); spin_unlock(&napi->poll_lock); @@ -208,30 +208,35 @@ static void refill_skbs(void) static void zap_completion_queue(void) { - unsigned long flags; struct softnet_data *sd = &get_cpu_var(softnet_data); + struct sk_buff *clist = NULL; + unsigned long flags; if (sd->completion_queue) { - struct sk_buff *clist; local_irq_save(flags); clist = sd->completion_queue; sd->completion_queue = NULL; local_irq_restore(flags); - - while (clist != NULL) { - struct sk_buff *skb = clist; - clist = clist->next; - if (skb->destructor) { - atomic_inc(&skb->users); - dev_kfree_skb_any(skb); /* put this one back */ - } else { - __kfree_skb(skb); - } - } } + + /* + * Took the list private, can drop our softnet + * reference: + */ put_cpu_var(softnet_data); + + while (clist != NULL) { + struct sk_buff *skb = clist; + clist = clist->next; + if (skb->destructor) { + atomic_inc(&skb->users); + dev_kfree_skb_any(skb); /* put this one back */ + } else { + __kfree_skb(skb); + } + } } static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve) @@ -239,13 +244,26 @@ static struct sk_buff *find_skb(struct n int count = 0; struct sk_buff *skb; +#ifdef CONFIG_PREEMPT_RT + /* + * On -rt skb_pool.lock is schedulable, so if we are + * in an atomic context we just try to dequeue from the + * pool and fail if we cannot get one. + */ + if (in_atomic() || irqs_disabled()) + goto pick_atomic; +#endif zap_completion_queue(); refill_skbs(); repeat: skb = alloc_skb(len, GFP_ATOMIC); - if (!skb) + if (!skb) { +#ifdef CONFIG_PREEMPT_RT +pick_atomic: +#endif skb = skb_dequeue(&skb_pool); + } if (!skb) { if (++count < 10) { @@ -265,7 +283,7 @@ static int netpoll_owner_active(struct n struct napi_struct *napi; list_for_each_entry(napi, &dev->napi_list, dev_list) { - if (napi->poll_owner == smp_processor_id()) + if (napi->poll_owner == raw_smp_processor_id()) return 1; } return 0; @@ -291,7 +309,7 @@ static void netpoll_send_skb(struct netp txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb)); - local_irq_save(flags); + local_irq_save_nort(flags); /* try until next clock tick */ for (tries = jiffies_to_usecs(1)/USEC_PER_POLL; tries > 0; --tries) { @@ -310,7 +328,7 @@ static void netpoll_send_skb(struct netp udelay(USEC_PER_POLL); } - local_irq_restore(flags); + local_irq_restore_nort(flags); } if (status != NETDEV_TX_OK) { @@ -731,7 +749,7 @@ int netpoll_setup(struct netpoll *np) np->name); break; } - cond_resched(); + schedule_timeout_uninterruptible(1); } /* If carrier appears to come up instantly, we don't Index: linux-2.6-tip/net/decnet/dn_dev.c =================================================================== --- linux-2.6-tip.orig/net/decnet/dn_dev.c +++ linux-2.6-tip/net/decnet/dn_dev.c @@ -90,9 +90,9 @@ static struct dn_dev_parms dn_dev_list[] .t3 = 10, .name = "ethernet", .ctl_name = NET_DECNET_CONF_ETHER, - .up = dn_eth_up, - .down = dn_eth_down, - .timer3 = dn_send_brd_hello, + .dn_up = dn_eth_up, + .dn_down = dn_eth_down, + .dn_timer3 = dn_send_brd_hello, }, { .type = ARPHRD_IPGRE, /* DECnet tunneled over GRE in IP */ @@ -102,7 +102,7 @@ static struct dn_dev_parms dn_dev_list[] .t3 = 10, .name = "ipgre", .ctl_name = NET_DECNET_CONF_GRE, - .timer3 = dn_send_brd_hello, + .dn_timer3 = dn_send_brd_hello, }, #if 0 { @@ -113,7 +113,7 @@ static struct dn_dev_parms dn_dev_list[] .t3 = 120, .name = "x25", .ctl_name = NET_DECNET_CONF_X25, - .timer3 = dn_send_ptp_hello, + .dn_timer3 = dn_send_ptp_hello, }, #endif #if 0 @@ -125,7 +125,7 @@ static struct dn_dev_parms dn_dev_list[] .t3 = 10, .name = "ppp", .ctl_name = NET_DECNET_CONF_PPP, - .timer3 = dn_send_brd_hello, + .dn_timer3 = dn_send_brd_hello, }, #endif { @@ -136,7 +136,7 @@ static struct dn_dev_parms dn_dev_list[] .t3 = 120, .name = "ddcmp", .ctl_name = NET_DECNET_CONF_DDCMP, - .timer3 = dn_send_ptp_hello, + .dn_timer3 = dn_send_ptp_hello, }, { .type = ARPHRD_LOOPBACK, /* Loopback interface - always last */ @@ -146,7 +146,7 @@ static struct dn_dev_parms dn_dev_list[] .t3 = 10, .name = "loopback", .ctl_name = NET_DECNET_CONF_LOOPBACK, - .timer3 = dn_send_brd_hello, + .dn_timer3 = dn_send_brd_hello, } }; @@ -305,11 +305,11 @@ static int dn_forwarding_proc(ctl_table */ tmp = dn_db->parms.forwarding; dn_db->parms.forwarding = old; - if (dn_db->parms.down) - dn_db->parms.down(dev); + if (dn_db->parms.dn_down) + dn_db->parms.dn_down(dev); dn_db->parms.forwarding = tmp; - if (dn_db->parms.up) - dn_db->parms.up(dev); + if (dn_db->parms.dn_up) + dn_db->parms.dn_up(dev); } return err; @@ -343,11 +343,11 @@ static int dn_forwarding_sysctl(ctl_tabl if (value > 2) return -EINVAL; - if (dn_db->parms.down) - dn_db->parms.down(dev); + if (dn_db->parms.dn_down) + dn_db->parms.dn_down(dev); dn_db->parms.forwarding = value; - if (dn_db->parms.up) - dn_db->parms.up(dev); + if (dn_db->parms.dn_up) + dn_db->parms.dn_up(dev); } return 0; @@ -1078,10 +1078,10 @@ static void dn_dev_timer_func(unsigned l struct dn_ifaddr *ifa; if (dn_db->t3 <= dn_db->parms.t2) { - if (dn_db->parms.timer3) { + if (dn_db->parms.dn_timer3) { for(ifa = dn_db->ifa_list; ifa; ifa = ifa->ifa_next) { if (!(ifa->ifa_flags & IFA_F_SECONDARY)) - dn_db->parms.timer3(dev, ifa); + dn_db->parms.dn_timer3(dev, ifa); } } dn_db->t3 = dn_db->parms.t3; @@ -1140,8 +1140,8 @@ static struct dn_dev *dn_dev_create(stru return NULL; } - if (dn_db->parms.up) { - if (dn_db->parms.up(dev) < 0) { + if (dn_db->parms.dn_up) { + if (dn_db->parms.dn_up(dev) < 0) { neigh_parms_release(&dn_neigh_table, dn_db->neigh_parms); dev->dn_ptr = NULL; kfree(dn_db); @@ -1235,8 +1235,8 @@ static void dn_dev_delete(struct net_dev dn_dev_check_default(dev); neigh_ifdown(&dn_neigh_table, dev); - if (dn_db->parms.down) - dn_db->parms.down(dev); + if (dn_db->parms.dn_down) + dn_db->parms.dn_down(dev); dev->dn_ptr = NULL; Index: linux-2.6-tip/net/ipv4/icmp.c =================================================================== --- linux-2.6-tip.orig/net/ipv4/icmp.c +++ linux-2.6-tip/net/ipv4/icmp.c @@ -201,7 +201,10 @@ static const struct icmp_control icmp_po */ static struct sock *icmp_sk(struct net *net) { - return net->ipv4.icmp_sk[smp_processor_id()]; + /* + * Should be safe on PREEMPT_SOFTIRQS/HARDIRQS to use raw-smp-processor-id: + */ + return net->ipv4.icmp_sk[raw_smp_processor_id()]; } static inline struct sock *icmp_xmit_lock(struct net *net) Index: linux-2.6-tip/net/ipv6/netfilter/ip6_tables.c =================================================================== --- linux-2.6-tip.orig/net/ipv6/netfilter/ip6_tables.c +++ linux-2.6-tip/net/ipv6/netfilter/ip6_tables.c @@ -376,7 +376,7 @@ ip6t_do_table(struct sk_buff *skb, read_lock_bh(&table->lock); IP_NF_ASSERT(table->valid_hooks & (1 << hook)); private = table->private; - table_base = (void *)private->entries[smp_processor_id()]; + table_base = (void *)private->entries[raw_smp_processor_id()]; e = get_entry(table_base, private->hook_entry[hook]); /* For return from builtin chain */ Index: linux-2.6-tip/net/sched/sch_generic.c =================================================================== --- linux-2.6-tip.orig/net/sched/sch_generic.c +++ linux-2.6-tip/net/sched/sch_generic.c @@ -12,6 +12,7 @@ */ #include +#include #include #include #include @@ -24,6 +25,7 @@ #include #include #include +#include #include /* Main transmission queue. */ @@ -78,7 +80,7 @@ static inline int handle_dev_cpu_collisi { int ret; - if (unlikely(dev_queue->xmit_lock_owner == smp_processor_id())) { + if (unlikely(netif_tx_lock_recursion(dev_queue))) { /* * Same CPU holding the lock. It may be a transient * configuration error, when hard_start_xmit() recurses. We @@ -95,7 +97,9 @@ static inline int handle_dev_cpu_collisi * Another cpu is holding lock, requeue & delay xmits for * some time. */ + preempt_disable(); /* FIXME: we need an _rt version of this */ __get_cpu_var(netdev_rx_stat).cpu_collision++; + preempt_enable(); ret = dev_requeue_skb(skb, q); } @@ -141,7 +145,7 @@ static inline int qdisc_restart(struct Q dev = qdisc_dev(q); txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb)); - HARD_TX_LOCK(dev, txq, smp_processor_id()); + HARD_TX_LOCK(dev, txq); if (!netif_tx_queue_stopped(txq) && !netif_tx_queue_frozen(txq)) ret = dev_hard_start_xmit(skb, dev, txq); @@ -691,9 +695,12 @@ void dev_deactivate(struct net_device *d /* Wait for outstanding qdisc-less dev_queue_xmit calls. */ synchronize_rcu(); - /* Wait for outstanding qdisc_run calls. */ + /* + * Wait for outstanding qdisc_run calls. + * TODO: shouldnt this be wakeup-based, instead of polling it? + */ while (some_qdisc_is_busy(dev)) - yield(); + msleep(1); } static void dev_init_scheduler_queue(struct net_device *dev, Index: linux-2.6-tip/drivers/net/bnx2.c =================================================================== --- linux-2.6-tip.orig/drivers/net/bnx2.c +++ linux-2.6-tip/drivers/net/bnx2.c @@ -2661,7 +2661,7 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2 if (unlikely(netif_tx_queue_stopped(txq)) && (bnx2_tx_avail(bp, txr) > bp->tx_wake_thresh)) { - __netif_tx_lock(txq, smp_processor_id()); + __netif_tx_lock(txq); if ((netif_tx_queue_stopped(txq)) && (bnx2_tx_avail(bp, txr) > bp->tx_wake_thresh)) netif_tx_wake_queue(txq); Index: linux-2.6-tip/drivers/net/mv643xx_eth.c =================================================================== --- linux-2.6-tip.orig/drivers/net/mv643xx_eth.c +++ linux-2.6-tip/drivers/net/mv643xx_eth.c @@ -484,7 +484,7 @@ static void txq_maybe_wake(struct tx_que struct netdev_queue *nq = netdev_get_tx_queue(mp->dev, txq->index); if (netif_tx_queue_stopped(nq)) { - __netif_tx_lock(nq, smp_processor_id()); + __netif_tx_lock(nq); if (txq->tx_ring_size - txq->tx_desc_count >= MAX_SKB_FRAGS + 1) netif_tx_wake_queue(nq); __netif_tx_unlock(nq); @@ -838,7 +838,7 @@ static void txq_kick(struct tx_queue *tx u32 hw_desc_ptr; u32 expected_ptr; - __netif_tx_lock(nq, smp_processor_id()); + __netif_tx_lock(nq); if (rdlp(mp, TXQ_COMMAND) & (1 << txq->index)) goto out; @@ -862,7 +862,7 @@ static int txq_reclaim(struct tx_queue * struct netdev_queue *nq = netdev_get_tx_queue(mp->dev, txq->index); int reclaimed; - __netif_tx_lock(nq, smp_processor_id()); + __netif_tx_lock(nq); reclaimed = 0; while (reclaimed < budget && txq->tx_desc_count > 0) { Index: linux-2.6-tip/drivers/net/niu.c =================================================================== --- linux-2.6-tip.orig/drivers/net/niu.c +++ linux-2.6-tip/drivers/net/niu.c @@ -3519,7 +3519,7 @@ static void niu_tx_work(struct niu *np, out: if (unlikely(netif_tx_queue_stopped(txq) && (niu_tx_avail(rp) > NIU_TX_WAKEUP_THRESH(rp)))) { - __netif_tx_lock(txq, smp_processor_id()); + __netif_tx_lock(txq); if (netif_tx_queue_stopped(txq) && (niu_tx_avail(rp) > NIU_TX_WAKEUP_THRESH(rp))) netif_tx_wake_queue(txq); Index: linux-2.6-tip/block/blk-core.c =================================================================== --- linux-2.6-tip.orig/block/blk-core.c +++ linux-2.6-tip/block/blk-core.c @@ -212,7 +212,7 @@ EXPORT_SYMBOL(blk_dump_rq_flags); */ void blk_plug_device(struct request_queue *q) { - WARN_ON(!irqs_disabled()); + WARN_ON_NONRT(!irqs_disabled()); /* * don't plug a stopped queue, it must be paired with blk_start_queue() @@ -252,7 +252,7 @@ EXPORT_SYMBOL(blk_plug_device_unlocked); */ int blk_remove_plug(struct request_queue *q) { - WARN_ON(!irqs_disabled()); + WARN_ON_NONRT(!irqs_disabled()); if (!queue_flag_test_and_clear(QUEUE_FLAG_PLUGGED, q)) return 0; @@ -362,7 +362,7 @@ static void blk_invoke_request_fn(struct **/ void blk_start_queue(struct request_queue *q) { - WARN_ON(!irqs_disabled()); + WARN_ON_NONRT(!irqs_disabled()); queue_flag_clear(QUEUE_FLAG_STOPPED, q); blk_invoke_request_fn(q); Index: linux-2.6-tip/fs/aio.c =================================================================== --- linux-2.6-tip.orig/fs/aio.c +++ linux-2.6-tip/fs/aio.c @@ -605,9 +605,11 @@ static void use_mm(struct mm_struct *mm) task_lock(tsk); active_mm = tsk->active_mm; atomic_inc(&mm->mm_count); + local_irq_disable(); // FIXME + switch_mm(active_mm, mm, tsk); tsk->mm = mm; tsk->active_mm = mm; - switch_mm(active_mm, mm, tsk); + local_irq_enable(); task_unlock(tsk); mmdrop(active_mm); Index: linux-2.6-tip/fs/file.c =================================================================== --- linux-2.6-tip.orig/fs/file.c +++ linux-2.6-tip/fs/file.c @@ -102,14 +102,15 @@ void free_fdtable_rcu(struct rcu_head *r kfree(fdt->open_fds); kfree(fdt); } else { - fddef = &get_cpu_var(fdtable_defer_list); + + fddef = &per_cpu(fdtable_defer_list, raw_smp_processor_id()); + spin_lock(&fddef->lock); fdt->next = fddef->next; fddef->next = fdt; /* vmallocs are handled from the workqueue context */ schedule_work(&fddef->wq); spin_unlock(&fddef->lock); - put_cpu_var(fdtable_defer_list); } } Index: linux-2.6-tip/fs/notify/dnotify/dnotify.c =================================================================== --- linux-2.6-tip.orig/fs/notify/dnotify/dnotify.c +++ linux-2.6-tip/fs/notify/dnotify/dnotify.c @@ -170,7 +170,7 @@ void dnotify_parent(struct dentry *dentr spin_lock(&dentry->d_lock); parent = dentry->d_parent; - if (parent->d_inode->i_dnotify_mask & event) { + if (unlikely(parent->d_inode->i_dnotify_mask & event)) { dget(parent); spin_unlock(&dentry->d_lock); __inode_dir_notify(parent->d_inode, event); Index: linux-2.6-tip/fs/pipe.c =================================================================== --- linux-2.6-tip.orig/fs/pipe.c +++ linux-2.6-tip/fs/pipe.c @@ -386,8 +386,14 @@ redo: wake_up_interruptible_sync(&pipe->wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } + /* + * Hack: we turn off atime updates for -RT kernels. + * Who uses them on pipes anyway? + */ +#ifndef CONFIG_PREEMPT_RT if (ret > 0) file_accessed(filp); +#endif return ret; } @@ -559,8 +565,14 @@ out: wake_up_interruptible_sync(&pipe->wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } + /* + * Hack: we turn off atime updates for -RT kernels. + * Who uses them on pipes anyway? + */ +#ifndef CONFIG_PREEMPT_RT if (ret > 0) file_update_time(filp); +#endif return ret; } Index: linux-2.6-tip/fs/proc/task_mmu.c =================================================================== --- linux-2.6-tip.orig/fs/proc/task_mmu.c +++ linux-2.6-tip/fs/proc/task_mmu.c @@ -137,8 +137,10 @@ static void *m_start(struct seq_file *m, vma = NULL; if ((unsigned long)l < mm->map_count) { vma = mm->mmap; - while (l-- && vma) + while (l-- && vma) { vma = vma->vm_next; + cond_resched(); + } goto out; } Index: linux-2.6-tip/fs/xfs/linux-2.6/mrlock.h =================================================================== --- linux-2.6-tip.orig/fs/xfs/linux-2.6/mrlock.h +++ linux-2.6-tip/fs/xfs/linux-2.6/mrlock.h @@ -21,7 +21,7 @@ #include typedef struct { - struct rw_semaphore mr_lock; + struct compat_rw_semaphore mr_lock; #ifdef DEBUG int mr_writer; #endif Index: linux-2.6-tip/fs/xfs/xfs_mount.h =================================================================== --- linux-2.6-tip.orig/fs/xfs/xfs_mount.h +++ linux-2.6-tip/fs/xfs/xfs_mount.h @@ -275,7 +275,7 @@ typedef struct xfs_mount { uint m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */ uint m_in_maxlevels; /* XFS_IN_MAXLEVELS */ struct xfs_perag *m_perag; /* per-ag accounting info */ - struct rw_semaphore m_peraglock; /* lock for m_perag (pointer) */ + struct compat_rw_semaphore m_peraglock; /* lock for m_perag (pointer) */ struct mutex m_growlock; /* growfs mutex */ int m_fixedfsid[2]; /* unchanged for life of FS */ uint m_dmevmask; /* DMI events for this FS */ Index: linux-2.6-tip/drivers/acpi/acpica/acglobal.h =================================================================== --- linux-2.6-tip.orig/drivers/acpi/acpica/acglobal.h +++ linux-2.6-tip/drivers/acpi/acpica/acglobal.h @@ -190,7 +190,12 @@ ACPI_EXTERN u8 acpi_gbl_global_lock_pres * interrupt level */ ACPI_EXTERN spinlock_t _acpi_gbl_gpe_lock; /* For GPE data structs and registers */ -ACPI_EXTERN spinlock_t _acpi_gbl_hardware_lock; /* For ACPI H/W except GPE registers */ + +/* + * Need to be raw because it might be used in acpi_processor_idle(): + */ +ACPI_EXTERN raw_spinlock_t _acpi_gbl_hardware_lock; /* For ACPI H/W except GPE registers */ + #define acpi_gbl_gpe_lock &_acpi_gbl_gpe_lock #define acpi_gbl_hardware_lock &_acpi_gbl_hardware_lock Index: linux-2.6-tip/drivers/acpi/acpica/hwregs.c =================================================================== --- linux-2.6-tip.orig/drivers/acpi/acpica/hwregs.c +++ linux-2.6-tip/drivers/acpi/acpica/hwregs.c @@ -74,7 +74,7 @@ acpi_status acpi_hw_clear_acpi_status(vo ACPI_BITMASK_ALL_FIXED_STATUS, (u16) acpi_gbl_FADT.xpm1a_event_block.address)); - lock_flags = acpi_os_acquire_lock(acpi_gbl_hardware_lock); + spin_lock_irqsave(acpi_gbl_hardware_lock, lock_flags); status = acpi_hw_register_write(ACPI_REGISTER_PM1_STATUS, ACPI_BITMASK_ALL_FIXED_STATUS); @@ -97,7 +97,7 @@ acpi_status acpi_hw_clear_acpi_status(vo status = acpi_ev_walk_gpe_list(acpi_hw_clear_gpe_block, NULL); unlock_and_exit: - acpi_os_release_lock(acpi_gbl_hardware_lock, lock_flags); + spin_unlock_irqrestore(acpi_gbl_hardware_lock, lock_flags); return_ACPI_STATUS(status); } Index: linux-2.6-tip/drivers/acpi/acpica/hwxface.c =================================================================== --- linux-2.6-tip.orig/drivers/acpi/acpica/hwxface.c +++ linux-2.6-tip/drivers/acpi/acpica/hwxface.c @@ -313,9 +313,9 @@ acpi_status acpi_get_register(u32 regist acpi_status status; acpi_cpu_flags flags; - flags = acpi_os_acquire_lock(acpi_gbl_hardware_lock); + spin_lock_irqsave(acpi_gbl_hardware_lock, flags); status = acpi_get_register_unlocked(register_id, return_value); - acpi_os_release_lock(acpi_gbl_hardware_lock, flags); + spin_unlock_irqrestore(acpi_gbl_hardware_lock, flags); return (status); } @@ -353,7 +353,7 @@ acpi_status acpi_set_register(u32 regist return_ACPI_STATUS(AE_BAD_PARAMETER); } - lock_flags = acpi_os_acquire_lock(acpi_gbl_hardware_lock); + spin_lock_irqsave(acpi_gbl_hardware_lock, lock_flags); /* Always do a register read first so we can insert the new bits */ @@ -458,7 +458,7 @@ acpi_status acpi_set_register(u32 regist unlock_and_exit: - acpi_os_release_lock(acpi_gbl_hardware_lock, lock_flags); + spin_unlock_irqrestore(acpi_gbl_hardware_lock, lock_flags); /* Normalize the value that was read */ Index: linux-2.6-tip/drivers/acpi/acpica/utmutex.c =================================================================== --- linux-2.6-tip.orig/drivers/acpi/acpica/utmutex.c +++ linux-2.6-tip/drivers/acpi/acpica/utmutex.c @@ -117,7 +117,7 @@ void acpi_ut_mutex_terminate(void) /* Delete the spinlocks */ acpi_os_delete_lock(acpi_gbl_gpe_lock); - acpi_os_delete_lock(acpi_gbl_hardware_lock); +// acpi_os_delete_lock(acpi_gbl_hardware_lock); return_VOID; } Index: linux-2.6-tip/drivers/acpi/ec.c =================================================================== --- linux-2.6-tip.orig/drivers/acpi/ec.c +++ linux-2.6-tip/drivers/acpi/ec.c @@ -563,8 +563,21 @@ static u32 acpi_ec_gpe_handler(void *dat if (test_bit(EC_FLAGS_GPE_MODE, &ec->flags)) { gpe_transaction(ec, status); if (ec_transaction_done(ec) && - (status & ACPI_EC_FLAG_IBF) == 0) + (status & ACPI_EC_FLAG_IBF) == 0) { +#if 0 wake_up(&ec->wait); +#else + // hack ... + if (waitqueue_active(&ec->wait)) { + struct task_struct *task; + + task = list_entry(ec->wait.task_list.next, + wait_queue_t, task_list)->private; + if (task) + wake_up_process(task); + } +#endif + } } ec_check_sci(ec, status); Index: linux-2.6-tip/ipc/mqueue.c =================================================================== --- linux-2.6-tip.orig/ipc/mqueue.c +++ linux-2.6-tip/ipc/mqueue.c @@ -787,12 +787,17 @@ static inline void pipelined_send(struct struct msg_msg *message, struct ext_wait_queue *receiver) { + /* + * Keep them in one critical section for PREEMPT_RT: + */ + preempt_disable(); receiver->msg = message; list_del(&receiver->list); receiver->state = STATE_PENDING; wake_up_process(receiver->task); smp_wmb(); receiver->state = STATE_READY; + preempt_enable(); } /* pipelined_receive() - if there is task waiting in sys_mq_timedsend() Index: linux-2.6-tip/ipc/msg.c =================================================================== --- linux-2.6-tip.orig/ipc/msg.c +++ linux-2.6-tip/ipc/msg.c @@ -259,12 +259,19 @@ static void expunge_all(struct msg_queue while (tmp != &msq->q_receivers) { struct msg_receiver *msr; + /* + * Make sure that the wakeup doesnt preempt + * this CPU prematurely. (on PREEMPT_RT) + */ + preempt_disable(); + msr = list_entry(tmp, struct msg_receiver, r_list); tmp = tmp->next; msr->r_msg = NULL; - wake_up_process(msr->r_tsk); - smp_mb(); + wake_up_process(msr->r_tsk); /* serializes */ msr->r_msg = ERR_PTR(res); + + preempt_enable(); } } @@ -611,22 +618,28 @@ static inline int pipelined_send(struct !security_msg_queue_msgrcv(msq, msg, msr->r_tsk, msr->r_msgtype, msr->r_mode)) { + /* + * Make sure that the wakeup doesnt preempt + * this CPU prematurely. (on PREEMPT_RT) + */ + preempt_disable(); + list_del(&msr->r_list); if (msr->r_maxsize < msg->m_ts) { msr->r_msg = NULL; - wake_up_process(msr->r_tsk); - smp_mb(); + wake_up_process(msr->r_tsk); /* serializes */ msr->r_msg = ERR_PTR(-E2BIG); } else { msr->r_msg = NULL; msq->q_lrpid = task_pid_vnr(msr->r_tsk); msq->q_rtime = get_seconds(); - wake_up_process(msr->r_tsk); - smp_mb(); + wake_up_process(msr->r_tsk); /* serializes */ msr->r_msg = msg; + preempt_enable(); return 1; } + preempt_enable(); } } return 0; Index: linux-2.6-tip/ipc/sem.c =================================================================== --- linux-2.6-tip.orig/ipc/sem.c +++ linux-2.6-tip/ipc/sem.c @@ -415,6 +415,11 @@ static void update_queue (struct sem_arr struct sem_queue *n; /* + * make sure that the wakeup doesnt preempt + * _this_ cpu prematurely. (on preempt_rt) + */ + preempt_disable(); + /* * Continue scanning. The next operation * that must be checked depends on the type of the * completed operation: @@ -450,6 +455,7 @@ static void update_queue (struct sem_arr */ smp_wmb(); q->status = error; + preempt_enable(); q = n; } else { q = list_entry(q->list.next, struct sem_queue, list); Index: linux-2.6-tip/include/linux/pagevec.h =================================================================== --- linux-2.6-tip.orig/include/linux/pagevec.h +++ linux-2.6-tip/include/linux/pagevec.h @@ -9,7 +9,7 @@ #define _LINUX_PAGEVEC_H /* 14 pointers + two long's align the pagevec structure to a power of two */ -#define PAGEVEC_SIZE 14 +#define PAGEVEC_SIZE 8 struct page; struct address_space; Index: linux-2.6-tip/include/linux/vmstat.h =================================================================== --- linux-2.6-tip.orig/include/linux/vmstat.h +++ linux-2.6-tip/include/linux/vmstat.h @@ -75,7 +75,12 @@ DECLARE_PER_CPU(struct vm_event_state, v static inline void __count_vm_event(enum vm_event_item item) { +#ifdef CONFIG_PREEMPT_RT + get_cpu_var(vm_event_states).event[item]++; + put_cpu(); +#else __get_cpu_var(vm_event_states).event[item]++; +#endif } static inline void count_vm_event(enum vm_event_item item) @@ -86,7 +91,12 @@ static inline void count_vm_event(enum v static inline void __count_vm_events(enum vm_event_item item, long delta) { +#ifdef CONFIG_PREEMPT_RT + get_cpu_var(vm_event_states).event[item] += delta; + put_cpu(); +#else __get_cpu_var(vm_event_states).event[item] += delta; +#endif } static inline void count_vm_events(enum vm_event_item item, long delta) Index: linux-2.6-tip/mm/bounce.c =================================================================== --- linux-2.6-tip.orig/mm/bounce.c +++ linux-2.6-tip/mm/bounce.c @@ -51,11 +51,11 @@ static void bounce_copy_vec(struct bio_v unsigned long flags; unsigned char *vto; - local_irq_save(flags); + local_irq_save_nort(flags); vto = kmap_atomic(to->bv_page, KM_BOUNCE_READ); memcpy(vto + to->bv_offset, vfrom, to->bv_len); kunmap_atomic(vto, KM_BOUNCE_READ); - local_irq_restore(flags); + local_irq_restore_nort(flags); } #else /* CONFIG_HIGHMEM */ Index: linux-2.6-tip/mm/memory.c =================================================================== --- linux-2.6-tip.orig/mm/memory.c +++ linux-2.6-tip/mm/memory.c @@ -922,10 +922,13 @@ static unsigned long unmap_page_range(st return addr; } -#ifdef CONFIG_PREEMPT +#if defined(CONFIG_PREEMPT) && !defined(CONFIG_PREEMPT_RT) # define ZAP_BLOCK_SIZE (8 * PAGE_SIZE) #else -/* No preempt: go for improved straight-line efficiency */ +/* + * No preempt: go for improved straight-line efficiency + * on PREEMPT_RT this is not a critical latency-path. + */ # define ZAP_BLOCK_SIZE (1024 * PAGE_SIZE) #endif @@ -2838,6 +2841,28 @@ unlock: return 0; } +void pagefault_disable(void) +{ + current->pagefault_disabled++; + /* + * make sure to have issued the store before a pagefault + * can hit. + */ + barrier(); +} +EXPORT_SYMBOL(pagefault_disable); + +void pagefault_enable(void) +{ + /* + * make sure to issue those last loads/stores before enabling + * the pagefault handler again. + */ + barrier(); + current->pagefault_disabled--; +} +EXPORT_SYMBOL(pagefault_enable); + /* * By the time we get here, we already hold the mm semaphore */ Index: linux-2.6-tip/mm/mmap.c =================================================================== --- linux-2.6-tip.orig/mm/mmap.c +++ linux-2.6-tip/mm/mmap.c @@ -1961,10 +1961,16 @@ SYSCALL_DEFINE2(munmap, unsigned long, a static inline void verify_mm_writelocked(struct mm_struct *mm) { #ifdef CONFIG_DEBUG_VM - if (unlikely(down_read_trylock(&mm->mmap_sem))) { +# ifdef CONFIG_PREEMPT_RT + if (unlikely(!rt_rwsem_is_locked(&mm->mmap_sem))) { WARN_ON(1); - up_read(&mm->mmap_sem); } +# else + if (unlikely(down_read_trylock(&mm->mmap_sem))) { + WARN_ON(1); + up_read(&mm->mmap_sem); + } +# endif #endif } Index: linux-2.6-tip/mm/vmstat.c =================================================================== --- linux-2.6-tip.orig/mm/vmstat.c +++ linux-2.6-tip/mm/vmstat.c @@ -153,10 +153,14 @@ static void refresh_zone_stat_thresholds void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item, int delta) { - struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id()); - s8 *p = pcp->vm_stat_diff + item; + struct per_cpu_pageset *pcp; + int cpu; long x; + s8 *p; + cpu = get_cpu(); + pcp = zone_pcp(zone, cpu); + p = pcp->vm_stat_diff + item; x = delta + *p; if (unlikely(x > pcp->stat_threshold || x < -pcp->stat_threshold)) { @@ -164,6 +168,7 @@ void __mod_zone_page_state(struct zone * x = 0; } *p = x; + put_cpu(); } EXPORT_SYMBOL(__mod_zone_page_state); @@ -206,9 +211,13 @@ EXPORT_SYMBOL(mod_zone_page_state); */ void __inc_zone_state(struct zone *zone, enum zone_stat_item item) { - struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id()); - s8 *p = pcp->vm_stat_diff + item; + struct per_cpu_pageset *pcp; + int cpu; + s8 *p; + cpu = get_cpu(); + pcp = zone_pcp(zone, cpu); + p = pcp->vm_stat_diff + item; (*p)++; if (unlikely(*p > pcp->stat_threshold)) { @@ -217,18 +226,34 @@ void __inc_zone_state(struct zone *zone, zone_page_state_add(*p + overstep, zone, item); *p = -overstep; } + put_cpu(); } void __inc_zone_page_state(struct page *page, enum zone_stat_item item) { +#ifdef CONFIG_PREEMPT_RT + unsigned long flags; + struct zone *zone; + + zone = page_zone(page); + local_irq_save(flags); + __inc_zone_state(zone, item); + local_irq_restore(flags); +#else __inc_zone_state(page_zone(page), item); +#endif } EXPORT_SYMBOL(__inc_zone_page_state); void __dec_zone_state(struct zone *zone, enum zone_stat_item item) { - struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id()); - s8 *p = pcp->vm_stat_diff + item; + struct per_cpu_pageset *pcp; + int cpu; + s8 *p; + + cpu = get_cpu(); + pcp = zone_pcp(zone, cpu); + p = pcp->vm_stat_diff + item; (*p)--; @@ -238,6 +263,7 @@ void __dec_zone_state(struct zone *zone, zone_page_state_add(*p - overstep, zone, item); *p = overstep; } + put_cpu(); } void __dec_zone_page_state(struct page *page, enum zone_stat_item item) Index: linux-2.6-tip/drivers/block/paride/pseudo.h =================================================================== --- linux-2.6-tip.orig/drivers/block/paride/pseudo.h +++ linux-2.6-tip/drivers/block/paride/pseudo.h @@ -43,7 +43,7 @@ static unsigned long ps_timeout; static int ps_tq_active = 0; static int ps_nice = 0; -static DEFINE_SPINLOCK(ps_spinlock __attribute__((unused))); +static __attribute__((unused)) DEFINE_SPINLOCK(ps_spinlock); static DECLARE_DELAYED_WORK(ps_tq, ps_tq_int); Index: linux-2.6-tip/drivers/video/console/fbcon.c =================================================================== --- linux-2.6-tip.orig/drivers/video/console/fbcon.c +++ linux-2.6-tip/drivers/video/console/fbcon.c @@ -1203,7 +1203,6 @@ static void fbcon_clear(struct vc_data * { struct fb_info *info = registered_fb[con2fb_map[vc->vc_num]]; struct fbcon_ops *ops = info->fbcon_par; - struct display *p = &fb_display[vc->vc_num]; u_int y_break; @@ -1235,10 +1234,11 @@ static void fbcon_putcs(struct vc_data * struct display *p = &fb_display[vc->vc_num]; struct fbcon_ops *ops = info->fbcon_par; - if (!fbcon_is_inactive(vc, info)) + if (!fbcon_is_inactive(vc, info)) { ops->putcs(vc, info, s, count, real_y(p, ypos), xpos, get_color(vc, info, scr_readw(s), 1), get_color(vc, info, scr_readw(s), 0)); + } } static void fbcon_putc(struct vc_data *vc, int c, int ypos, int xpos) @@ -3221,6 +3221,7 @@ static const struct consw fb_con = { .con_screen_pos = fbcon_screen_pos, .con_getxy = fbcon_getxy, .con_resize = fbcon_resize, + .con_preemptible = 1, }; static struct notifier_block fbcon_event_notifier = { Index: linux-2.6-tip/include/linux/console.h =================================================================== --- linux-2.6-tip.orig/include/linux/console.h +++ linux-2.6-tip/include/linux/console.h @@ -55,6 +55,7 @@ struct consw { void (*con_invert_region)(struct vc_data *, u16 *, int); u16 *(*con_screen_pos)(struct vc_data *, int); unsigned long (*con_getxy)(struct vc_data *, unsigned long, int *, int *); + int con_preemptible; // can it reschedule from within printk? }; extern const struct consw *conswitchp; @@ -92,6 +93,17 @@ void give_up_console(const struct consw #define CON_BOOT (8) #define CON_ANYTIME (16) /* Safe to call when cpu is offline */ #define CON_BRL (32) /* Used for a braille device */ +#define CON_ATOMIC (64) /* Safe to call in PREEMPT_RT atomic */ + +#ifdef CONFIG_PREEMPT_RT +# define console_atomic_safe(con) \ + (((con)->flags & CON_ATOMIC) || \ + (!in_atomic() && !irqs_disabled()) || \ + (system_state != SYSTEM_RUNNING) || \ + oops_in_progress) +#else +# define console_atomic_safe(con) (1) +#endif struct console { char name[16]; Index: linux-2.6-tip/drivers/ide/alim15x3.c =================================================================== --- linux-2.6-tip.orig/drivers/ide/alim15x3.c +++ linux-2.6-tip/drivers/ide/alim15x3.c @@ -90,7 +90,7 @@ static void ali_set_pio_mode(ide_drive_t if (r_clc >= 16) r_clc = 0; } - local_irq_save(flags); + local_irq_save_nort(flags); /* * PIO mode => ATA FIFO on, ATAPI FIFO off @@ -112,7 +112,7 @@ static void ali_set_pio_mode(ide_drive_t pci_write_config_byte(dev, port, s_clc); pci_write_config_byte(dev, port + unit + 2, (a_clc << 4) | r_clc); - local_irq_restore(flags); + local_irq_restore_nort(flags); } /** @@ -222,7 +222,7 @@ static unsigned int init_chipset_ali15x3 isa_dev = pci_get_device(PCI_VENDOR_ID_AL, PCI_DEVICE_ID_AL_M1533, NULL); - local_irq_save(flags); + local_irq_save_nort(flags); if (m5229_revision < 0xC2) { /* @@ -313,7 +313,7 @@ out: } pci_dev_put(north); pci_dev_put(isa_dev); - local_irq_restore(flags); + local_irq_restore_nort(flags); return 0; } @@ -375,7 +375,7 @@ static u8 ali_cable_detect(ide_hwif_t *h unsigned long flags; u8 cbl = ATA_CBL_PATA40, tmpbyte; - local_irq_save(flags); + local_irq_save_nort(flags); if (m5229_revision >= 0xC2) { /* @@ -396,7 +396,7 @@ static u8 ali_cable_detect(ide_hwif_t *h } } - local_irq_restore(flags); + local_irq_restore_nort(flags); return cbl; } Index: linux-2.6-tip/drivers/ide/hpt366.c =================================================================== --- linux-2.6-tip.orig/drivers/ide/hpt366.c +++ linux-2.6-tip/drivers/ide/hpt366.c @@ -1328,7 +1328,7 @@ static int __devinit init_dma_hpt366(ide dma_old = inb(base + 2); - local_irq_save(flags); + local_irq_save_nort(flags); dma_new = dma_old; pci_read_config_byte(dev, hwif->channel ? 0x4b : 0x43, &masterdma); @@ -1339,7 +1339,7 @@ static int __devinit init_dma_hpt366(ide if (dma_new != dma_old) outb(dma_new, base + 2); - local_irq_restore(flags); + local_irq_restore_nort(flags); printk(KERN_INFO " %s: BM-DMA at 0x%04lx-0x%04lx\n", hwif->name, base, base + 7); Index: linux-2.6-tip/drivers/ide/ide-io.c =================================================================== --- linux-2.6-tip.orig/drivers/ide/ide-io.c +++ linux-2.6-tip/drivers/ide/ide-io.c @@ -949,7 +949,7 @@ void ide_timer_expiry (unsigned long dat /* disable_irq_nosync ?? */ disable_irq(hwif->irq); /* local CPU only, as if we were handling an interrupt */ - local_irq_disable(); + local_irq_disable_nort(); if (hwif->polling) { startstop = handler(drive); } else if (drive_is_ready(drive)) { Index: linux-2.6-tip/drivers/ide/ide-iops.c =================================================================== --- linux-2.6-tip.orig/drivers/ide/ide-iops.c +++ linux-2.6-tip/drivers/ide/ide-iops.c @@ -275,7 +275,7 @@ void ide_input_data(ide_drive_t *drive, unsigned long uninitialized_var(flags); if ((io_32bit & 2) && !mmio) { - local_irq_save(flags); + local_irq_save_nort(flags); ata_vlb_sync(io_ports->nsect_addr); } @@ -285,7 +285,7 @@ void ide_input_data(ide_drive_t *drive, insl(data_addr, buf, len / 4); if ((io_32bit & 2) && !mmio) - local_irq_restore(flags); + local_irq_restore_nort(flags); if ((len & 3) >= 2) { if (mmio) @@ -319,7 +319,7 @@ void ide_output_data(ide_drive_t *drive, unsigned long uninitialized_var(flags); if ((io_32bit & 2) && !mmio) { - local_irq_save(flags); + local_irq_save_nort(flags); ata_vlb_sync(io_ports->nsect_addr); } @@ -329,7 +329,7 @@ void ide_output_data(ide_drive_t *drive, outsl(data_addr, buf, len / 4); if ((io_32bit & 2) && !mmio) - local_irq_restore(flags); + local_irq_restore_nort(flags); if ((len & 3) >= 2) { if (mmio) @@ -507,12 +507,12 @@ static int __ide_wait_stat(ide_drive_t * if ((stat & ATA_BUSY) == 0) break; - local_irq_restore(flags); + local_irq_restore_nort(flags); *rstat = stat; return -EBUSY; } } - local_irq_restore(flags); + local_irq_restore_nort(flags); } /* * Allow status to settle, then read it again. @@ -679,17 +679,17 @@ int ide_driveid_update(ide_drive_t *driv printk("%s: CHECK for good STATUS\n", drive->name); return 0; } - local_irq_save(flags); + local_irq_save_nort(flags); SELECT_MASK(drive, 0); id = kmalloc(SECTOR_SIZE, GFP_ATOMIC); if (!id) { - local_irq_restore(flags); + local_irq_restore_nort(flags); return 0; } tp_ops->input_data(drive, NULL, id, SECTOR_SIZE); (void)tp_ops->read_status(hwif); /* clear drive IRQ */ - local_irq_enable(); - local_irq_restore(flags); + local_irq_enable_nort(); + local_irq_restore_nort(flags); ide_fix_driveid(id); drive->id[ATA_ID_UDMA_MODES] = id[ATA_ID_UDMA_MODES]; Index: linux-2.6-tip/drivers/ide/ide-probe.c =================================================================== --- linux-2.6-tip.orig/drivers/ide/ide-probe.c +++ linux-2.6-tip/drivers/ide/ide-probe.c @@ -196,10 +196,10 @@ static void do_identify(ide_drive_t *dri int bswap = 1; /* local CPU only; some systems need this */ - local_irq_save(flags); + local_irq_save_nort(flags); /* read 512 bytes of id info */ hwif->tp_ops->input_data(drive, NULL, id, SECTOR_SIZE); - local_irq_restore(flags); + local_irq_restore_nort(flags); drive->dev_flags |= IDE_DFLAG_ID_READ; #ifdef DEBUG @@ -813,7 +813,7 @@ static int ide_probe_port(ide_hwif_t *hw rc = 0; } - local_irq_restore(flags); + local_irq_restore_nort(flags); /* * Use cached IRQ number. It might be (and is...) changed by probe Index: linux-2.6-tip/drivers/ide/ide-taskfile.c =================================================================== --- linux-2.6-tip.orig/drivers/ide/ide-taskfile.c +++ linux-2.6-tip/drivers/ide/ide-taskfile.c @@ -219,7 +219,7 @@ static void ide_pio_sector(ide_drive_t * offset %= PAGE_SIZE; #ifdef CONFIG_HIGHMEM - local_irq_save(flags); + local_irq_save_nort(flags); #endif buf = kmap_atomic(page, KM_BIO_SRC_IRQ) + offset; @@ -239,7 +239,7 @@ static void ide_pio_sector(ide_drive_t * kunmap_atomic(buf, KM_BIO_SRC_IRQ); #ifdef CONFIG_HIGHMEM - local_irq_restore(flags); + local_irq_restore_nort(flags); #endif } @@ -430,7 +430,7 @@ static ide_startstop_t pre_task_out_intr } if ((drive->dev_flags & IDE_DFLAG_UNMASK) == 0) - local_irq_disable(); + local_irq_disable_nort(); ide_set_handler(drive, &task_out_intr, WAIT_WORSTCASE, NULL); ide_pio_datablock(drive, rq, 1); Index: linux-2.6-tip/drivers/input/gameport/gameport.c =================================================================== --- linux-2.6-tip.orig/drivers/input/gameport/gameport.c +++ linux-2.6-tip/drivers/input/gameport/gameport.c @@ -20,6 +20,7 @@ #include #include #include +#include #include /* HZ */ #include #include @@ -98,12 +99,12 @@ static int gameport_measure_speed(struct tx = 1 << 30; for(i = 0; i < 50; i++) { - local_irq_save(flags); + local_irq_save_nort(flags); GET_TIME(t1); for (t = 0; t < 50; t++) gameport_read(gameport); GET_TIME(t2); GET_TIME(t3); - local_irq_restore(flags); + local_irq_restore_nort(flags); udelay(i * 10); if ((t = DELTA(t2,t1) - DELTA(t3,t2)) < tx) tx = t; } @@ -122,11 +123,11 @@ static int gameport_measure_speed(struct tx = 1 << 30; for(i = 0; i < 50; i++) { - local_irq_save(flags); + local_irq_save_nort(flags); rdtscl(t1); for (t = 0; t < 50; t++) gameport_read(gameport); rdtscl(t2); - local_irq_restore(flags); + local_irq_restore_nort(flags); udelay(i * 10); if (t2 - t1 < tx) tx = t2 - t1; } Index: linux-2.6-tip/drivers/net/tulip/tulip_core.c =================================================================== --- linux-2.6-tip.orig/drivers/net/tulip/tulip_core.c +++ linux-2.6-tip/drivers/net/tulip/tulip_core.c @@ -1801,6 +1801,7 @@ static void __devexit tulip_remove_one ( pci_iounmap(pdev, tp->base_addr); free_netdev (dev); pci_release_regions (pdev); + pci_disable_device (pdev); pci_set_drvdata (pdev, NULL); /* pci_power_off (pdev, -1); */ Index: linux-2.6-tip/kernel/printk.c =================================================================== --- linux-2.6-tip.orig/kernel/printk.c +++ linux-2.6-tip/kernel/printk.c @@ -91,7 +91,7 @@ static int console_locked, console_suspe * It is also used in interesting ways to provide interlocking in * release_console_sem(). */ -static DEFINE_SPINLOCK(logbuf_lock); +static DEFINE_RAW_SPINLOCK(logbuf_lock); #define LOG_BUF_MASK (log_buf_len-1) #define LOG_BUF(idx) (log_buf[(idx) & LOG_BUF_MASK]) @@ -395,9 +395,13 @@ static void __call_console_drivers(unsig for (con = console_drivers; con; con = con->next) { if ((con->flags & CON_ENABLED) && con->write && - (cpu_online(smp_processor_id()) || - (con->flags & CON_ANYTIME))) + console_atomic_safe(con) && + (cpu_online(raw_smp_processor_id()) || + (con->flags & CON_ANYTIME))) { + set_printk_might_sleep(1); con->write(con, &LOG_BUF(start), end - start); + set_printk_might_sleep(0); + } } } @@ -511,6 +515,7 @@ static void zap_locks(void) spin_lock_init(&logbuf_lock); /* And make sure that we print immediately */ init_MUTEX(&console_sem); + zap_rt_locks(); } #if defined(CONFIG_PRINTK_TIME) @@ -592,7 +597,8 @@ static inline int can_use_console(unsign * interrupts disabled. It should return with 'lockbuf_lock' * released but interrupts still disabled. */ -static int acquire_console_semaphore_for_printk(unsigned int cpu) +static int acquire_console_semaphore_for_printk(unsigned int cpu, + unsigned long flags) { int retval = 0; @@ -613,6 +619,8 @@ static int acquire_console_semaphore_for } printk_cpu = UINT_MAX; spin_unlock(&logbuf_lock); + lockdep_on(); + local_irq_restore(flags); return retval; } static const char recursion_bug_msg [] = @@ -634,7 +642,7 @@ asmlinkage int vprintk(const char *fmt, preempt_disable(); /* This stops the holder of console_sem just where we want him */ raw_local_irq_save(flags); - this_cpu = smp_processor_id(); + this_cpu = raw_smp_processor_id(); /* * Ouch, printk recursed into itself! @@ -649,7 +657,8 @@ asmlinkage int vprintk(const char *fmt, */ if (!oops_in_progress) { recursion_bug = 1; - goto out_restore_irqs; + raw_local_irq_restore(flags); + goto out; } zap_locks(); } @@ -657,6 +666,7 @@ asmlinkage int vprintk(const char *fmt, lockdep_off(); spin_lock(&logbuf_lock); printk_cpu = this_cpu; + preempt_enable(); if (recursion_bug) { recursion_bug = 0; @@ -726,14 +736,10 @@ asmlinkage int vprintk(const char *fmt, * will release 'logbuf_lock' regardless of whether it * actually gets the semaphore or not. */ - if (acquire_console_semaphore_for_printk(this_cpu)) + if (acquire_console_semaphore_for_printk(this_cpu, flags)) release_console_sem(); - lockdep_on(); -out_restore_irqs: - raw_local_irq_restore(flags); - - preempt_enable(); +out: return printed_len; } EXPORT_SYMBOL(printk); @@ -996,15 +1002,35 @@ void release_console_sem(void) _con_start = con_start; _log_end = log_end; con_start = log_end; /* Flush */ + /* + * on PREEMPT_RT, call console drivers with + * interrupts enabled (if printk was called + * with interrupts disabled): + */ +#ifdef CONFIG_PREEMPT_RT + spin_unlock_irqrestore(&logbuf_lock, flags); +#else spin_unlock(&logbuf_lock); stop_critical_timings(); /* don't trace print latency */ +#endif call_console_drivers(_con_start, _log_end); start_critical_timings(); +#ifndef CONFIG_PREEMPT_RT local_irq_restore(flags); +#endif } console_locked = 0; - up(&console_sem); spin_unlock_irqrestore(&logbuf_lock, flags); + up(&console_sem); + /* + * On PREEMPT_RT kernels __wake_up may sleep, so wake syslogd + * up only if we are in a preemptible section. We normally dont + * printk from non-preemptible sections so this is for the emergency + * case only. + */ +#ifdef CONFIG_PREEMPT_RT + if (!in_atomic() && !irqs_disabled()) +#endif if (wake_klogd) wake_up_klogd(); } @@ -1280,6 +1306,23 @@ int printk_ratelimit(void) } EXPORT_SYMBOL(printk_ratelimit); +static DEFINE_RAW_SPINLOCK(warn_lock); + +void __WARN_ON(const char *func, const char *file, const int line) +{ + unsigned long flags; + + spin_lock_irqsave(&warn_lock, flags); + printk("%s/%d[CPU#%d]: BUG in %s at %s:%d\n", + current->comm, current->pid, raw_smp_processor_id(), + func, file, line); + dump_stack(); + spin_unlock_irqrestore(&warn_lock, flags); +} + +EXPORT_SYMBOL(__WARN_ON); + + /** * printk_timed_ratelimit - caller-controlled printk ratelimiting * @caller_jiffies: pointer to caller's state Index: linux-2.6-tip/lib/ratelimit.c =================================================================== --- linux-2.6-tip.orig/lib/ratelimit.c +++ linux-2.6-tip/lib/ratelimit.c @@ -14,7 +14,7 @@ #include #include -static DEFINE_SPINLOCK(ratelimit_lock); +static DEFINE_RAW_SPINLOCK(ratelimit_lock); /* * __ratelimit - rate limiting Index: linux-2.6-tip/drivers/oprofile/oprofilefs.c =================================================================== --- linux-2.6-tip.orig/drivers/oprofile/oprofilefs.c +++ linux-2.6-tip/drivers/oprofile/oprofilefs.c @@ -21,7 +21,7 @@ #define OPROFILEFS_MAGIC 0x6f70726f -DEFINE_SPINLOCK(oprofilefs_lock); +DEFINE_RAW_SPINLOCK(oprofilefs_lock); static struct inode *oprofilefs_get_inode(struct super_block *sb, int mode) { Index: linux-2.6-tip/drivers/pci/access.c =================================================================== --- linux-2.6-tip.orig/drivers/pci/access.c +++ linux-2.6-tip/drivers/pci/access.c @@ -12,7 +12,7 @@ * configuration space. */ -static DEFINE_SPINLOCK(pci_lock); +static DEFINE_RAW_SPINLOCK(pci_lock); /* * Wrappers for all PCI configuration access functions. They just check Index: linux-2.6-tip/drivers/video/console/vgacon.c =================================================================== --- linux-2.6-tip.orig/drivers/video/console/vgacon.c +++ linux-2.6-tip/drivers/video/console/vgacon.c @@ -51,7 +51,7 @@ #include