commit 343a5fbeef08baf2097b8cf4e26137cebe3cfef4 Author: Zefan Li Date: Wed Apr 27 18:55:30 2016 +0800 Linux 3.4.112 commit 97e0c9082179f7abe08d6921d3594749e2431542 Author: Andy Lutomirski Date: Wed Mar 16 14:14:21 2016 -0700 x86/iopl/64: Properly context-switch IOPL on Xen PV commit b7a584598aea7ca73140cb87b40319944dd3393f upstream. On Xen PV, regs->flags doesn't reliably reflect IOPL and the exit-to-userspace code doesn't change IOPL. We need to context switch it manually. I'm doing this without going through paravirt because this is specific to Xen PV. After the dust settles, we can merge this with the 32-bit code, tidy up the iopl syscall implementation, and remove the set_iopl pvop entirely. Fixes XSA-171. Reviewewd-by: Jan Beulich Signed-off-by: Andy Lutomirski Cc: Andrew Cooper Cc: Andy Lutomirski Cc: Boris Ostrovsky Cc: Borislav Petkov Cc: Brian Gerst Cc: David Vrabel Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Jan Beulich Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/693c3bd7aeb4d3c27c92c622b7d0f554a458173c.1458162709.git.luto@kernel.org Signed-off-by: Ingo Molnar [ kamal: backport to 3.19-stable: no X86_FEATURE_XENPV so just call xen_pv_domain() directly ] Acked-by: Andy Lutomirski kernel.org> Signed-off-by: Kamal Mostafa canonical.com> Signed-off-by: Zefan Li commit 0765dbc54e7dcd07581edf6cc0fafc8277bbe331 Author: Christophe Leroy Date: Wed May 6 17:26:47 2015 +0200 splice: sendfile() at once fails for big files commit 0ff28d9f4674d781e492bcff6f32f0fe48cf0fed upstream. Using sendfile with below small program to get MD5 sums of some files, it appear that big files (over 64kbytes with 4k pages system) get a wrong MD5 sum while small files get the correct sum. This program uses sendfile() to send a file to an AF_ALG socket for hashing. /* md5sum2.c */ #include #include #include #include #include #include #include #include #include int main(int argc, char **argv) { int sk = socket(AF_ALG, SOCK_SEQPACKET, 0); struct stat st; struct sockaddr_alg sa = { .salg_family = AF_ALG, .salg_type = "hash", .salg_name = "md5", }; int n; bind(sk, (struct sockaddr*)&sa, sizeof(sa)); for (n = 1; n < argc; n++) { int size; int offset = 0; char buf[4096]; int fd; int sko; int i; fd = open(argv[n], O_RDONLY); sko = accept(sk, NULL, 0); fstat(fd, &st); size = st.st_size; sendfile(sko, fd, &offset, size); size = read(sko, buf, sizeof(buf)); for (i = 0; i < size; i++) printf("%2.2x", buf[i]); printf(" %s\n", argv[n]); close(fd); close(sko); } exit(0); } Test below is done using official linux patch files. First result is with a software based md5sum. Second result is with the program above. root@vgoip:~# ls -l patch-3.6.* -rw-r--r-- 1 root root 64011 Aug 24 12:01 patch-3.6.2.gz -rw-r--r-- 1 root root 94131 Aug 24 12:01 patch-3.6.3.gz root@vgoip:~# md5sum patch-3.6.* b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz root@vgoip:~# ./md5sum2 patch-3.6.* b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz 5fd77b24e68bb24dcc72d6e57c64790e patch-3.6.3.gz After investivation, it appears that sendfile() sends the files by blocks of 64kbytes (16 times PAGE_SIZE). The problem is that at the end of each block, the SPLICE_F_MORE flag is missing, therefore the hashing operation is reset as if it was the end of the file. This patch adds SPLICE_F_MORE to the flags when more data is pending. With the patch applied, we get the correct sums: root@vgoip:~# md5sum patch-3.6.* b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz root@vgoip:~# ./md5sum2 patch-3.6.* b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz Signed-off-by: Christophe Leroy Signed-off-by: Jens Axboe Cc: Ben Hutchings Signed-off-by: Zefan Li commit b381fbc509052d07ccf8641fd7560a25d46aaf1e Author: Ben Hutchings Date: Sat Feb 13 02:34:52 2016 +0000 pipe: Fix buffer offset after partially failed read Quoting the RHEL advisory: > It was found that the fix for CVE-2015-1805 incorrectly kept buffer > offset and buffer length in sync on a failed atomic read, potentially > resulting in a pipe buffer state corruption. A local, unprivileged user > could use this flaw to crash the system or leak kernel memory to user > space. (CVE-2016-0774, Moderate) The same flawed fix was applied to stable branches from 2.6.32.y to 3.14.y inclusive, and I was able to reproduce the issue on 3.2.y. We need to give pipe_iov_copy_to_user() a separate offset variable and only update the buffer offset if it succeeds. References: https://rhn.redhat.com/errata/RHSA-2016-0103.html Signed-off-by: Ben Hutchings Cc: Jeffrey Vander Stoep Signed-off-by: Zefan Li commit e08cc94c26fab53cf0d2c655ecdcaf39d31dd18a Author: Ben Hutchings Date: Wed Nov 18 02:01:21 2015 +0000 usb: Use the USB_SS_MULT() macro to decode burst multiplier for log message commit 5377adb092664d336ac212499961cac5e8728794 upstream. usb_parse_ss_endpoint_companion() now decodes the burst multiplier correctly in order to check that it's <= 3, but still uses the wrong expression if warning that it's > 3. Fixes: ff30cbc8da42 ("usb: Use the USB_SS_MULT() macro to get the ...") Signed-off-by: Ben Hutchings Signed-off-by: Greg Kroah-Hartman Signed-off-by: Zefan Li commit 9237baa5c61ef9e11d8a71d02d73b53d8a2b7d01 Author: Nate Dailey Date: Mon Feb 29 10:43:58 2016 -0500 raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang commit ccfc7bf1f09d6190ef86693ddc761d5fe3fa47cb upstream. If raid1d is handling a mix of read and write errors, handle_read_error's call to freeze_array can get stuck. This can happen because, though the bio_end_io_list is initially drained, writes can be added to it via handle_write_finished as the retry_list is processed. These writes contribute to nr_pending but are not included in nr_queued. If a later entry on the retry_list triggers a call to handle_read_error, freeze array hangs waiting for nr_pending == nr_queued+extra. The writes on the bio_end_io_list aren't included in nr_queued so the condition will never be satisfied. To prevent the hang, include bio_end_io_list writes in nr_queued. There's probably a better way to handle decrementing nr_queued, but this seemed like the safest way to avoid breaking surrounding code. I'm happy to supply the script I used to repro this hang. Fixes: 55ce74d4bfe1b(md/raid1: ensure device failure recorded before write request returns.) Signed-off-by: Nate Dailey Signed-off-by: Shaohua Li Signed-off-by: Zefan Li commit 4b7e6b747c90c912340b5f3f3c876d81d60cf273 Author: Dāvis Mosāns Date: Fri Aug 21 07:29:22 2015 +0300 mvsas: Fix NULL pointer dereference in mvs_slot_task_free commit 2280521719e81919283b82902ac24058f87dfc1b upstream. When pci_pool_alloc fails in mvs_task_prep then task->lldd_task stays NULL but it's later used in mvs_abort_task as slot which is passed to mvs_slot_task_free causing NULL pointer dereference. Just return from mvs_slot_task_free when passed with NULL slot. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=101891 Signed-off-by: Dāvis Mosāns Reviewed-by: Tomas Henzl Reviewed-by: Johannes Thumshirn Signed-off-by: James Bottomley Signed-off-by: Zefan Li commit ede7386c3e8605a23c9ea89530894ce074397aa6 Author: Mike Snitzer Date: Thu Oct 22 10:56:40 2015 -0400 dm btree: fix leak of bufio-backed block in btree_split_beneath error path commit 4dcb8b57df3593dcb20481d9d6cf79d1dc1534be upstream. btree_split_beneath()'s error path had an outstanding FIXME that speaks directly to the potential for _not_ cleaning up a previously allocated bufio-backed block. Fix this by releasing the previously allocated bufio block using unlock_block(). Reported-by: Mikulas Patocka Signed-off-by: Mike Snitzer Acked-by: Joe Thornber Signed-off-by: Zefan Li commit e2c8a2c8819e0840e85aee5132f0925f92c047c8 Author: Jan Kara Date: Thu Oct 22 13:32:21 2015 -0700 mm: make sendfile(2) killable commit 296291cdd1629c308114504b850dc343eabc2782 upstream. Currently a simple program below issues a sendfile(2) system call which takes about 62 days to complete in my test KVM instance. int fd; off_t off = 0; fd = open("file", O_RDWR | O_TRUNC | O_SYNC | O_CREAT, 0644); ftruncate(fd, 2); lseek(fd, 0, SEEK_END); sendfile(fd, fd, &off, 0xfffffff); Now you should not ask kernel to do a stupid stuff like copying 256MB in 2-byte chunks and call fsync(2) after each chunk but if you do, sysadmin should have a way to stop you. We actually do have a check for fatal_signal_pending() in generic_perform_write() which triggers in this path however because we always succeed in writing something before the check is done, we return value > 0 from generic_perform_write() and thus the information about signal gets lost. Fix the problem by doing the signal check before writing anything. That way generic_perform_write() returns -EINTR, the error gets propagated up and the sendfile loop terminates early. Signed-off-by: Jan Kara Reported-by: Dmitry Vyukov Cc: Al Viro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Zefan Li commit b1d64b01ee7d043373a068c6b58b058876e56c7d Author: Ilia Mirkin Date: Tue Oct 20 01:15:39 2015 -0400 drm/nouveau/gem: return only valid domain when there's only one commit 2a6c521bb41ce862e43db46f52e7681d33e8d771 upstream. On nv50+, we restrict the valid domains to just the one where the buffer was originally created. However after the buffer is evicted to system memory, we might move it back to a different domain that was not originally valid. When sharing the buffer and retrieving its GEM_INFO data, we still want the domain that will be valid for this buffer in a pushbuf, not the one where it currently happens to be. This resolves fdo#92504 and several others. These are due to suspend evicting all buffers, making it more likely that they temporarily end up in the wrong place. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92504 Signed-off-by: Ilia Mirkin Signed-off-by: Ben Skeggs Signed-off-by: Zefan Li commit f9a3e62aa95e2f8c62e3d2c8188cbfe9fbfa1290 Author: Joerg Roedel Date: Tue Oct 20 14:59:36 2015 +0200 iommu/amd: Don't clear DTE flags when modifying it commit cbf3ccd09d683abf1cacd36e3640872ee912d99b upstream. During device assignment/deassignment the flags in the DTE get lost, which might cause spurious faults, for example when the device tries to access the system management range. Fix this by not clearing the flags with the rest of the DTE. Reported-by: G. Richard Bellamy Tested-by: G. Richard Bellamy Signed-off-by: Joerg Roedel Signed-off-by: Zefan Li commit b5fb46e134ce146fc20672e9b2a57da977c5f26f Author: Charles Keepax Date: Tue Oct 20 10:25:58 2015 +0100 ASoC: wm8904: Correct number of EQ registers commit 97aff2c03a1e4d343266adadb52313613efb027f upstream. There are 24 EQ registers not 25, I suspect this bug came about because the registers start at EQ1 not zero. The bug is relatively harmless as the extra register written is an unused one. Signed-off-by: Charles Keepax Signed-off-by: Mark Brown Signed-off-by: Zefan Li commit 394d806071beb3be46d7b3922af27039f283b57f Author: Herbert Xu Date: Mon Oct 19 18:23:57 2015 +0800 crypto: api - Only abort operations on fatal signal commit 3fc89adb9fa4beff31374a4bf50b3d099d88ae83 upstream. Currently a number of Crypto API operations may fail when a signal occurs. This causes nasty problems as the caller of those operations are often not in a good position to restart the operation. In fact there is currently no need for those operations to be interrupted by user signals at all. All we need is for them to be killable. This patch replaces the relevant calls of signal_pending with fatal_signal_pending, and wait_for_completion_interruptible with wait_for_completion_killable, respectively. Signed-off-by: Herbert Xu Signed-off-by: Zefan Li commit 4333b97f9ab15347378efda6370f9eea974ad94a Author: Laura Abbott Date: Mon Oct 12 11:30:13 2015 +0300 xhci: Add spurious wakeup quirk for LynxPoint-LP controllers commit fd7cd061adcf5f7503515ba52b6a724642a839c8 upstream. We received several reports of systems rebooting and powering on after an attempted shutdown. Testing showed that setting XHCI_SPURIOUS_WAKEUP quirk in addition to the XHCI_SPURIOUS_REBOOT quirk allowed the system to shutdown as expected for LynxPoint-LP xHCI controllers. Set the quirk back. Note that the quirk was originally introduced for LynxPoint and LynxPoint-LP just for this same reason. See: commit 638298dc66ea ("xhci: Fix spurious wakeups after S5 on Haswell") It was later limited to only concern HP machines as it caused regression on some machines, see both bug and commit: Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=66171 commit 6962d914f317 ("xhci: Limit the spurious wakeup fix only to HP machines") Later it was discovered that the powering on after shutdown was limited to LynxPoint-LP (Haswell-ULT) and that some non-LP HP machine suffered from spontaneous resume from S3 (which should not be related to the SPURIOUS_WAKEUP quirk at all). An attempt to fix this then removed the SPURIOUS_WAKEUP flag usage completely. commit b45abacde3d5 ("xhci: no switching back on non-ULT Haswell") Current understanding is that LynxPoint-LP (Haswell ULT) machines need the SPURIOUS_WAKEUP quirk, otherwise they will restart, and plain Lynxpoint (Haswell) machines may _not_ have the quirk set otherwise they again will restart. Signed-off-by: Laura Abbott Cc: Takashi Iwai Cc: Oliver Neukum [Added more history to commit message -Mathias] Signed-off-by: Mathias Nyman Signed-off-by: Greg Kroah-Hartman Signed-off-by: Zefan Li commit 720083a9e3257eee6efa05eae310d3250f5270a5 Author: Mathias Nyman Date: Mon Oct 12 11:30:12 2015 +0300 xhci: handle no ping response error properly commit 3b4739b8951d650becbcd855d7d6f18ac98a9a85 upstream. If a host fails to wake up a isochronous SuperSpeed device from U1/U2 in time for a isoch transfer it will generate a "No ping response error" Host will then move to the next transfer descriptor. Handle this case in the same way as missed service errors, tag the current TD as skipped and handle it on the next transfer event. Signed-off-by: Mathias Nyman Signed-off-by: Greg Kroah-Hartman Signed-off-by: Zefan Li commit d425239d4fa8ba4aa7ffa6d9e4a2584563bdad16 Author: Christian Zander Date: Wed Jun 10 09:41:45 2015 -0700 iommu/vt-d: fix range computation when making room for large pages commit ba2374fd2bf379f933773811fdb06cb6a5445f41 upstream. In preparation for the installation of a large page, any small page tables that may still exist in the target IOV address range are removed. However, if a scatter/gather list entry is large enough to fit more than one large page, the address space for any subsequent large pages is not cleared of conflicting small page tables. This can cause legitimate mapping requests to fail with errors of the form below, potentially followed by a series of IOMMU faults: ERROR: DMA PTE for vPFN 0xfde00 already set (to 7f83a4003 not 7e9e00083) In this example, a 4MiB scatter/gather list entry resulted in the successful installation of a large page @ vPFN 0xfdc00, followed by a failed attempt to install another large page @ vPFN 0xfde00, due to the presence of a pointer to a small page table @ 0x7f83a4000. To address this problem, compute the number of large pages that fit into a given scatter/gather list entry, and use it to derive the last vPFN covered by the large page(s). Signed-off-by: Christian Zander Signed-off-by: David Woodhouse [bwh: Backported to 3.2: - Add the lvl_pages variable, added by an earlier commit upstream - Also change arguments to dma_pte_clear_range(), which is called by dma_pte_free_pagetable() upstream] Signed-off-by: Ben Hutchings Signed-off-by: Zefan Li commit 2147886bdb36b55873f83ccd0322b2f390bfee4b Author: Russell King Date: Fri Oct 9 20:43:33 2015 +0100 crypto: ahash - ensure statesize is non-zero commit 8996eafdcbad149ac0f772fb1649fbb75c482a6a upstream. Unlike shash algorithms, ahash drivers must implement export and import as their descriptors may contain hardware state and cannot be exported as is. Unfortunately some ahash drivers did not provide them and end up causing crashes with algif_hash. This patch adds a check to prevent these drivers from registering ahash algorithms until they are fixed. Signed-off-by: Russell King Signed-off-by: Herbert Xu Signed-off-by: Zefan Li commit efb049430f508a269d6b9214e3a7138edec0fbfb Author: Cathy Avery Date: Fri Oct 2 09:35:01 2015 -0400 xen-blkfront: check for null drvdata in blkback_changed (XenbusStateClosing) commit a54c8f0f2d7df525ff997e2afe71866a1a013064 upstream. xen-blkfront will crash if the check to talk_to_blkback() in blkback_changed()(XenbusStateInitWait) returns an error. The driver data is freed and info is set to NULL. Later during the close process via talk_to_blkback's call to xenbus_dev_fatal() the null pointer is passed to and dereference in blkfront_closing. Signed-off-by: Cathy Avery Signed-off-by: Konrad Rzeszutek Wilk Signed-off-by: Zefan Li commit de7f6bfbcce6468304cb9c1c4bb2e5b7c6616778 Author: Takashi Iwai Date: Mon Oct 5 16:55:09 2015 +0200 ALSA: synth: Fix conflicting OSS device registration on AWE32 commit 225db5762dc1a35b26850477ffa06e5cd0097243 upstream. When OSS emulation is loaded on ISA SB AWE32 chip, we get now kernel warnings like: WARNING: CPU: 0 PID: 2791 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x51/0x80() sysfs: cannot create duplicate filename '/devices/isa/sbawe.0/sound/card0/seq-oss-0-0' It's because both emux synth and opl3 drivers try to register their OSS device object with the same static index number 0. This hasn't been a big problem until the recent rewrite of device management code (that exposes sysfs at the same time), but it's been an obvious bug. This patch works around it just by using a different index number of emux synth object. There can be a more elegant way to fix, but it's enough for now, as this code won't be touched so often, in anyway. Reported-and-tested-by: Michael Shell Signed-off-by: Takashi Iwai Signed-off-by: Zefan Li commit 3f258c664a0c84414c4ec17128fc877720866425 Author: Jann Horn Date: Sun Oct 4 19:29:12 2015 +0200 drivers/tty: require read access for controlling terminal commit 0c55627167870255158db1cde0d28366f91c8872 upstream. This is mostly a hardening fix, given that write-only access to other users' ttys is usually only given through setgid tty executables. Signed-off-by: Jann Horn Signed-off-by: Greg Kroah-Hartman [lizf: Backported to 3.4: adjust context] Signed-off-by: Zefan Li commit 3a0b4c1d2b134d8d33793b66de7cc91089c0e33a Author: Kosuke Tatsukawa Date: Fri Oct 2 08:27:05 2015 +0000 tty: fix stall caused by missing memory barrier in drivers/tty/n_tty.c commit e81107d4c6bd098878af9796b24edc8d4a9524fd upstream. My colleague ran into a program stall on a x86_64 server, where n_tty_read() was waiting for data even if there was data in the buffer in the pty. kernel stack for the stuck process looks like below. #0 [ffff88303d107b58] __schedule at ffffffff815c4b20 #1 [ffff88303d107bd0] schedule at ffffffff815c513e #2 [ffff88303d107bf0] schedule_timeout at ffffffff815c7818 #3 [ffff88303d107ca0] wait_woken at ffffffff81096bd2 #4 [ffff88303d107ce0] n_tty_read at ffffffff8136fa23 #5 [ffff88303d107dd0] tty_read at ffffffff81368013 #6 [ffff88303d107e20] __vfs_read at ffffffff811a3704 #7 [ffff88303d107ec0] vfs_read at ffffffff811a3a57 #8 [ffff88303d107f00] sys_read at ffffffff811a4306 #9 [ffff88303d107f50] entry_SYSCALL_64_fastpath at ffffffff815c86d7 There seems to be two problems causing this issue. First, in drivers/tty/n_tty.c, __receive_buf() stores the data and updates ldata->commit_head using smp_store_release() and then checks the wait queue using waitqueue_active(). However, since there is no memory barrier, __receive_buf() could return without calling wake_up_interactive_poll(), and at the same time, n_tty_read() could start to wait in wait_woken() as in the following chart. __receive_buf() n_tty_read() ------------------------------------------------------------------------ if (waitqueue_active(&tty->read_wait)) /* Memory operations issued after the RELEASE may be completed before the RELEASE operation has completed */ add_wait_queue(&tty->read_wait, &wait); ... if (!input_available_p(tty, 0)) { smp_store_release(&ldata->commit_head, ldata->read_head); ... timeout = wait_woken(&wait, TASK_INTERRUPTIBLE, timeout); ------------------------------------------------------------------------ The second problem is that n_tty_read() also lacks a memory barrier call and could also cause __receive_buf() to return without calling wake_up_interactive_poll(), and n_tty_read() to wait in wait_woken() as in the chart below. __receive_buf() n_tty_read() ------------------------------------------------------------------------ spin_lock_irqsave(&q->lock, flags); /* from add_wait_queue() */ ... if (!input_available_p(tty, 0)) { /* Memory operations issued after the RELEASE may be completed before the RELEASE operation has completed */ smp_store_release(&ldata->commit_head, ldata->read_head); if (waitqueue_active(&tty->read_wait)) __add_wait_queue(q, wait); spin_unlock_irqrestore(&q->lock,flags); /* from add_wait_queue() */ ... timeout = wait_woken(&wait, TASK_INTERRUPTIBLE, timeout); ------------------------------------------------------------------------ There are also other places in drivers/tty/n_tty.c which have similar calls to waitqueue_active(), so instead of adding many memory barrier calls, this patch simply removes the call to waitqueue_active(), leaving just wake_up*() behind. This fixes both problems because, even though the memory access before or after the spinlocks in both wake_up*() and add_wait_queue() can sneak into the critical section, it cannot go past it and the critical section assures that they will be serialized (please see "INTER-CPU ACQUIRING BARRIER EFFECTS" in Documentation/memory-barriers.txt for a better explanation). Moreover, the resulting code is much simpler. Latency measurement using a ping-pong test over a pty doesn't show any visible performance drop. Signed-off-by: Kosuke Tatsukawa Signed-off-by: Greg Kroah-Hartman [lizf: Backported to 3.4: - adjust context - s/wake_up_interruptible_poll/wake_up_interruptible/ - drop changes to __receive_buf() and n_tty_set_termios()] Signed-off-by: Zefan Li commit d6706f05d3d4246d94fc126a3632072a4c8daef6 Author: Vincent Palatin Date: Thu Oct 1 14:10:22 2015 -0700 usb: Add device quirk for Logitech PTZ cameras commit 72194739f54607bbf8cfded159627a2015381557 upstream. Add a device quirk for the Logitech PTZ Pro Camera and its sibling the ConferenceCam CC3000e Camera. This fixes the failed camera enumeration on some boot, particularly on machines with fast CPU. Tested by connecting a Logitech PTZ Pro Camera to a machine with a Haswell Core i7-4600U CPU @ 2.10GHz, and doing thousands of reboot cycles while recording the kernel logs and taking camera picture after each boot. Before the patch, more than 7% of the boots show some enumeration transfer failures and in a few of them, the kernel is giving up before actually enumerating the webcam. After the patch, the enumeration has been correct on every reboot. Signed-off-by: Vincent Palatin Signed-off-by: Greg Kroah-Hartman Signed-off-by: Zefan Li commit 18316131ccdd4cba57f2f96239143e2a8754ece2 Author: Yao-Wen Mao Date: Mon Aug 31 14:24:09 2015 +0800 USB: Add reset-resume quirk for two Plantronics usb headphones. commit 8484bf2981b3d006426ac052a3642c9ce1d8d980 upstream. These two headphones need a reset-resume quirk to properly resume to original volume level. Signed-off-by: Yao-Wen Mao Signed-off-by: Greg Kroah-Hartman Signed-off-by: Zefan Li commit b81cc21d0356a38838502ae2fae709382a324b75 Author: John Stultz Date: Mon Sep 14 18:05:20 2015 -0700 clocksource: Fix abs() usage w/ 64bit values commit 67dfae0cd72fec5cd158b6e5fb1647b7dbe0834c upstream. This patch fixes one cases where abs() was being used with 64-bit nanosecond values, where the result may be capped at 32-bits. This potentially could cause watchdog false negatives on 32-bit systems, so this patch addresses the issue by using abs64(). Signed-off-by: John Stultz Cc: Prarit Bhargava Cc: Richard Cochran Cc: Ingo Molnar Link: http://lkml.kernel.org/r/1442279124-7309-2-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner [lizf: Backported to 3.4: adjust context] Signed-off-by: Zefan Li commit e9c235998ff71a2f837c778dbce6a8fc674bdca0 Author: Mel Gorman Date: Thu Oct 1 15:36:57 2015 -0700 mm: hugetlbfs: skip shared VMAs when unmapping private pages to satisfy a fault commit 2f84a8990ebbe235c59716896e017c6b2ca1200f upstream. SunDong reported the following on https://bugzilla.kernel.org/show_bug.cgi?id=103841 I think I find a linux bug, I have the test cases is constructed. I can stable recurring problems in fedora22(4.0.4) kernel version, arch for x86_64. I construct transparent huge page, when the parent and child process with MAP_SHARE, MAP_PRIVATE way to access the same huge page area, it has the opportunity to lead to huge page copy on write failure, and then it will munmap the child corresponding mmap area, but then the child mmap area with VM_MAYSHARE attributes, child process munmap this area can trigger VM_BUG_ON in set_vma_resv_flags functions (vma - > vm_flags & VM_MAYSHARE). There were a number of problems with the report (e.g. it's hugetlbfs that triggers this, not transparent huge pages) but it was fundamentally correct in that a VM_BUG_ON in set_vma_resv_flags() can be triggered that looks like this vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000 next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800 prot 8000000000000027 anon_vma (null) vm_ops ffffffff8182a7a0 pgoff 0 file ffff88106bdb9800 private_data (null) flags: 0x84400fb(read|write|shared|mayread|maywrite|mayexec|mayshare|dontexpand|hugetlb) ------------ kernel BUG at mm/hugetlb.c:462! SMP Modules linked in: xt_pkttype xt_LOG xt_limit [..] CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1 Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012 set_vma_resv_flags+0x2d/0x30 The VM_BUG_ON is correct because private and shared mappings have different reservation accounting but the warning clearly shows that the VMA is shared. When a private COW fails to allocate a new page then only the process that created the VMA gets the page -- all the children unmap the page. If the children access that data in the future then they get killed. The problem is that the same file is mapped shared and private. During the COW, the allocation fails, the VMAs are traversed to unmap the other private pages but a shared VMA is found and the bug is triggered. This patch identifies such VMAs and skips them. Signed-off-by: Mel Gorman Reported-by: SunDong Reviewed-by: Michal Hocko Cc: Andrea Arcangeli Cc: Hugh Dickins Cc: Naoya Horiguchi Cc: David Rientjes Reviewed-by: Naoya Horiguchi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Zefan Li commit a03288decdc919c95a23729e6f9c8934c5809510 Author: Ben Hutchings Date: Sat Sep 26 12:23:56 2015 +0100 genirq: Fix race in register_irq_proc() commit 95c2b17534654829db428f11bcf4297c059a2a7e upstream. Per-IRQ directories in procfs are created only when a handler is first added to the irqdesc, not when the irqdesc is created. In the case of a shared IRQ, multiple tasks can race to create a directory. This race condition seems to have been present forever, but is easier to hit with async probing. Signed-off-by: Ben Hutchings Link: http://lkml.kernel.org/r/1443266636.2004.2.camel@decadent.org.uk Signed-off-by: Thomas Gleixner Signed-off-by: Zefan Li commit 12bdf057277f6ef12aee6e62e04d920d558d197f Author: Thomas Gleixner Date: Wed Sep 30 08:38:22 2015 +0000 x86/process: Add proper bound checks in 64bit get_wchan() commit eddd3826a1a0190e5235703d1e666affa4d13b96 upstream. Dmitry Vyukov reported the following using trinity and the memory error detector AddressSanitizer (https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel). [ 124.575597] ERROR: AddressSanitizer: heap-buffer-overflow on address ffff88002e280000 [ 124.576801] ffff88002e280000 is located 131938492886538 bytes to the left of 28857600-byte region [ffffffff81282e0a, ffffffff82e0830a) [ 124.578633] Accessed by thread T10915: [ 124.579295] inlined in describe_heap_address ./arch/x86/mm/asan/report.c:164 [ 124.579295] #0 ffffffff810dd277 in asan_report_error ./arch/x86/mm/asan/report.c:278 [ 124.580137] #1 ffffffff810dc6a0 in asan_check_region ./arch/x86/mm/asan/asan.c:37 [ 124.581050] #2 ffffffff810dd423 in __tsan_read8 ??:0 [ 124.581893] #3 ffffffff8107c093 in get_wchan ./arch/x86/kernel/process_64.c:444 The address checks in the 64bit implementation of get_wchan() are wrong in several ways: - The lower bound of the stack is not the start of the stack page. It's the start of the stack page plus sizeof (struct thread_info) - The upper bound must be: top_of_stack - TOP_OF_KERNEL_STACK_PADDING - 2 * sizeof(unsigned long). The 2 * sizeof(unsigned long) is required because the stack pointer points at the frame pointer. The layout on the stack is: ... IP FP ... IP FP. So we need to make sure that both IP and FP are in the bounds. Fix the bound checks and get rid of the mix of numeric constants, u64 and unsigned long. Making all unsigned long allows us to use the same function for 32bit as well. Use READ_ONCE() when accessing the stack. This does not prevent a concurrent wakeup of the task and the stack changing, but at least it avoids TOCTOU. Also check task state at the end of the loop. Again that does not prevent concurrent changes, but it avoids walking for nothing. Add proper comments while at it. Reported-by: Dmitry Vyukov Reported-by: Sasha Levin Based-on-patch-from: Wolfram Gloger Signed-off-by: Thomas Gleixner Reviewed-by: Borislav Petkov Reviewed-by: Dmitry Vyukov Cc: Andrey Ryabinin Cc: Andy Lutomirski Cc: Andrey Konovalov Cc: Kostya Serebryany Cc: Alexander Potapenko Cc: kasan-dev Cc: Denys Vlasenko Cc: Andi Kleen Cc: Wolfram Gloger Link: http://lkml.kernel.org/r/20150930083302.694788319@linutronix.de Signed-off-by: Thomas Gleixner [lizf: Backported to 3.4: - s/READ_ONCE/ACCESS_ONCE - remove TOP_OF_KERNEL_STACK_PADDING] Signed-off-by: Zefan Li commit 127cf7cb2ae3657e3ee4141216997b0a8a25df99 Author: shengyong Date: Mon Sep 28 17:57:19 2015 +0000 UBI: return ENOSPC if no enough space available commit 7c7feb2ebfc9c0552c51f0c050db1d1a004faac5 upstream. UBI: attaching mtd1 to ubi0 UBI: scanning is finished UBI error: init_volumes: not enough PEBs, required 706, available 686 UBI error: ubi_wl_init: no enough physical eraseblocks (-20, need 1) UBI error: ubi_attach_mtd_dev: failed to attach mtd1, error -12 <= NOT ENOMEM UBI error: ubi_init: cannot attach mtd1 If available PEBs are not enough when initializing volumes, return -ENOSPC directly. If available PEBs are not enough when initializing WL, return -ENOSPC instead of -ENOMEM. Signed-off-by: Sheng Yong Signed-off-by: Richard Weinberger Reviewed-by: David Gstir Signed-off-by: Zefan Li commit 15acb368e63874fbb7b34d424185921e9a8dc996 Author: Richard Weinberger Date: Tue Sep 22 23:58:07 2015 +0200 UBI: Validate data_size commit 281fda27673f833a01d516658a64d22a32c8e072 upstream. Make sure that data_size is less than LEB size. Otherwise a handcrafted UBI image is able to trigger an out of bounds memory access in ubi_compare_lebs(). Signed-off-by: Richard Weinberger Reviewed-by: David Gstir [lizf: Backported to 3.4: use dbg_err() instead of ubi_err()]; Signed-off-by: Zefan Li commit 9ed559d3d6fd06cb3a20e120f8b55be46c8d33db Author: Malcolm Crossley Date: Mon Sep 28 11:36:52 2015 +0100 x86/xen: Do not clip xen_e820_map to xen_e820_map_entries when sanitizing map commit 64c98e7f49100b637cd20a6c63508caed6bbba7a upstream. Sanitizing the e820 map may produce extra E820 entries which would result in the topmost E820 entries being removed. The removed entries would typically include the top E820 usable RAM region and thus result in the domain having signicantly less RAM available to it. Fix by allowing sanitize_e820_map to use the full size of the allocated E820 array. Signed-off-by: Malcolm Crossley Reviewed-by: Boris Ostrovsky Signed-off-by: David Vrabel [lizf: Backported to 3.4: s/map/xen_e820_map] Signed-off-by: Zefan Li commit d311156acfbc6ec43df3898f7cd95313b6877174 Author: Andreas Schwab Date: Wed Sep 23 23:12:09 2015 +0200 m68k: Define asmlinkage_protect commit 8474ba74193d302e8340dddd1e16c85cc4b98caf upstream. Make sure the compiler does not modify arguments of syscall functions. This can happen if the compiler generates a tailcall to another function. For example, without asmlinkage_protect sys_openat is compiled into this function: sys_openat: clr.l %d0 move.w 18(%sp),%d0 move.l %d0,16(%sp) jbra do_sys_open Note how the fourth argument is modified in place, modifying the register %d4 that gets restored from this stack slot when the function returns to user-space. The caller may expect the register to be unmodified across system calls. Signed-off-by: Andreas Schwab Signed-off-by: Geert Uytterhoeven Signed-off-by: Zefan Li commit edb236d7d193cb7ae1dec654183be24b60334474 Author: Felix Fietkau Date: Thu Sep 24 16:59:46 2015 +0200 ath9k: declare required extra tx headroom commit 029cd0370241641eb70235d205aa0b90c84dce44 upstream. ath9k inserts padding between the 802.11 header and the data area (to align it). Since it didn't declare this extra required headroom, this led to some nasty issues like randomly dropped packets in some setups. Signed-off-by: Felix Fietkau Signed-off-by: Kalle Valo Signed-off-by: Zefan Li commit f5499bfc0b4645b44537cdbda9e58f5dfd0cb2af Author: Joseph Qi Date: Tue Sep 22 14:59:20 2015 -0700 ocfs2/dlm: fix deadlock when dispatch assert master commit 012572d4fc2e4ddd5c8ec8614d51414ec6cae02a upstream. The order of the following three spinlocks should be: dlm_domain_lock < dlm_ctxt->spinlock < dlm_lock_resource->spinlock But dlm_dispatch_assert_master() is called while holding dlm_ctxt->spinlock and dlm_lock_resource->spinlock, and then it calls dlm_grab() which will take dlm_domain_lock. Once another thread (for example, dlm_query_join_handler) has already taken dlm_domain_lock, and tries to take dlm_ctxt->spinlock deadlock happens. Signed-off-by: Joseph Qi Cc: Joel Becker Cc: Mark Fasheh Cc: "Junxiao Bi" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds [lizf: Backported to 3.4: adjust context] Signed-off-by: Zefan Li commit 6fa2028d94598e9a98ef4420ef236301be6b3cfd Author: Peter Seiderer Date: Thu Sep 17 21:40:12 2015 +0200 cifs: use server timestamp for ntlmv2 authentication commit 98ce94c8df762d413b3ecb849e2b966b21606d04 upstream. Linux cifs mount with ntlmssp against an Mac OS X (Yosemite 10.10.5) share fails in case the clocks differ more than +/-2h: digest-service: digest-request: od failed with 2 proto=ntlmv2 digest-service: digest-request: kdc failed with -1561745592 proto=ntlmv2 Fix this by (re-)using the given server timestamp for the ntlmv2 authentication (as Windows 7 does). A related problem was also reported earlier by Namjae Jaen (see below): Windows machine has extended security feature which refuse to allow authentication when there is time difference between server time and client time when ntlmv2 negotiation is used. This problem is prevalent in embedded enviornment where system time is set to default 1970. Modern servers send the server timestamp in the TargetInfo Av_Pair structure in the challenge message [see MS-NLMP 2.2.2.1] In [MS-NLMP 3.1.5.1.2] it is explicitly mentioned that the client must use the server provided timestamp if present OR current time if it is not Reported-by: Namjae Jeon Signed-off-by: Peter Seiderer Signed-off-by: Steve French [lizf: Backported to 3.4: adjust context] Signed-off-by: Zefan Li commit 276a6c94ded4ad29a4f86381c5f80b840b8a7030 Author: Mathias Nyman Date: Mon Sep 21 17:46:16 2015 +0300 xhci: change xhci 1.0 only restrictions to support xhci 1.1 commit dca7794539eff04b786fb6907186989e5eaaa9c2 upstream. Some changes between xhci 0.96 and xhci 1.0 specifications forced us to check the hci version in code, some of these checks were implemented as hci_version == 1.0, which will not work with new xhci 1.1 controllers. xhci 1.1 behaves similar to xhci 1.0 in these cases, so change these checks to hci_version >= 1.0 Signed-off-by: Mathias Nyman Signed-off-by: Greg Kroah-Hartman Signed-off-by: Zefan Li commit fa8600fa40e5ca152b8ece604579535b4296145f Author: Roger Quadros Date: Mon Sep 21 17:46:13 2015 +0300 usb: xhci: Clear XHCI_STATE_DYING on start commit e5bfeab0ad515b4f6df39fe716603e9dc6d3dfd0 upstream. For whatever reason if XHCI died in the previous instant then it will never recover on the next xhci_start unless we clear the DYING flag. Signed-off-by: Roger Quadros Signed-off-by: Mathias Nyman Signed-off-by: Greg Kroah-Hartman Signed-off-by: Zefan Li commit 63a2bddd9e9d605149860e819cca40f79835eeed Author: Mathias Nyman Date: Mon Sep 21 17:46:10 2015 +0300 xhci: give command abortion one more chance before killing xhci commit a6809ffd1687b3a8c192960e69add559b9d32649 upstream. We want to give the command abortion an additional try to stop the command ring before we completely hose xhci. Tested-by: Vincent Pelletier Signed-off-by: Mathias Nyman Signed-off-by: Greg Kroah-Hartman [lizf: Backported to 3.4: call handshake() instead of xhci_handshake()] Signed-off-by: Zefan Li commit 40dba0fdd43511ea6f24b432c39edb39479eff59 Author: Mathias Nyman Date: Mon Sep 21 17:46:09 2015 +0300 usb: Use the USB_SS_MULT() macro to get the burst multiplier. commit ff30cbc8da425754e8ab96904db1d295bd034f27 upstream. Bits 1:0 of the bmAttributes are used for the burst multiplier. The rest of the bits used to be reserved (zero), but USB3.1 takes bit 7 into use. Use the existing USB_SS_MULT() macro instead to make sure the mult value and hence max packet calculations are correct for USB3.1 devices. Note that burst multiplier in bmAttributes is zero based and that the USB_SS_MULT() macro adds one. Signed-off-by: Mathias Nyman Signed-off-by: Greg Kroah-Hartman Signed-off-by: Zefan Li commit 7d148ce4451a12ca1b226373ccda950718154140 Author: Paolo Bonzini Date: Fri Sep 18 17:33:04 2015 +0200 KVM: x86: trap AMD MSRs for the TSeg base and mask commit 3afb1121800128aae9f5722e50097fcf1a9d4d88 upstream. These have roughly the same purpose as the SMRR, which we do not need to implement in KVM. However, Linux accesses MSR_K8_TSEG_ADDR at boot, which causes problems when running a Xen dom0 under KVM. Just return 0, meaning that processor protection of SMRAM is not in effect. Reported-by: M A Young Acked-by: Borislav Petkov Signed-off-by: Paolo Bonzini Signed-off-by: Zefan Li commit 42bffe1abdfe0543b54c0d29550e9a2bcd9edd34 Author: Mark Brown Date: Sat Sep 19 07:12:34 2015 -0700 regmap: debugfs: Don't bother actually printing when calculating max length commit 176fc2d5770a0990eebff903ba680d2edd32e718 upstream. The in kernel snprintf() will conveniently return the actual length of the printed string even if not given an output beffer at all so just do that rather than relying on the user to pass in a suitable buffer, ensuring that we don't need to worry if the buffer was truncated due to the size of the buffer passed in. Reported-by: Rasmus Villemoes Signed-off-by: Mark Brown Signed-off-by: Zefan Li commit f4524d728fa32ea451c6034a6dccba0159ecc153 Author: Mark Brown Date: Sat Sep 19 07:00:18 2015 -0700 regmap: debugfs: Ensure we don't underflow when printing access masks commit b763ec17ac762470eec5be8ebcc43e4f8b2c2b82 upstream. If a read is attempted which is smaller than the line length then we may underflow the subtraction we're doing with the unsigned size_t type so move some of the calculation to be additions on the right hand side instead in order to avoid this. Reported-by: Rasmus Villemoes Signed-off-by: Mark Brown Signed-off-by: Zefan Li commit 413e7340b4ceac04d90253c25472a221e2f529ff Author: Jeff Mahoney Date: Fri Sep 11 21:44:17 2015 -0400 btrfs: skip waiting on ordered range for special files commit a30e577c96f59b1e1678ea5462432b09bf7d5cbc upstream. In btrfs_evict_inode, we properly truncate the page cache for evicted inodes but then we call btrfs_wait_ordered_range for every inode as well. It's the right thing to do for regular files but results in incorrect behavior for device inodes for block devices. filemap_fdatawrite_range gets called with inode->i_mapping which gets resolved to the block device inode before getting passed to wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi. What happens next depends on whether there's an open file handle associated with the inode. If there is, we write to the block device, which is unexpected behavior. If there isn't, we through normally and inode->i_data is used. We can also end up racing against open/close which can result in crashes when i_mapping points to a block device inode that has been closed. Since there can't be any page cache associated with special file inodes, it's safe to skip the btrfs_wait_ordered_range call entirely and avoid the problem. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911 Tested-by: Christoph Biedl Signed-off-by: Jeff Mahoney Reviewed-by: Filipe Manana Signed-off-by: Zefan Li commit 557b53d8d4c7dbcc0615c2eeace00fd91bd2a17f Author: Guenter Roeck Date: Sun Sep 6 01:46:54 2015 +0300 spi: Fix documentation of spi_alloc_master() commit a394d635193b641f2c86ead5ada5b115d57c51f8 upstream. Actually, spi_master_put() after spi_alloc_master() must _not_ be followed by kfree(). The memory is already freed with the call to spi_master_put() through spi_master_class, which registers a release function. Calling both spi_master_put() and kfree() results in often nasty (and delayed) crashes elsewhere in the kernel, often in the networking stack. This reverts commit eb4af0f5349235df2e4a5057a72fc8962d00308a. Link to patch and concerns: https://lkml.org/lkml/2012/9/3/269 or http://lkml.iu.edu/hypermail/linux/kernel/1209.0/00790.html Alexey Klimov: This revert becomes valid after 94c69f765f1b4a658d96905ec59928e3e3e07e6a when spi-imx.c has been fixed and there is no need to call kfree() so comment for spi_alloc_master() should be fixed. Signed-off-by: Guenter Roeck Signed-off-by: Alexey Klimov Signed-off-by: Mark Brown Signed-off-by: Zefan Li commit 47d7e7e7c22274dfaba2e25d4a8baff86989757c Author: Tan, Jui Nee Date: Tue Sep 1 10:22:51 2015 +0800 spi: spi-pxa2xx: Check status register to determine if SSSR_TINT is disabled commit 02bc933ebb59208f42c2e6305b2c17fd306f695d upstream. On Intel Baytrail, there is case when interrupt handler get called, no SPI message is captured. The RX FIFO is indeed empty when RX timeout pending interrupt (SSSR_TINT) happens. Use the BIOS version where both HSUART and SPI are on the same IRQ. Both drivers are using IRQF_SHARED when calling the request_irq function. When running two separate and independent SPI and HSUART application that generate data traffic on both components, user will see messages like below on the console: pxa2xx-spi pxa2xx-spi.0: bad message state in interrupt handler This commit will fix this by first checking Receiver Time-out Interrupt, if it is disabled, ignore the request and return without servicing. Signed-off-by: Tan, Jui Nee Acked-by: Jarkko Nikula Signed-off-by: Mark Brown Signed-off-by: Zefan Li commit 2f5b9f275560942c8c1a56270bb1d7c8cbd885df Author: Dan Carpenter Date: Thu Oct 29 16:37:54 2015 +0300 drm: crtc: integer overflow in drm_property_create_blob() commit 9ac0934bbe52290e4e4c2a58ec41cab9b6ca8c96 upstream. The size here comes from the user via the ioctl, it is a number between 1-u32max so the addition here could overflow on 32 bit systems. Fixes: f453ba046074 ('DRM: add mode setting support') Signed-off-by: Dan Carpenter Reviewed-by: Daniel Stone Signed-off-by: Dave Airlie [lizf: Backported to 3.4: adjust context] Signed-off-by: Zefan Li commit 6126604d3fafa03231cadbdbef2d8cd5faa00085 Author: NeilBrown Date: Sat Oct 24 16:02:16 2015 +1100 md/raid1: don't clear bitmap bit when bad-block-list write fails. commit bd8688a199b864944bf62eebed0ca13b46249453 upstream. When a write fails and a bad-block-list is present, we can update the bad-block-list instead of writing the data. If this succeeds then it is OK clear the relevant bitmap-bit as no further 'sync' of the block is needed. However if writing the bad-block-list fails then we need to treat the write as failed and particularly must not clear the bitmap bit. Otherwise the device can be re-added (after any hardware connection issues are resolved) and because the relevant bit in the bitmap is clear, that block will not be resynced. This leads to data corruption. We already delay the final bio_endio() on the write until the bad-block-list is written so that when the write returns: either that data is safe, the bad-block record is safe, or the fact that the device is faulty is safe. However we *don't* delay the clearing of the bitmap, so the bitmap bit can be recorded as cleared before we know if the bad-block-list was written safely. So: delay that until the write really is safe. i.e. move the call to close_write() until just before calling bio_endio(), and recheck the 'is array degraded' status before making that call. This bug goes back to v3.1 when bad-block-lists were introduced, though it only affects arrays created with mdadm-3.3 or later as only those have bad-block lists. Backports will require at least Commit: 55ce74d4bfe1 ("md/raid1: ensure device failure recorded before write request returns.") as well. I'll send that to 'stable' separately. Note that of the two tests of R1BIO_WriteError that this patch adds, the first is certain to fail and the second is certain to succeed. However doing it this way makes the patch more obviously correct. I will tidy the code up in a future merge window. Reported-and-tested-by: Nate Dailey Cc: Jes Sorensen Fixes: cd5ff9a16f08 ("md/raid1: Handle write errors by updating badblock log.") Signed-off-by: NeilBrown Signed-off-by: Zefan Li commit f6b1d7cb981875eb995795eb9987be6d7825099e Author: NeilBrown Date: Fri Aug 14 11:11:10 2015 +1000 md/raid1: ensure device failure recorded before write request returns. commit 55ce74d4bfe1b9444436264c637f39a152d1e5ac upstream. When a write to one of the legs of a RAID1 fails, the failure is recorded in the metadata of the other leg(s) so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Similarly when we record a bad-block in response to a write failure, we must not let the write complete until the bad-block update is safe. Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that experienced an error on a new queue which is only processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown Signed-off-by: Zefan Li commit 0570dab32ddd0f0c7db5ccd025a1597fffb9e464 Author: NeilBrown Date: Sat Oct 24 16:23:48 2015 +1100 md/raid10: don't clear bitmap bit when bad-block-list write fails. commit c340702ca26a628832fade4f133d8160a55c29cc upstream. When a write fails and a bad-block-list is present, we can update the bad-block-list instead of writing the data. If this succeeds then it is OK clear the relevant bitmap-bit as no further 'sync' of the block is needed. However if writing the bad-block-list fails then we need to treat the write as failed and particularly must not clear the bitmap bit. Otherwise the device can be re-added (after any hardware connection issues are resolved) and because the relevant bit in the bitmap is clear, that block will not be resynced. This leads to data corruption. We already delay the final bio_endio() on the write until the bad-block-list is written so that when the write returns: either that data is safe, the bad-block record is safe, or the fact that the device is faulty is safe. However we *don't* delay the clearing of the bitmap, so the bitmap bit can be recorded as cleared before we know if the bad-block-list was written safely. So: delay that until the write really is safe. i.e. move the call to close_write() until just before calling bio_endio(), and recheck the 'is array degraded' status before making that call. This bug goes back to v3.1 when bad-block-lists were introduced, though it only affects arrays created with mdadm-3.3 or later as only those have bad-block lists. Backports will require at least Commit: 95af587e95aa ("md/raid10: ensure device failure recorded before write request returns.") as well. I'll send that to 'stable' separately. Note that of the two tests of R10BIO_WriteError that this patch adds, the first is certain to fail and the second is certain to succeed. However doing it this way makes the patch more obviously correct. I will tidy the code up in a future merge window. Reported-by: Nate Dailey Fixes: bd870a16c594 ("md/raid10: Handle write errors by updating badblock log.") Signed-off-by: NeilBrown Signed-off-by: Zefan Li commit 43bf02bac43bd2697f9e20c0feee434ba6ebe9db Author: NeilBrown Date: Fri Aug 14 11:26:17 2015 +1000 md/raid10: ensure device failure recorded before write request returns. commit 95af587e95aacb9cfda4a9641069a5244a540dc8 upstream. When a write to one of the legs of a RAID10 fails, the failure is recorded in the metadata of the other legs so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that experienced an error on a new queue which is only processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown [lizf: Backported to 3.4: adjust context] Signed-off-by: Zefan Li commit 894f53c9eceea3dd9af3197e1419538ce05b3ffa Author: Vasant Hegde Date: Fri Oct 16 15:53:29 2015 +0530 powerpc/rtas: Validate rtas.entry before calling enter_rtas() commit 8832317f662c06f5c06e638f57bfe89a71c9b266 upstream. Currently we do not validate rtas.entry before calling enter_rtas(). This leads to a kernel oops when user space calls rtas system call on a powernv platform (see below). This patch adds code to validate rtas.entry before making enter_rtas() call. Oops: Exception in kernel mode, sig: 4 [#1] SMP NR_CPUS=1024 NUMA PowerNV task: c000000004294b80 ti: c0000007e1a78000 task.ti: c0000007e1a78000 NIP: 0000000000000000 LR: 0000000000009c14 CTR: c000000000423140 REGS: c0000007e1a7b920 TRAP: 0e40 Not tainted (3.18.17-340.el7_1.pkvm3_1_0.2400.1.ppc64le) MSR: 1000000000081000 CR: 00000000 XER: 00000000 CFAR: c000000000009c0c SOFTE: 0 NIP [0000000000000000] (null) LR [0000000000009c14] 0x9c14 Call Trace: [c0000007e1a7bba0] [c00000000041a7f4] avc_has_perm_noaudit+0x54/0x110 (unreliable) [c0000007e1a7bd80] [c00000000002ddc0] ppc_rtas+0x150/0x2d0 [c0000007e1a7be30] [c000000000009358] syscall_exit+0x0/0x98 Fixes: 55190f88789a ("powerpc: Add skeleton PowerNV platform") Reported-by: NAGESWARA R. SASTRY Signed-off-by: Vasant Hegde [mpe: Reword change log, trim oops, and add stable + fixes] Signed-off-by: Michael Ellerman Signed-off-by: Zefan Li commit 7abd07f2a328030ae7d68c19f0facf59121bc647 Author: Doron Tsur Date: Sun Oct 11 15:58:17 2015 +0300 IB/cm: Fix rb-tree duplicate free and use-after-free commit 0ca81a2840f77855bbad1b9f172c545c4dc9e6a4 upstream. ib_send_cm_sidr_rep could sometimes erase the node from the sidr (depending on errors in the process). Since ib_send_cm_sidr_rep is called both from cm_sidr_req_handler and cm_destroy_id, cm_id_priv could be either erased from the rb_tree twice or not erased at all. Fixing that by making sure it's erased only once before freeing cm_id_priv. Fixes: a977049dacde ('[PATCH] IB: Add the kernel CM implementation') Signed-off-by: Doron Tsur Signed-off-by: Matan Barak Signed-off-by: Doug Ledford Signed-off-by: Zefan Li commit a12321d34f35fb8eacb3f39d1d53eb8d6e52fa8a Author: Peter Zijlstra Date: Tue Sep 29 14:45:09 2015 +0200 sched/core: Fix TASK_DEAD race in finish_task_switch() commit 95913d97914f44db2b81271c2e2ebd4d2ac2df83 upstream. So the problem this patch is trying to address is as follows: CPU0 CPU1 context_switch(A, B) ttwu(A) LOCK A->pi_lock A->on_cpu == 0 finish_task_switch(A) prev_state = A->state <-. WMB | A->on_cpu = 0; | UNLOCK rq0->lock | | context_switch(C, A) `-- A->state = TASK_DEAD prev_state == TASK_DEAD put_task_struct(A) context_switch(A, C) finish_task_switch(A) A->state == TASK_DEAD put_task_struct(A) The argument being that the WMB will allow the load of A->state on CPU0 to cross over and observe CPU1's store of A->state, which will then result in a double-drop and use-after-free. Now the comment states (and this was true once upon a long time ago) that we need to observe A->state while holding rq->lock because that will order us against the wakeup; however the wakeup will not in fact acquire (that) rq->lock; it takes A->pi_lock these days. We can obviously fix this by upgrading the WMB to an MB, but that is expensive, so we'd rather avoid that. The alternative this patch takes is: smp_store_release(&A->on_cpu, 0), which avoids the MB on some archs, but not important ones like ARM. Reported-by: Oleg Nesterov Signed-off-by: Peter Zijlstra (Intel) Acked-by: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-kernel@vger.kernel.org Cc: manfred@colorfullife.com Cc: will.deacon@arm.com Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()") Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar [lizf: Backported to 3.4: use smb_mb() instead of smp_store_release(), which is not defined in 3.4.y] Signed-off-by: Zefan Li commit c2acc6aa8577494fe6e8830922b4cabe956fdd20 Author: Johannes Berg Date: Tue Sep 15 14:36:09 2015 +0200 iwlwifi: dvm: fix D3 firmware PN programming commit 5bd166872d8f99f156fac191299d24f828bb2348 upstream. The code to send the RX PN data (for each TID) to the firmware has a devastating bug: it overwrites the data for TID 0 with all the TID data, leaving the remaining TIDs zeroed. This will allow replays to actually be accepted by the firmware, which could allow waking up the system. Signed-off-by: Johannes Berg Signed-off-by: Luca Coelho [lizf: Backported to 3.4: adjust filename] Signed-off-by: Zefan Li commit 55555bf1c6f353b84a23e681f3b487152f730926 Author: NeilBrown Date: Thu Sep 24 15:47:47 2015 +1000 md/raid0: apply base queue limits *before* disk_stack_limits commit 66eefe5de11db1e0d8f2edc3880d50e7c36a9d43 upstream. Calling e.g. blk_queue_max_hw_sectors() after calls to disk_stack_limits() discards the settings determined by disk_stack_limits(). So we need to make those calls first. Fixes: 199dc6ed5179 ("md/raid0: update queue parameter in a safer location.") Reported-by: Jes Sorensen Signed-off-by: NeilBrown Signed-off-by: Zefan Li commit 65cee714454e3210a673ba4b33f22fac3af77021 Author: James Hogan Date: Fri Mar 27 08:33:43 2015 +0000 MIPS: dma-default: Fix 32-bit fall back to GFP_DMA commit 53960059d56ecef67d4ddd546731623641a3d2d1 upstream. If there is a DMA zone (usually 24bit = 16MB I believe), but no DMA32 zone, as is the case for some 32-bit kernels, then massage_gfp_flags() will cause DMA memory allocated for devices with a 32..63-bit coherent_dma_mask to fall back to using __GFP_DMA, even though there may only be 32-bits of physical address available anyway. Correct that case to compare against a mask the size of phys_addr_t instead of always using a 64-bit mask. Signed-off-by: James Hogan Fixes: a2e715a86c6d ("MIPS: DMA: Fix computation of DMA flags from device's coherent_dma_mask.") Cc: Ralf Baechle Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/9610/ Signed-off-by: Ralf Baechle Signed-off-by: Zefan Li commit f9c715eeb2a73e61005f0f12caed9cf3a0e9dacd Author: Robert Jarzmik Date: Tue Sep 15 20:51:31 2015 +0200 ASoC: fix broken pxa SoC support commit 3c8f7710c1c44fb650bc29b6ef78ed8b60cfaa28 upstream. The previous fix of pxa library support, which was introduced to fix the library dependency, broke the previous SoC behavior, where a machine code binding pxa2xx-ac97 with a coded relied on : - sound/soc/pxa/pxa2xx-ac97.c - sound/soc/codecs/XXX.c For example, the mioa701_wm9713.c machine code is currently broken. The "select ARM" statement wrongly selects the soc/arm/pxa2xx-ac97 for compilation, as per an unfortunate fate SND_PXA2XX_AC97 is both declared in sound/arm/Kconfig and sound/soc/pxa/Kconfig. Fix this by ensuring that SND_PXA2XX_SOC correctly triggers the correct pxa2xx-ac97 compilation. Fixes: 846172dfe33c ("ASoC: fix SND_PXA2XX_LIB Kconfig warning") Signed-off-by: Robert Jarzmik Signed-off-by: Mark Brown Signed-off-by: Zefan Li commit 244e0dbc02ee3af070bd100a840bd112b769e35a Author: Herbert Xu Date: Fri Sep 4 13:21:06 2015 +0800 ipv6: Fix IPsec pre-encap fragmentation check commit 93efac3f2e03321129de67a3c0ba53048bb53e31 upstream. The IPv6 IPsec pre-encap path performs fragmentation for tunnel-mode packets. That is, we perform fragmentation pre-encap rather than post-encap. A check was added later to ensure that proper MTU information is passed back for locally generated traffic. Unfortunately this check was performed on all IPsec packets, including transport-mode packets. What's more, the check failed to take GSO into account. The end result is that transport-mode GSO packets get dropped at the check. This patch fixes it by moving the tunnel mode check forward as well as adding the GSO check. Fixes: dd767856a36e ("xfrm6: Don't call icmpv6_send on local error") Signed-off-by: Herbert Xu Signed-off-by: Steffen Klassert [lizf: Backported to 3.4: - adjust context - s/ignore_df/local_df] Signed-off-by: Zefan Li commit d8776fffaeff0ae3ecf78b33515f1a8624f78ace Author: Peter Zijlstra Date: Thu Aug 20 10:34:59 2015 +0930 module: Fix locking in symbol_put_addr() commit 275d7d44d802ef271a42dc87ac091a495ba72fc5 upstream. Poma (on the way to another bug) reported an assertion triggering: [] module_assert_mutex_or_preempt+0x49/0x90 [] __module_address+0x32/0x150 [] __module_text_address+0x16/0x70 [] symbol_put_addr+0x29/0x40 [] dvb_frontend_detach+0x7d/0x90 [dvb_core] Laura Abbott produced a patch which lead us to inspect symbol_put_addr(). This function has a comment claiming it doesn't need to disable preemption around the module lookup because it holds a reference to the module it wants to find, which therefore cannot go away. This is wrong (and a false optimization too, preempt_disable() is really rather cheap, and I doubt any of this is on uber critical paths, otherwise it would've retained a pointer to the actual module anyway and avoided the second lookup). While its true that the module cannot go away while we hold a reference on it, the data structure we do the lookup in very much _CAN_ change while we do the lookup. Therefore fix the comment and add the required preempt_disable(). Reported-by: poma Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Rusty Russell Fixes: a6e6abd575fc ("module: remove module_text_address()") Signed-off-by: Zefan Li commit 2bba66d6ae0f8b4c6fd7b7010437d29c0ecfff0a Author: David Woodhouse Date: Wed Sep 16 14:10:03 2015 +0100 x86/platform: Fix Geode LX timekeeping in the generic x86 build commit 03da3ff1cfcd7774c8780d2547ba0d995f7dc03d upstream. In 2007, commit 07190a08eef36 ("Mark TSC on GeodeLX reliable") bypassed verification of the TSC on Geode LX. However, this code (now in the check_system_tsc_reliable() function in arch/x86/kernel/tsc.c) was only present if CONFIG_MGEODE_LX was set. OpenWRT has recently started building its generic Geode target for Geode GX, not LX, to include support for additional platforms. This broke the timekeeping on LX-based devices, because the TSC wasn't marked as reliable: https://dev.openwrt.org/ticket/20531 By adding a runtime check on is_geode_lx(), we can also include the fix if CONFIG_MGEODEGX1 or CONFIG_X86_GENERIC are set, thus fixing the problem. Signed-off-by: David Woodhouse Cc: Andres Salomon Cc: Linus Torvalds Cc: Marcelo Tosatti Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1442409003.131189.87.camel@infradead.org Signed-off-by: Ingo Molnar Signed-off-by: Zefan Li commit 31a526445157a00b43d5018b1fdc2b8aa6c84c78 Author: Russell King Date: Fri Sep 11 16:44:02 2015 +0100 ARM: fix Thumb2 signal handling when ARMv6 is enabled commit 9b55613f42e8d40d5c9ccb8970bde6af4764b2ab upstream. When a kernel is built covering ARMv6 to ARMv7, we omit to clear the IT state when entering a signal handler. This can cause the first few instructions to be conditionally executed depending on the parent context. In any case, the original test for >= ARMv7 is broken - ARMv6 can have Thumb-2 support as well, and an ARMv6T2 specific build would omit this code too. Relax the test back to ARMv6 or greater. This results in us always clearing the IT state bits in the PSR, even on CPUs where these bits are reserved. However, they're reserved for the IT state, so this should cause no harm. Fixes: d71e1352e240 ("Clear the IT state when invoking a Thumb-2 signal handler") Acked-by: Tony Lindgren Tested-by: H. Nikolaus Schaller Tested-by: Grazvydas Ignotas Signed-off-by: Russell King Signed-off-by: Zefan Li commit b2e30785526c7e1b7bba74d5c50784cfcfe0bc21 Author: T.J. Purtell Date: Wed Nov 6 18:38:05 2013 +0100 ARM: 7880/1: Clear the IT state independent of the Thumb-2 mode commit 6ecf830e5029598732e04067e325d946097519cb upstream. The ARM architecture reference specifies that the IT state bits in the PSR must be all zeros in ARM mode or behavior is unspecified. On the Qualcomm Snapdragon S4/Krait architecture CPUs the processor continues to consider the IT state bits while in ARM mode. This makes it so that some instructions are skipped by the CPU. Signed-off-by: T.J. Purtell [rmk+kernel@arm.linux.org.uk: fixed whitespace formatting in patch] Signed-off-by: Russell King Signed-off-by: Zefan Li commit 98e57bab3f696bf9642899b57ed733d122f3ed4e Author: Arnaldo Carvalho de Melo Date: Fri Sep 11 12:36:12 2015 -0300 perf header: Fixup reading of HEADER_NRCPUS feature commit caa470475d9b59eeff093ae650800d34612c4379 upstream. The original patch introducing this header wrote the number of CPUs available and online in one order and then swapped those values when reading, fix it. Before: # perf record usleep 1 # perf report --header-only | grep 'nrcpus $online\|avail$' # nrcpus online : 4 # nrcpus avail : 4 # echo 0 > /sys/devices/system/cpu/cpu2/online # perf record usleep 1 # perf report --header-only | grep 'nrcpus $online\|avail$' # nrcpus online : 4 # nrcpus avail : 3 # echo 0 > /sys/devices/system/cpu/cpu1/online # perf record usleep 1 # perf report --header-only | grep 'nrcpus $online\|avail$' # nrcpus online : 4 # nrcpus avail : 2 After the fix, bringing back the CPUs online: # perf report --header-only | grep 'nrcpus $online\|avail$' # nrcpus online : 2 # nrcpus avail : 4 # echo 1 > /sys/devices/system/cpu/cpu2/online # perf record usleep 1 # perf report --header-only | grep 'nrcpus $online\|avail$' # nrcpus online : 3 # nrcpus avail : 4 # echo 1 > /sys/devices/system/cpu/cpu1/online # perf record usleep 1 # perf report --header-only | grep 'nrcpus $online\|avail$' # nrcpus online : 4 # nrcpus avail : 4 Acked-by: Namhyung Kim Cc: Adrian Hunter Cc: Borislav Petkov Cc: David Ahern Cc: Frederic Weisbecker Cc: Jiri Olsa Cc: Kan Liang Cc: Stephane Eranian Cc: Wang Nan Fixes: fbe96f29ce4b ("perf tools: Make perf.data more self-descriptive (v8)") Link: http://lkml.kernel.org/r/20150911153323.GP23511@kernel.org Signed-off-by: Arnaldo Carvalho de Melo [lizf: Backported to 3.4: fix it by saving values in an array and then print it in reverse order] Signed-off-by: Zefan Li commit b834fc16378b11a3d753c6299dd2e30a9a52f550 Author: Paul Mackerras Date: Thu Sep 10 14:36:21 2015 +1000 powerpc/MSI: Fix race condition in tearing down MSI interrupts commit e297c939b745e420ef0b9dc989cb87bda617b399 upstream. This fixes a race which can result in the same virtual IRQ number being assigned to two different MSI interrupts. The most visible consequence of that is usually a warning and stack trace from the sysfs code about an attempt to create a duplicate entry in sysfs. The race happens when one CPU (say CPU 0) is disposing of an MSI while another CPU (say CPU 1) is setting up an MSI. CPU 0 calls (for example) pnv_teardown_msi_irqs(), which calls msi_bitmap_free_hwirqs() to indicate that the MSI (i.e. its hardware IRQ number) is no longer in use. Then, before CPU 0 gets to calling irq_dispose_mapping() to free up the virtal IRQ number, CPU 1 comes in and calls msi_bitmap_alloc_hwirqs() to allocate an MSI, and gets the same hardware IRQ number that CPU 0 just freed. CPU 1 then calls irq_create_mapping() to get a virtual IRQ number, which sees that there is currently a mapping for that hardware IRQ number and returns the corresponding virtual IRQ number (which is the same virtual IRQ number that CPU 0 was using). CPU 0 then calls irq_dispose_mapping() and frees that virtual IRQ number. Now, if another CPU comes along and calls irq_create_mapping(), it is likely to get the virtual IRQ number that was just freed, resulting in the same virtual IRQ number apparently being used for two different hardware interrupts. To fix this race, we just move the call to msi_bitmap_free_hwirqs() to after the call to irq_dispose_mapping(). Since virq_to_hw() doesn't work for the virtual IRQ number after irq_dispose_mapping() has been called, we need to call it before irq_dispose_mapping() and remember the result for the msi_bitmap_free_hwirqs() call. The pattern of calling msi_bitmap_free_hwirqs() before irq_dispose_mapping() appears in 5 places under arch/powerpc, and appears to have originated in commit 05af7bd2d75e ("[POWERPC] MPIC U3/U4 MSI backend") from 2007. Fixes: 05af7bd2d75e ("[POWERPC] MPIC U3/U4 MSI backend") Reported-by: Alexey Kardashevskiy Signed-off-by: Paul Mackerras Signed-off-by: Michael Ellerman [bwh: Backported to 3.2: - powernv uses a private functions instead of msi_bitmap_free_hwirqs() - Adjust filename, context] Signed-off-by: Ben Hutchings Signed-off-by: Zefan Li commit 59463bb2d1c003c1b6618c74ad319996f732d82d Author: Ard Biesheuvel Date: Thu Sep 3 13:24:40 2015 +0100 ARM: 8429/1: disable GCC SRA optimization commit a077224fd35b2f7fbc93f14cf67074fc792fbac2 upstream. While working on the 32-bit ARM port of UEFI, I noticed a strange corruption in the kernel log. The following snprintf() statement (in drivers/firmware/efi/efi.c:efi_md_typeattr_format()) snprintf(pos, size, "|%3s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]", was producing the following output in the log: | | | | | |WB|WT|WC|UC] | | | | | |WB|WT|WC|UC] | | | | | |WB|WT|WC|UC] |RUN| | | | |WB|WT|WC|UC]* |RUN| | | | |WB|WT|WC|UC]* | | | | | |WB|WT|WC|UC] |RUN| | | | |WB|WT|WC|UC]* | | | | | |WB|WT|WC|UC] |RUN| | | | | | | |UC] |RUN| | | | | | | |UC] As it turns out, this is caused by incorrect code being emitted for the string() function in lib/vsprintf.c. The following code if (!(spec.flags & LEFT)) { while (len < spec.field_width--) { if (buf < end) *buf = ' '; ++buf; } } for (i = 0; i < len; ++i) { if (buf < end) *buf = *s; ++buf; ++s; } while (len < spec.field_width--) { if (buf < end) *buf = ' '; ++buf; } when called with len == 0, triggers an issue in the GCC SRA optimization pass (Scalar Replacement of Aggregates), which handles promotion of signed struct members incorrectly. This is a known but as yet unresolved issue. (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65932). In this particular case, it is causing the second while loop to be executed erroneously a single time, causing the additional space characters to be printed. So disable the optimization by passing -fno-ipa-sra. Acked-by: Nicolas Pitre Signed-off-by: Ard Biesheuvel Signed-off-by: Russell King Signed-off-by: Zefan Li commit 3cd0ee55312d9e53ac6966d071c48e218c3a2a53 Author: Christoph Hellwig Date: Wed Sep 9 18:04:18 2015 +0200 scsi_dh: fix randconfig build error commit 294ab783ad98066b87296db1311c7ba2a60206a5 upstream. It looks like the Kconfig check that was meant to fix this (commit fe9233fb6914a0eb20166c967e3020f7f0fba2c9 [SCSI] scsi_dh: fix kconfig related build errors) was actually reversed, but no-one noticed until the new set of patches which separated DM and SCSI_DH). Fixes: fe9233fb6914a0eb20166c967e3020f7f0fba2c9 Signed-off-by: Christoph Hellwig Tested-by: Mike Snitzer Signed-off-by: James Bottomley Signed-off-by: Zefan Li commit 409172802372d568ab1d5460004e109f01abaa39 Author: Hin-Tak Leung Date: Wed Sep 9 15:38:07 2015 -0700 hfs: fix B-tree corruption after insertion at position 0 commit b4cc0efea4f0bfa2477c56af406cfcf3d3e58680 upstream. Fix B-tree corruption when a new record is inserted at position 0 in the node in hfs_brec_insert(). This is an identical change to the corresponding hfs b-tree code to Sergei Antonov's "hfsplus: fix B-tree corruption after insertion at position 0", to keep similar code paths in the hfs and hfsplus drivers in sync, where appropriate. Signed-off-by: Hin-Tak Leung Cc: Sergei Antonov Cc: Joe Perches Reviewed-by: Vyacheslav Dubeyko Cc: Anton Altaparmakov Cc: Al Viro Cc: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Zefan Li commit 6babbcb0fe14fc13033667281d1bb782f84cbd4d Author: Hin-Tak Leung Date: Wed Sep 9 15:38:04 2015 -0700 hfs,hfsplus: cache pages correctly between bnode_create and bnode_free commit 7cb74be6fd827e314f81df3c5889b87e4c87c569 upstream. Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and hfs_bnode_find() for finding or creating pages corresponding to an inode) are immediately kmap()'ed and used (both read and write) and kunmap()'ed, and should not be page_cache_release()'ed until hfs_bnode_free(). This patch fixes a problem I first saw in July 2012: merely running "du" on a large hfsplus-mounted directory a few times on a reasonably loaded system would get the hfsplus driver all confused and complaining about B-tree inconsistencies, and generates a "BUG: Bad page state". Most recently, I can generate this problem on up-to-date Fedora 22 with shipped kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller mounts) and "du /mnt" simultaneously on two windows, where /mnt is a lightly-used QEMU VM image of the full Mac OS X 10.9: $ df -i / /home /mnt Filesystem Inodes IUsed IFree IUse% Mounted on /dev/mapper/fedora-root 3276800 551665 2725135 17% / /dev/mapper/fedora-home 52879360 716221 52163139 2% /home /dev/nbd0p2 4294967295 1387818 4293579477 1% /mnt After applying the patch, I was able to run "du /" (60+ times) and "du /mnt" (150+ times) continuously and simultaneously for 6+ hours. There are many reports of the hfsplus driver getting confused under load and generating "BUG: Bad page state" or other similar issues over the years. [1] The unpatched code [2] has always been wrong since it entered the kernel tree. The only reason why it gets away with it is that the kmap/memcpy/kunmap follow very quickly after the page_cache_release() so the kernel has not had a chance to reuse the memory for something else, most of the time. The current RW driver appears to have followed the design and development of the earlier read-only hfsplus driver [3], where-by version 0.1 (Dec 2001) had a B-tree node-centric approach to read_cache_page()/page_cache_release() per bnode_get()/bnode_put(), migrating towards version 0.2 (June 2002) of caching and releasing pages per inode extents. When the current RW code first entered the kernel [2] in 2005, there was an REF_PAGES conditional (and "//" commented out code) to switch between B-node centric paging to inode-centric paging. There was a mistake with the direction of one of the REF_PAGES conditionals in __hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were removed, but a page_cache_release() was mistakenly left in (propagating the "REF_PAGES <-> !REF_PAGE" mistake), and the commented-out page_cache_release() in bnode_release() (which should be spanned by !REF_PAGES) was never enabled. References: [1]: Michael Fox, Apr 2013 http://www.spinics.net/lists/linux-fsdevel/msg63807.html ("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'") Sasha Levin, Feb 2015 http://lkml.org/lkml/2015/2/20/85 ("use after free") https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887 https://bugzilla.kernel.org/show_bug.cgi?id=42342 https://bugzilla.kernel.org/show_bug.cgi?id=63841 https://bugzilla.kernel.org/show_bug.cgi?id=78761 [2]: http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\ fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db commit d1081202f1d0ee35ab0beb490da4b65d4bc763db Author: Andrew Morton Date: Wed Feb 25 16:17:36 2004 -0800 [PATCH] HFS rewrite http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\ fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd commit 91556682e0bf004d98a529bf829d339abb98bbbd Author: Andrew Morton Date: Wed Feb 25 16:17:48 2004 -0800 [PATCH] HFS+ support [3]: http://sourceforge.net/projects/linux-hfsplus/ http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/ http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/ http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\ fs/hfsplus/bnode.c?r1=1.4&r2=1.5 Date: Thu Jun 6 09:45:14 2002 +0000 Use buffer cache instead of page cache in bnode.c. Cache inode extents. [4]: http://git.kernel.org/cgit/linux/kernel/git/\ stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d commit a5e3985fa014029eb6795664c704953720cc7f7d Author: Roman Zippel Date: Tue Sep 6 15:18:47 2005 -0700 [PATCH] hfs: remove debug code Signed-off-by: Hin-Tak Leung Signed-off-by: Sergei Antonov Reviewed-by: Anton Altaparmakov Reported-by: Sasha Levin Cc: Al Viro Cc: Christoph Hellwig Cc: Vyacheslav Dubeyko Cc: Sougata Santra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Zefan Li commit b0cce01be5f58ed399fdfc8e1b0fbcd827a35aef Author: Kees Cook Date: Fri Sep 4 15:44:57 2015 -0700 fs: create and use seq_show_option for escaping commit a068acf2ee77693e0bf39d6e07139ba704f461c3 upstream. Many file systems that implement the show_options hook fail to correctly escape their output which could lead to unescaped characters (e.g. new lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This could lead to confusion, spoofed entries (resulting in things like systemd issuing false d-bus "mount" notifications), and who knows what else. This looks like it would only be the root user stepping on themselves, but it's possible weird things could happen in containers or in other situations with delegated mount privileges. Here's an example using overlay with setuid fusermount trusting the contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use of "sudo" is something more sneaky: $ BASE="ovl" $ MNT="$BASE/mnt" $ LOW="$BASE/lower" $ UP="$BASE/upper" $ WORK="$BASE/work/ 0 0 none /proc fuse.pwn user_id=1000" $ mkdir -p "$LOW" "$UP" "$WORK" $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt $ cat /proc/mounts none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0 none /proc fuse.pwn user_id=1000 0 0 $ fusermount -u /proc $ cat /proc/mounts cat: /proc/mounts: No such file or directory This fixes the problem by adding new seq_show_option and seq_show_option_n helpers, and updating the vulnerable show_option handlers to use them as needed. Some, like SELinux, need to be open coded due to unusual existing escape mechanisms. [akpm@linux-foundation.org: add lost chunk, per Kees] [keescook@chromium.org: seq_show_option should be using const parameters] Signed-off-by: Kees Cook Acked-by: Serge Hallyn Acked-by: Jan Kara Acked-by: Paul Moore Cc: J. R. Okajima Signed-off-by: Kees Cook Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds [lizf: Backported to 3.4: - adjust context - one more place in ceph needs to be changed - drop changes to overlayfs - drop showing vers in cifs] Signed-off-by: Zefan Li commit 7646c507f1ad8bd14a3196f84b0eabb229dafb75 Author: Andrey Ryabinin