commit c52b9710c83d3b8ab63bb217cc7c8b61e13f12cd
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Wed Apr 17 11:15:18 2024 +0200

    Linux 5.15.156
    
    Link: https://lore.kernel.org/r/20240415141942.235939111@linuxfoundation.org
    Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
    Tested-by: Kelsey Steele <kelseysteele@linux.microsoft.com>
    Tested-by: Mark Brown <broonie@kernel.org>
    Tested-by: Ron Economos <re@w6rz.net>
    Tested-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
    Tested-by: Jon Hunter <jonathanh@nvidia.com>
    Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 88168b947c349169d516b10ee307d3fbcaebaf84
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Tue Apr 2 18:50:03 2024 +0300

    drm/i915/cdclk: Fix CDCLK programming order when pipes are active
    
    commit 7b1f6b5aaec0f849e19c3e99d4eea75876853cdd upstream.
    
    Currently we always reprogram CDCLK from the
    intel_set_cdclk_pre_plane_update() when using squash/crawl.
    The code only works correctly for the cd2x update or full
    modeset cases, and it was simply never updated to deal with
    squash/crawl.
    
    If the CDCLK frequency is increasing we must reprogram it
    before we do anything else that might depend on the new
    higher frequency, and conversely we must not decrease
    the frequency until everything that might still depend
    on the old higher frequency has been dealt with.
    
    Since cdclk_state->pipe is only relevant when doing a cd2x
    update we can't use it to determine the correct sequence
    during squash/crawl. To that end introduce cdclk_state->disable_pipes
    which simply indicates that we must perform the update
    while the pipes are disable (ie. during
    intel_set_cdclk_pre_plane_update()). Otherwise we use the
    same old vs. new CDCLK frequency comparsiong as for cd2x
    updates.
    
    The only remaining problem case is when the voltage_level
    needs to increase due to a DDI port, but the CDCLK frequency
    is decreasing (and not all pipes are being disabled). The
    current approach will not bump the voltage level up until
    after the port has already been enabled, which is too late.
    But we'll take care of that case separately.
    
    v2: Don't break the "must disable pipes case"
    v3: Keep the on stack 'pipe' for future use
    
    Cc: stable@vger.kernel.org
    Fixes: d62686ba3b54 ("drm/i915/adl_p: CDCLK crawl support for ADL")
    Reviewed-by: Uma Shankar <uma.shankar@intel.com>
    Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
    Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20240402155016.13733-2-ville.syrjala@linux.intel.com
    (cherry picked from commit 3aecee90ac12a351905f12dda7643d5b0676d6ca)
    Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit b2bf58581baa048a7a1902273aedc372083f8131
Author: Josh Poimboeuf <jpoimboe@kernel.org>
Date:   Wed Apr 10 22:40:51 2024 -0700

    x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
    
    commit 4f511739c54b549061993b53fc0380f48dfca23b upstream.
    
    For consistency with the other CONFIG_MITIGATION_* options, replace the
    CONFIG_SPECTRE_BHI_{ON,OFF} options with a single
    CONFIG_MITIGATION_SPECTRE_BHI option.
    
    [ mingo: Fix ]
    
    Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Nikolay Borisov <nik.borisov@suse.com>
    Link: https://lore.kernel.org/r/3833812ea63e7fdbe36bf8b932e63f70d18e2a2a.1712813475.git.jpoimboe@kernel.org
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit d315f5eba585183b3b7c452ce95f58da63a5910c
Author: Josh Poimboeuf <jpoimboe@kernel.org>
Date:   Wed Apr 10 22:40:50 2024 -0700

    x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto
    
    commit 36d4fe147c870f6d3f6602befd7ef44393a1c87a upstream.
    
    Unlike most other mitigations' "auto" options, spectre_bhi=auto only
    mitigates newer systems, which is confusing and not particularly useful.
    
    Remove it.
    
    Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Link: https://lore.kernel.org/r/412e9dc87971b622bbbaf64740ebc1f140bff343.1712813475.git.jpoimboe@kernel.org
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit ebba2270ab744bd9eb8927ff1624f64cb3fe4ca4
Author: Josh Poimboeuf <jpoimboe@kernel.org>
Date:   Wed Apr 10 22:40:48 2024 -0700

    x86/bugs: Clarify that syscall hardening isn't a BHI mitigation
    
    commit 5f882f3b0a8bf0788d5a0ee44b1191de5319bb8a upstream.
    
    While syscall hardening helps prevent some BHI attacks, there's still
    other low-hanging fruit remaining.  Don't classify it as a mitigation
    and make it clear that the system may still be vulnerable if it doesn't
    have a HW or SW mitigation enabled.
    
    Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob")
    Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/r/b5951dae3fdee7f1520d5136a27be3bdfe95f88b.1712813475.git.jpoimboe@kernel.org
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit e47d1cbde75936be6538c05737c88292b9d9c31d
Author: Josh Poimboeuf <jpoimboe@kernel.org>
Date:   Wed Apr 10 22:40:47 2024 -0700

    x86/bugs: Fix BHI handling of RRSBA
    
    commit 1cea8a280dfd1016148a3820676f2f03e3f5b898 upstream.
    
    The ARCH_CAP_RRSBA check isn't correct: RRSBA may have already been
    disabled by the Spectre v2 mitigation (or can otherwise be disabled by
    the BHI mitigation itself if needed).  In that case retpolines are fine.
    
    Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob")
    Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/r/6f56f13da34a0834b69163467449be7f58f253dc.1712813475.git.jpoimboe@kernel.org
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit b4f2718f3d9bcb91f00f3194b855e08e85f2d90f
Author: Ingo Molnar <mingo@kernel.org>
Date:   Thu Apr 11 09:25:36 2024 +0200

    x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'
    
    commit d0485730d2189ffe5d986d4e9e191f1e4d5ffd24 upstream.
    
    So we are using the 'ia32_cap' value in a number of places,
    which got its name from MSR_IA32_ARCH_CAPABILITIES MSR register.
    
    But there's very little 'IA32' about it - this isn't 32-bit only
    code, nor does it originate from there, it's just a historic
    quirk that many Intel MSR names are prefixed with IA32_.
    
    This is already clear from the helper method around the MSR:
    x86_read_arch_cap_msr(), which doesn't have the IA32 prefix.
    
    So rename 'ia32_cap' to 'x86_arch_cap_msr' to be consistent with
    its role and with the naming of the helper function.
    
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Josh Poimboeuf <jpoimboe@redhat.com>
    Cc: Nikolay Borisov <nik.borisov@suse.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/r/9592a18a814368e75f8f4b9d74d3883aa4fd1eaf.1712813475.git.jpoimboe@kernel.org
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit c768db14db8e4139552cd1be244a3a192bb87b9b
Author: Josh Poimboeuf <jpoimboe@kernel.org>
Date:   Wed Apr 10 22:40:46 2024 -0700

    x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES
    
    commit cb2db5bb04d7f778fbc1a1ea2507aab436f1bff3 upstream.
    
    There's no need to keep reading MSR_IA32_ARCH_CAPABILITIES over and
    over.  It's even read in the BHI sysfs function which is a big no-no.
    Just read it once and cache it.
    
    Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob")
    Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/r/9592a18a814368e75f8f4b9d74d3883aa4fd1eaf.1712813475.git.jpoimboe@kernel.org
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 145d9930a151e513512919c6db539232a4ce19e5
Author: Josh Poimboeuf <jpoimboe@kernel.org>
Date:   Wed Apr 10 22:40:45 2024 -0700

    x86/bugs: Fix BHI documentation
    
    commit dfe648903f42296866d79f10d03f8c85c9dfba30 upstream.
    
    Fix up some inaccuracies in the BHI documentation.
    
    Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob")
    Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/r/8c84f7451bfe0dd08543c6082a383f390d4aa7e2.1712813475.git.jpoimboe@kernel.org
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 2c761457ef18361cbaceeef28364a2913f2c5dab
Author: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Date:   Tue Apr 9 16:08:05 2024 -0700

    x86/bugs: Fix return type of spectre_bhi_state()
    
    commit 04f4230e2f86a4e961ea5466eda3db8c1762004d upstream.
    
    The definition of spectre_bhi_state() incorrectly returns a const char
    * const. This causes the a compiler warning when building with W=1:
    
     warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
     2812 | static const char * const spectre_bhi_state(void)
    
    Remove the const qualifier from the pointer.
    
    Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob")
    Reported-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Link: https://lore.kernel.org/r/20240409230806.1545822-1-daniel.sneddon@linux.intel.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit c6fd0e4f006968b090d93699af1284edcdd95afe
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Mon Apr 8 09:46:01 2024 +0200

    irqflags: Explicitly ignore lockdep_hrtimer_exit() argument
    
    commit c1d11fc2c8320871b40730991071dd0a0b405bc8 upstream.
    
    When building with 'make W=1' but CONFIG_TRACE_IRQFLAGS=n, the
    unused argument to lockdep_hrtimer_exit() causes a warning:
    
    kernel/time/hrtimer.c:1655:14: error: variable 'expires_in_hardirq' set but not used [-Werror=unused-but-set-variable]
    
    This is intentional behavior, so add a cast to void to shut up the warning.
    
    Fixes: 73d20564e0dc ("hrtimer: Don't dereference the hrtimer pointer after the callback")
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20240408074609.3170807-1-arnd@kernel.org
    Closes: https://lore.kernel.org/oe-kbuild-all/202311191229.55QXHVc6-lkp@intel.com/
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 69843741d64ff1d783d5708d803f27487b4795e7
Author: Adam Dunlap <acdunlap@google.com>
Date:   Mon Mar 18 16:09:27 2024 -0700

    x86/apic: Force native_apic_mem_read() to use the MOV instruction
    
    commit 5ce344beaca688f4cdea07045e0b8f03dc537e74 upstream.
    
    When done from a virtual machine, instructions that touch APIC memory
    must be emulated. By convention, MMIO accesses are typically performed
    via io.h helpers such as readl() or writeq() to simplify instruction
    emulation/decoding (ex: in KVM hosts and SEV guests) [0].
    
    Currently, native_apic_mem_read() does not follow this convention,
    allowing the compiler to emit instructions other than the MOV
    instruction generated by readl(). In particular, when the kernel is
    compiled with clang and run as a SEV-ES or SEV-SNP guest, the compiler
    would emit a TESTL instruction which is not supported by the SEV-ES
    emulator, causing a boot failure in that environment. It is likely the
    same problem would happen in a TDX guest as that uses the same
    instruction emulator as SEV-ES.
    
    To make sure all emulators can emulate APIC memory reads via MOV, use
    the readl() function in native_apic_mem_read(). It is expected that any
    emulator would support MOV in any addressing mode as it is the most
    generic and is what is usually emitted currently.
    
    The TESTL instruction is emitted when native_apic_mem_read() is inlined
    into apic_mem_wait_icr_idle(). The emulator comes from
    insn_decode_mmio() in arch/x86/lib/insn-eval.c. It's not worth it to
    extend insn_decode_mmio() to support more instructions since, in theory,
    the compiler could choose to output nearly any instruction for such
    reads which would bloat the emulator beyond reason.
    
      [0] https://lore.kernel.org/all/20220405232939.73860-12-kirill.shutemov@linux.intel.com/
    
      [ bp: Massage commit message, fix typos. ]
    
    Signed-off-by: Adam Dunlap <acdunlap@google.com>
    Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
    Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
    Tested-by: Kevin Loughlin <kevinloughlin@google.com>
    Cc: <stable@vger.kernel.org>
    Link: https://lore.kernel.org/r/20240318230927.2191933-1-acdunlap@google.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit c2981e32cf4674d0e9ecc76e7ff595c9d6188ade
Author: John Stultz <jstultz@google.com>
Date:   Wed Apr 10 16:26:30 2024 -0700

    selftests: timers: Fix abs() warning in posix_timers test
    
    commit ed366de8ec89d4f960d66c85fc37d9de22f7bf6d upstream.
    
    Building with clang results in the following warning:
    
      posix_timers.c:69:6: warning: absolute value function 'abs' given an
          argument of type 'long long' but has parameter of type 'int' which may
          cause truncation of value [-Wabsolute-value]
            if (abs(diff - DELAY * USECS_PER_SEC) > USECS_PER_SEC / 2) {
                ^
    So switch to using llabs() instead.
    
    Fixes: 0bc4b0cf1570 ("selftests: add basic posix timers selftests")
    Signed-off-by: John Stultz <jstultz@google.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20240410232637.4135564-3-jstultz@google.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 70688450dddaf91e12fd4fc625da3297025932c9
Author: Sean Christopherson <seanjc@google.com>
Date:   Tue Apr 9 10:51:05 2024 -0700

    x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n
    
    commit f337a6a21e2fd67eadea471e93d05dd37baaa9be upstream.
    
    Initialize cpu_mitigations to CPU_MITIGATIONS_OFF if the kernel is built
    with CONFIG_SPECULATION_MITIGATIONS=n, as the help text quite clearly
    states that disabling SPECULATION_MITIGATIONS is supposed to turn off all
    mitigations by default.
    
      │ If you say N, all mitigations will be disabled. You really
      │ should know what you are doing to say so.
    
    As is, the kernel still defaults to CPU_MITIGATIONS_AUTO, which results in
    some mitigations being enabled in spite of SPECULATION_MITIGATIONS=n.
    
    Fixes: f43b9876e857 ("x86/retbleed: Add fine grained Kconfig knobs")
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
    Cc: stable@vger.kernel.org
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Link: https://lore.kernel.org/r/20240409175108.1512861-2-seanjc@google.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit e8f4a290abe9c83b280379367101370cae3014f8
Author: Namhyung Kim <namhyung@kernel.org>
Date:   Tue Mar 5 22:10:03 2024 -0800

    perf/x86: Fix out of range data
    
    commit dec8ced871e17eea46f097542dd074d022be4bd1 upstream.
    
    On x86 each struct cpu_hw_events maintains a table for counter assignment but
    it missed to update one for the deleted event in x86_pmu_del().  This
    can make perf_clear_dirty_counters() reset used counter if it's called
    before event scheduling or enabling.  Then it would return out of range
    data which doesn't make sense.
    
    The following code can reproduce the problem.
    
      $ cat repro.c
      #include <pthread.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <linux/perf_event.h>
      #include <sys/ioctl.h>
      #include <sys/mman.h>
      #include <sys/syscall.h>
    
      struct perf_event_attr attr = {
            .type = PERF_TYPE_HARDWARE,
            .config = PERF_COUNT_HW_CPU_CYCLES,
            .disabled = 1,
      };
    
      void *worker(void *arg)
      {
            int cpu = (long)arg;
            int fd1 = syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
            int fd2 = syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
            void *p;
    
            do {
                    ioctl(fd1, PERF_EVENT_IOC_ENABLE, 0);
                    p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd1, 0);
                    ioctl(fd2, PERF_EVENT_IOC_ENABLE, 0);
    
                    ioctl(fd2, PERF_EVENT_IOC_DISABLE, 0);
                    munmap(p, 4096);
                    ioctl(fd1, PERF_EVENT_IOC_DISABLE, 0);
            } while (1);
    
            return NULL;
      }
    
      int main(void)
      {
            int i;
            int n = sysconf(_SC_NPROCESSORS_ONLN);
            pthread_t *th = calloc(n, sizeof(*th));
    
            for (i = 0; i < n; i++)
                    pthread_create(&th[i], NULL, worker, (void *)(long)i);
            for (i = 0; i < n; i++)
                    pthread_join(th[i], NULL);
    
            free(th);
            return 0;
      }
    
    And you can see the out of range data using perf stat like this.
    Probably it'd be easier to see on a large machine.
    
      $ gcc -o repro repro.c -pthread
      $ ./repro &
      $ sudo perf stat -A -I 1000 2>&1 | awk '{ if (length($3) > 15) print }'
           1.001028462 CPU6   196,719,295,683,763      cycles                           # 194290.996 GHz                       (71.54%)
           1.001028462 CPU3   396,077,485,787,730      branch-misses                    # 15804359784.80% of all branches      (71.07%)
           1.001028462 CPU17  197,608,350,727,877      branch-misses                    # 14594186554.56% of all branches      (71.22%)
           2.020064073 CPU4   198,372,472,612,140      cycles                           # 194681.113 GHz                       (70.95%)
           2.020064073 CPU6   199,419,277,896,696      cycles                           # 195720.007 GHz                       (70.57%)
           2.020064073 CPU20  198,147,174,025,639      cycles                           # 194474.654 GHz                       (71.03%)
           2.020064073 CPU20  198,421,240,580,145      stalled-cycles-frontend          #  100.14% frontend cycles idle        (70.93%)
           3.037443155 CPU4   197,382,689,923,416      cycles                           # 194043.065 GHz                       (71.30%)
           3.037443155 CPU20  196,324,797,879,414      cycles                           # 193003.773 GHz                       (71.69%)
           3.037443155 CPU5   197,679,956,608,205      stalled-cycles-backend           # 1315606428.66% backend cycles idle   (71.19%)
           3.037443155 CPU5   198,571,860,474,851      instructions                     # 13215422.58  insn per cycle
    
    It should move the contents in the cpuc->assign as well.
    
    Fixes: 5471eea5d3bf ("perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task")
    Signed-off-by: Namhyung Kim <namhyung@kernel.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20240306061003.1894224-1-namhyung@kernel.org
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit acf9b01d344f2114f20f020379c69a48ebde5b29
Author: Gavin Shan <gshan@redhat.com>
Date:   Thu Mar 28 10:21:47 2024 +1000

    vhost: Add smp_rmb() in vhost_vq_avail_empty()
    
    commit 22e1992cf7b034db5325660e98c41ca5afa5f519 upstream.
    
    A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
    Will. Otherwise, it's not ensured the available ring entries pushed
    by guest can be observed by vhost in time, leading to stale available
    ring entries fetched by vhost in vhost_get_vq_desc(), as reported by
    Yihuang Yu on NVidia's grace-hopper (ARM64) platform.
    
      /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64      \
      -accel kvm -machine virt,gic-version=host -cpu host          \
      -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
      -m 4096M,slots=16,maxmem=64G                                 \
      -object memory-backend-ram,id=mem0,size=4096M                \
       :                                                           \
      -netdev tap,id=vnet0,vhost=true                              \
      -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
       :
      guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
      virtio_net virtio0: output.0:id 100 is not a head!
    
    Add the missed smp_rmb() in vhost_vq_avail_empty(). When tx_can_batch()
    returns true, it means there's still pending tx buffers. Since it might
    read indices, so it still can bypass the smp_rmb() in vhost_get_vq_desc().
    Note that it should be safe until vq->avail_idx is changed by commit
    275bf960ac697 ("vhost: better detection of available buffers").
    
    Fixes: 275bf960ac69 ("vhost: better detection of available buffers")
    Cc: <stable@kernel.org> # v4.11+
    Reported-by: Yihuang Yu <yihyu@redhat.com>
    Suggested-by: Will Deacon <will@kernel.org>
    Signed-off-by: Gavin Shan <gshan@redhat.com>
    Acked-by: Jason Wang <jasowang@redhat.com>
    Message-Id: <20240328002149.1141302-2-gshan@redhat.com>
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit d2dc6600d4e3e1453e3b1fb233e9f97e2a1ae949
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Thu Apr 4 23:33:25 2024 +0300

    drm/client: Fully protect modes[] with dev->mode_config.mutex
    
    commit 3eadd887dbac1df8f25f701e5d404d1b90fd0fea upstream.
    
    The modes[] array contains pointers to modes on the connectors'
    mode lists, which are protected by dev->mode_config.mutex.
    Thus we need to extend modes[] the same protection or by the
    time we use it the elements may already be pointing to
    freed/reused memory.
    
    Cc: stable@vger.kernel.org
    Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10583
    Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20240404203336.10454-2-ville.syrjala@linux.intel.com
    Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
    Reviewed-by: Jani Nikula <jani.nikula@intel.com>
    Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 773d38f42bbe5b359620fe6ef8bc7a57e84f96a3
Author: Boris Burkov <boris@bur.io>
Date:   Tue Mar 19 10:54:22 2024 -0700

    btrfs: qgroup: correctly model root qgroup rsv in convert
    
    commit 141fb8cd206ace23c02cd2791c6da52c1d77d42a upstream.
    
    We use add_root_meta_rsv and sub_root_meta_rsv to track prealloc and
    pertrans reservations for subvolumes when quotas are enabled. The
    convert function does not properly increment pertrans after decrementing
    prealloc, so the count is not accurate.
    
    Note: we check that the fs is not read-only to mirror the logic in
    qgroup_convert_meta, which checks that before adding to the pertrans rsv.
    
    Fixes: 8287475a2055 ("btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space")
    CC: stable@vger.kernel.org # 6.1+
    Reviewed-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: Boris Burkov <boris@bur.io>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 23b57c5566097a5a973e0ed9948b51bda5ca9c63
Author: Jacob Pan <jacob.jun.pan@linux.intel.com>
Date:   Thu Apr 11 11:07:43 2024 +0800

    iommu/vt-d: Allocate local memory for page request queue
    
    [ Upstream commit a34f3e20ddff02c4f12df2c0635367394e64c63d ]
    
    The page request queue is per IOMMU, its allocation should be made
    NUMA-aware for performance reasons.
    
    Fixes: a222a7f0bb6c ("iommu/vt-d: Implement page request handling")
    Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
    Reviewed-by: Kevin Tian <kevin.tian@intel.com>
    Link: https://lore.kernel.org/r/20240403214007.985600-1-jacob.jun.pan@linux.intel.com
    Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
    Signed-off-by: Joerg Roedel <jroedel@suse.de>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 81f3ad644fbfaf1f0a2bd40885d38d2527886cfb
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Wed Apr 3 10:06:24 2024 +0200

    tracing: hide unused ftrace_event_id_fops
    
    [ Upstream commit 5281ec83454d70d98b71f1836fb16512566c01cd ]
    
    When CONFIG_PERF_EVENTS, a 'make W=1' build produces a warning about the
    unused ftrace_event_id_fops variable:
    
    kernel/trace/trace_events.c:2155:37: error: 'ftrace_event_id_fops' defined but not used [-Werror=unused-const-variable=]
     2155 | static const struct file_operations ftrace_event_id_fops = {
    
    Hide this in the same #ifdef as the reference to it.
    
    Link: https://lore.kernel.org/linux-trace-kernel/20240403080702.3509288-7-arnd@kernel.org
    
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: Zheng Yejian <zhengyejian1@huawei.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Ajay Kaher <akaher@vmware.com>
    Cc: Jinjie Ruan <ruanjinjie@huawei.com>
    Cc: Clément Léger <cleger@rivosinc.com>
    Cc: Dan Carpenter <dan.carpenter@linaro.org>
    Cc: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com>
    Fixes: 620a30e97feb ("tracing: Don't pass file_operations array to event_create_dir()")
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit fdfbf54d128ab6ab255db138488f9650485795a2
Author: David Arinzon <darinzon@amazon.com>
Date:   Wed Apr 10 09:13:57 2024 +0000

    net: ena: Fix incorrect descriptor free behavior
    
    [ Upstream commit bf02d9fe00632d22fa91d34749c7aacf397b6cde ]
    
    ENA has two types of TX queues:
    - queues which only process TX packets arriving from the network stack
    - queues which only process TX packets forwarded to it by XDP_REDIRECT
      or XDP_TX instructions
    
    The ena_free_tx_bufs() cycles through all descriptors in a TX queue
    and unmaps + frees every descriptor that hasn't been acknowledged yet
    by the device (uncompleted TX transactions).
    The function assumes that the processed TX queue is necessarily from
    the first category listed above and ends up using napi_consume_skb()
    for descriptors belonging to an XDP specific queue.
    
    This patch solves a bug in which, in case of a VF reset, the
    descriptors aren't freed correctly, leading to crashes.
    
    Fixes: 548c4940b9f1 ("net: ena: Implement XDP_TX action")
    Signed-off-by: Shay Agroskin <shayagr@amazon.com>
    Signed-off-by: David Arinzon <darinzon@amazon.com>
    Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit ec25a9ce095a8b660a325693c7d1c3b1bfb4df8f
Author: David Arinzon <darinzon@amazon.com>
Date:   Wed Apr 10 09:13:56 2024 +0000

    net: ena: Wrong missing IO completions check order
    
    [ Upstream commit f7e417180665234fdb7af2ebe33d89aaa434d16f ]
    
    Missing IO completions check is called every second (HZ jiffies).
    This commit fixes several issues with this check:
    
    1. Duplicate queues check:
       Max of 4 queues are scanned on each check due to monitor budget.
       Once reaching the budget, this check exits under the assumption that
       the next check will continue to scan the remainder of the queues,
       but in practice, next check will first scan the last already scanned
       queue which is not necessary and may cause the full queue scan to
       last a couple of seconds longer.
       The fix is to start every check with the next queue to scan.
       For example, on 8 IO queues:
       Bug: [0,1,2,3], [3,4,5,6], [6,7]
       Fix: [0,1,2,3], [4,5,6,7]
    
    2. Unbalanced queues check:
       In case the number of active IO queues is not a multiple of budget,
       there will be checks which don't utilize the full budget
       because the full scan exits when reaching the last queue id.
       The fix is to run every TX completion check with exact queue budget
       regardless of the queue id.
       For example, on 7 IO queues:
       Bug: [0,1,2,3], [4,5,6], [0,1,2,3]
       Fix: [0,1,2,3], [4,5,6,0], [1,2,3,4]
       The budget may be lowered in case the number of IO queues is less
       than the budget (4) to make sure there are no duplicate queues on
       the same check.
       For example, on 3 IO queues:
       Bug: [0,1,2,0], [1,2,0,1]
       Fix: [0,1,2], [0,1,2]
    
    Fixes: 1738cd3ed342 ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)")
    Signed-off-by: Amit Bernstein <amitbern@amazon.com>
    Signed-off-by: David Arinzon <darinzon@amazon.com>
    Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit e667a05cbb391506de2f525e116b2b5e2710e485
Author: David Arinzon <darinzon@amazon.com>
Date:   Wed Apr 10 09:13:55 2024 +0000

    net: ena: Fix potential sign extension issue
    
    [ Upstream commit 713a85195aad25d8a26786a37b674e3e5ec09e3c ]
    
    Small unsigned types are promoted to larger signed types in
    the case of multiplication, the result of which may overflow.
    In case the result of such a multiplication has its MSB
    turned on, it will be sign extended with '1's.
    This changes the multiplication result.
    
    Code example of the phenomenon:
    -------------------------------
    u16 x, y;
    size_t z1, z2;
    
    x = y = 0xffff;
    printk("x=%x y=%x\n",x,y);
    
    z1 = x*y;
    z2 = (size_t)x*y;
    
    printk("z1=%lx z2=%lx\n", z1, z2);
    
    Output:
    -------
    x=ffff y=ffff
    z1=fffffffffffe0001 z2=fffe0001
    
    The expected result of ffff*ffff is fffe0001, and without the
    explicit casting to avoid the unwanted sign extension we got
    fffffffffffe0001.
    
    This commit adds an explicit casting to avoid the sign extension
    issue.
    
    Fixes: 689b2bdaaa14 ("net: ena: add functions for handling Low Latency Queues in ena_com")
    Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
    Signed-off-by: David Arinzon <darinzon@amazon.com>
    Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit e76c2678228f6aec74b305ae30c9374cc2f28a51
Author: Michal Luczaj <mhal@rbox.co>
Date:   Tue Apr 9 22:09:39 2024 +0200

    af_unix: Fix garbage collector racing against connect()
    
    [ Upstream commit 47d8ac011fe1c9251070e1bd64cb10b48193ec51 ]
    
    Garbage collector does not take into account the risk of embryo getting
    enqueued during the garbage collection. If such embryo has a peer that
    carries SCM_RIGHTS, two consecutive passes of scan_children() may see a
    different set of children. Leading to an incorrectly elevated inflight
    count, and then a dangling pointer within the gc_inflight_list.
    
    sockets are AF_UNIX/SOCK_STREAM
    S is an unconnected socket
    L is a listening in-flight socket bound to addr, not in fdtable
    V's fd will be passed via sendmsg(), gets inflight count bumped
    
    connect(S, addr)        sendmsg(S, [V]); close(V)       __unix_gc()
    ----------------        -------------------------       -----------
    
    NS = unix_create1()
    skb1 = sock_wmalloc(NS)
    L = unix_find_other(addr)
    unix_state_lock(L)
    unix_peer(S) = NS
                            // V count=1 inflight=0
    
                            NS = unix_peer(S)
                            skb2 = sock_alloc()
                            skb_queue_tail(NS, skb2[V])
    
                            // V became in-flight
                            // V count=2 inflight=1
    
                            close(V)
    
                            // V count=1 inflight=1
                            // GC candidate condition met
    
                                                    for u in gc_inflight_list:
                                                      if (total_refs == inflight_refs)
                                                        add u to gc_candidates
    
                                                    // gc_candidates={L, V}
    
                                                    for u in gc_candidates:
                                                      scan_children(u, dec_inflight)
    
                                                    // embryo (skb1) was not
                                                    // reachable from L yet, so V's
                                                    // inflight remains unchanged
    __skb_queue_tail(L, skb1)
    unix_state_unlock(L)
                                                    for u in gc_candidates:
                                                      if (u.inflight)
                                                        scan_children(u, inc_inflight_move_tail)
    
                                                    // V count=1 inflight=2 (!)
    
    If there is a GC-candidate listening socket, lock/unlock its state. This
    makes GC wait until the end of any ongoing connect() to that socket. After
    flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
    there is another embryo coming, it can not possibly carry SCM_RIGHTS. At
    this point, unix_inflight() can not happen because unix_gc_lock is already
    taken. Inflight graph remains unaffected.
    
    Fixes: 1fd05ba5a2f2 ("[AF_UNIX]: Rewrite garbage collector, fixes race.")
    Signed-off-by: Michal Luczaj <mhal@rbox.co>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Link: https://lore.kernel.org/r/20240409201047.1032217-1-mhal@rbox.co
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 37120fa8d92a43f538ff6b7646948a55cdbad266
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Jan 23 09:08:53 2024 -0800

    af_unix: Do not use atomic ops for unix_sk(sk)->inflight.
    
    [ Upstream commit 97af84a6bba2ab2b9c704c08e67de3b5ea551bb2 ]
    
    When touching unix_sk(sk)->inflight, we are always under
    spin_lock(&unix_gc_lock).
    
    Let's convert unix_sk(sk)->inflight to the normal unsigned long.
    
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Link: https://lore.kernel.org/r/20240123170856.41348-3-kuniyu@amazon.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Stable-dep-of: 47d8ac011fe1 ("af_unix: Fix garbage collector racing against connect()")
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 22641478d80f169aef136e60cef4a824a0dafe1a
Author: Arınç ÜNAL <arinc.unal@arinc9.com>
Date:   Tue Apr 9 18:01:14 2024 +0300

    net: dsa: mt7530: trap link-local frames regardless of ST Port State
    
    [ Upstream commit 17c560113231ddc20088553c7b499b289b664311 ]
    
    In Clause 5 of IEEE Std 802-2014, two sublayers of the data link layer
    (DLL) of the Open Systems Interconnection basic reference model (OSI/RM)
    are described; the medium access control (MAC) and logical link control
    (LLC) sublayers. The MAC sublayer is the one facing the physical layer.
    
    In 8.2 of IEEE Std 802.1Q-2022, the Bridge architecture is described. A
    Bridge component comprises a MAC Relay Entity for interconnecting the Ports
    of the Bridge, at least two Ports, and higher layer entities with at least
    a Spanning Tree Protocol Entity included.
    
    Each Bridge Port also functions as an end station and shall provide the MAC
    Service to an LLC Entity. Each instance of the MAC Service is provided to a
    distinct LLC Entity that supports protocol identification, multiplexing,
    and demultiplexing, for protocol data unit (PDU) transmission and reception
    by one or more higher layer entities.
    
    It is described in 8.13.9 of IEEE Std 802.1Q-2022 that in a Bridge, the LLC
    Entity associated with each Bridge Port is modeled as being directly
    connected to the attached Local Area Network (LAN).
    
    On the switch with CPU port architecture, CPU port functions as Management
    Port, and the Management Port functionality is provided by software which
    functions as an end station. Software is connected to an IEEE 802 LAN that
    is wholly contained within the system that incorporates the Bridge.
    Software provides access to the LLC Entity associated with each Bridge Port
    by the value of the source port field on the special tag on the frame
    received by software.
    
    We call frames that carry control information to determine the active
    topology and current extent of each Virtual Local Area Network (VLAN),
    i.e., spanning tree or Shortest Path Bridging (SPB) and Multiple VLAN
    Registration Protocol Data Units (MVRPDUs), and frames from other link
    constrained protocols, such as Extensible Authentication Protocol over LAN
    (EAPOL) and Link Layer Discovery Protocol (LLDP), link-local frames. They
    are not forwarded by a Bridge. Permanently configured entries in the
    filtering database (FDB) ensure that such frames are discarded by the
    Forwarding Process. In 8.6.3 of IEEE Std 802.1Q-2022, this is described in
    detail:
    
    Each of the reserved MAC addresses specified in Table 8-1
    (01-80-C2-00-00-[00,01,02,03,04,05,06,07,08,09,0A,0B,0C,0D,0E,0F]) shall be
    permanently configured in the FDB in C-VLAN components and ERs.
    
    Each of the reserved MAC addresses specified in Table 8-2
    (01-80-C2-00-00-[01,02,03,04,05,06,07,08,09,0A,0E]) shall be permanently
    configured in the FDB in S-VLAN components.
    
    Each of the reserved MAC addresses specified in Table 8-3
    (01-80-C2-00-00-[01,02,04,0E]) shall be permanently configured in the FDB
    in TPMR components.
    
    The FDB entries for reserved MAC addresses shall specify filtering for all
    Bridge Ports and all VIDs. Management shall not provide the capability to
    modify or remove entries for reserved MAC addresses.
    
    The addresses in Table 8-1, Table 8-2, and Table 8-3 determine the scope of
    propagation of PDUs within a Bridged Network, as follows:
    
      The Nearest Bridge group address (01-80-C2-00-00-0E) is an address that
      no conformant Two-Port MAC Relay (TPMR) component, Service VLAN (S-VLAN)
      component, Customer VLAN (C-VLAN) component, or MAC Bridge can forward.
      PDUs transmitted using this destination address, or any other addresses
      that appear in Table 8-1, Table 8-2, and Table 8-3
      (01-80-C2-00-00-[00,01,02,03,04,05,06,07,08,09,0A,0B,0C,0D,0E,0F]), can
      therefore travel no further than those stations that can be reached via a
      single individual LAN from the originating station.
    
      The Nearest non-TPMR Bridge group address (01-80-C2-00-00-03), is an
      address that no conformant S-VLAN component, C-VLAN component, or MAC
      Bridge can forward; however, this address is relayed by a TPMR component.
      PDUs using this destination address, or any of the other addresses that
      appear in both Table 8-1 and Table 8-2 but not in Table 8-3
      (01-80-C2-00-00-[00,03,05,06,07,08,09,0A,0B,0C,0D,0F]), will be relayed
      by any TPMRs but will propagate no further than the nearest S-VLAN
      component, C-VLAN component, or MAC Bridge.
    
      The Nearest Customer Bridge group address (01-80-C2-00-00-00) is an
      address that no conformant C-VLAN component, MAC Bridge can forward;
      however, it is relayed by TPMR components and S-VLAN components. PDUs
      using this destination address, or any of the other addresses that appear
      in Table 8-1 but not in either Table 8-2 or Table 8-3
      (01-80-C2-00-00-[00,0B,0C,0D,0F]), will be relayed by TPMR components and
      S-VLAN components but will propagate no further than the nearest C-VLAN
      component or MAC Bridge.
    
    Because the LLC Entity associated with each Bridge Port is provided via CPU
    port, we must not filter these frames but forward them to CPU port.
    
    In a Bridge, the transmission Port is majorly decided by ingress and egress
    rules, FDB, and spanning tree Port State functions of the Forwarding
    Process. For link-local frames, only CPU port should be designated as
    destination port in the FDB, and the other functions of the Forwarding
    Process must not interfere with the decision of the transmission Port. We
    call this process trapping frames to CPU port.
    
    Therefore, on the switch with CPU port architecture, link-local frames must
    be trapped to CPU port, and certain link-local frames received by a Port of
    a Bridge comprising a TPMR component or an S-VLAN component must be
    excluded from it.
    
    A Bridge of the switch with CPU port architecture cannot comprise a
    Two-Port MAC Relay (TPMR) component as a TPMR component supports only a
    subset of the functionality of a MAC Bridge. A Bridge comprising two Ports
    (Management Port doesn't count) of this architecture will either function
    as a standard MAC Bridge or a standard VLAN Bridge.
    
    Therefore, a Bridge of this architecture can only comprise S-VLAN
    components, C-VLAN components, or MAC Bridge components. Since there's no
    TPMR component, we don't need to relay PDUs using the destination addresses
    specified on the Nearest non-TPMR section, and the proportion of the
    Nearest Customer Bridge section where they must be relayed by TPMR
    components.
    
    One option to trap link-local frames to CPU port is to add static FDB
    entries with CPU port designated as destination port. However, because that
    Independent VLAN Learning (IVL) is being used on every VID, each entry only
    applies to a single VLAN Identifier (VID). For a Bridge comprising a MAC
    Bridge component or a C-VLAN component, there would have to be 16 times
    4096 entries. This switch intellectual property can only hold a maximum of
    2048 entries. Using this option, there also isn't a mechanism to prevent
    link-local frames from being discarded when the spanning tree Port State of
    the reception Port is discarding.
    
    The remaining option is to utilise the BPC, RGAC1, RGAC2, RGAC3, and RGAC4
    registers. Whilst this applies to every VID, it doesn't contain all of the
    reserved MAC addresses without affecting the remaining Standard Group MAC
    Addresses. The REV_UN frame tag utilised using the RGAC4 register covers
    the remaining 01-80-C2-00-00-[04,05,06,07,08,09,0A,0B,0C,0D,0F] destination
    addresses. It also includes the 01-80-C2-00-00-22 to 01-80-C2-00-00-FF
    destination addresses which may be relayed by MAC Bridges or VLAN Bridges.
    The latter option provides better but not complete conformance.
    
    This switch intellectual property also does not provide a mechanism to trap
    link-local frames with specific destination addresses to CPU port by
    Bridge, to conform to the filtering rules for the distinct Bridge
    components.
    
    Therefore, regardless of the type of the Bridge component, link-local
    frames with these destination addresses will be trapped to CPU port:
    
    01-80-C2-00-00-[00,01,02,03,0E]
    
    In a Bridge comprising a MAC Bridge component or a C-VLAN component:
    
      Link-local frames with these destination addresses won't be trapped to
      CPU port which won't conform to IEEE Std 802.1Q-2022:
    
      01-80-C2-00-00-[04,05,06,07,08,09,0A,0B,0C,0D,0F]
    
    In a Bridge comprising an S-VLAN component:
    
      Link-local frames with these destination addresses will be trapped to CPU
      port which won't conform to IEEE Std 802.1Q-2022:
    
      01-80-C2-00-00-00
    
      Link-local frames with these destination addresses won't be trapped to
      CPU port which won't conform to IEEE Std 802.1Q-2022:
    
      01-80-C2-00-00-[04,05,06,07,08,09,0A]
    
    Currently on this switch intellectual property, if the spanning tree Port
    State of the reception Port is discarding, link-local frames will be
    discarded.
    
    To trap link-local frames regardless of the spanning tree Port State, make
    the switch regard them as Bridge Protocol Data Units (BPDUs). This switch
    intellectual property only lets the frames regarded as BPDUs bypass the
    spanning tree Port State function of the Forwarding Process.
    
    With this change, the only remaining interference is the ingress rules.
    When the reception Port has no PVID assigned on software, VLAN-untagged
    frames won't be allowed in. There doesn't seem to be a mechanism on the
    switch intellectual property to have link-local frames bypass this function
    of the Forwarding Process.
    
    Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
    Reviewed-by: Daniel Golle <daniel@makrotopia.org>
    Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
    Link: https://lore.kernel.org/r/20240409-b4-for-net-mt7530-fix-link-local-when-stp-discarding-v2-1-07b1150164ac@arinc9.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 26515606ecb5be4fe439278f8687c18357929e3d
Author: Daniel Machon <daniel.machon@microchip.com>
Date:   Tue Apr 9 12:41:59 2024 +0200

    net: sparx5: fix wrong config being used when reconfiguring PCS
    
    [ Upstream commit 33623113a48ea906f1955cbf71094f6aa4462e8f ]
    
    The wrong port config is being used if the PCS is reconfigured. Fix this
    by correctly using the new config instead of the old one.
    
    Fixes: 946e7fd5053a ("net: sparx5: add port module support")
    Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
    Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
    Link: https://lore.kernel.org/r/20240409-link-mode-reconfiguration-fix-v2-1-db6a507f3627@microchip.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 7aaee12b804c5e0374e7b132b6ec2158ff33dd64
Author: Cosmin Ratiu <cratiu@nvidia.com>
Date:   Tue Apr 9 22:08:12 2024 +0300

    net/mlx5: Properly link new fs rules into the tree
    
    [ Upstream commit 7c6782ad4911cbee874e85630226ed389ff2e453 ]
    
    Previously, add_rule_fg would only add newly created rules from the
    handle into the tree when they had a refcount of 1. On the other hand,
    create_flow_handle tries hard to find and reference already existing
    identical rules instead of creating new ones.
    
    These two behaviors can result in a situation where create_flow_handle
    1) creates a new rule and references it, then
    2) in a subsequent step during the same handle creation references it
       again,
    resulting in a rule with a refcount of 2 that is not linked into the
    tree, will have a NULL parent and root and will result in a crash when
    the flow group is deleted because del_sw_hw_rule, invoked on rule
    deletion, assumes node->parent is != NULL.
    
    This happened in the wild, due to another bug related to incorrect
    handling of duplicate pkt_reformat ids, which lead to the code in
    create_flow_handle incorrectly referencing a just-added rule in the same
    flow handle, resulting in the problem described above. Full details are
    at [1].
    
    This patch changes add_rule_fg to add new rules without parents into
    the tree, properly initializing them and avoiding the crash. This makes
    it more consistent with how rules are added to an FTE in
    create_flow_handle.
    
    Fixes: 74491de93712 ("net/mlx5: Add multi dest support")
    Link: https://lore.kernel.org/netdev/ea5264d6-6b55-4449-a602-214c6f509c1e@163.com/T/#u [1]
    Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
    Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
    Reviewed-by: Mark Bloch <mbloch@nvidia.com>
    Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
    Link: https://lore.kernel.org/r/20240409190820.227554-5-tariqt@nvidia.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 97dab36e57c64106e1c8ebd66cbf0d2d1e52d6b7
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Apr 9 12:07:41 2024 +0000

    netfilter: complete validation of user input
    
    [ Upstream commit 65acf6e0501ac8880a4f73980d01b5d27648b956 ]
    
    In my recent commit, I missed that do_replace() handlers
    use copy_from_sockptr() (which I fixed), followed
    by unsafe copy_from_sockptr_offset() calls.
    
    In all functions, we can perform the @optlen validation
    before even calling xt_alloc_table_info() with the following
    check:
    
    if ((u64)optlen < (u64)tmp.size + sizeof(tmp))
            return -EINVAL;
    
    Fixes: 0c83842df40f ("netfilter: validate user input for expected length")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
    Link: https://lore.kernel.org/r/20240409120741.3538135-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 4b19e9507c275de0cfe61c24db69179dc52cf9fb
Author: Jiri Benc <jbenc@redhat.com>
Date:   Mon Apr 8 16:18:21 2024 +0200

    ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr
    
    [ Upstream commit 7633c4da919ad51164acbf1aa322cc1a3ead6129 ]
    
    Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock, it
    still means hlist_for_each_entry_rcu can return an item that got removed
    from the list. The memory itself of such item is not freed thanks to RCU
    but nothing guarantees the actual content of the memory is sane.
    
    In particular, the reference count can be zero. This can happen if
    ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
    from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
    references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
    timing, this can happen:
    
    1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.
    
    2. Then, the whole ipv6_del_addr is executed for the given entry. The
       reference count drops to zero and kfree_rcu is scheduled.
    
    3. ipv6_get_ifaddr continues and tries to increments the reference count
       (in6_ifa_hold).
    
    4. The rcu is unlocked and the entry is freed.
    
    5. The freed entry is returned.
    
    Prevent increasing of the reference count in such case. The name
    in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.
    
    [   41.506330] refcount_t: addition on 0; use-after-free.
    [   41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
    [   41.507413] Modules linked in: veth bridge stp llc
    [   41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
    [   41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
    [   41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
    [   41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
    [   41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
    [   41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
    [   41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
    [   41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
    [   41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
    [   41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
    [   41.514086] FS:  00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
    [   41.514726] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [   41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
    [   41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [   41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [   41.516799] Call Trace:
    [   41.517037]  <TASK>
    [   41.517249]  ? __warn+0x7b/0x120
    [   41.517535]  ? refcount_warn_saturate+0xa5/0x130
    [   41.517923]  ? report_bug+0x164/0x190
    [   41.518240]  ? handle_bug+0x3d/0x70
    [   41.518541]  ? exc_invalid_op+0x17/0x70
    [   41.520972]  ? asm_exc_invalid_op+0x1a/0x20
    [   41.521325]  ? refcount_warn_saturate+0xa5/0x130
    [   41.521708]  ipv6_get_ifaddr+0xda/0xe0
    [   41.522035]  inet6_rtm_getaddr+0x342/0x3f0
    [   41.522376]  ? __pfx_inet6_rtm_getaddr+0x10/0x10
    [   41.522758]  rtnetlink_rcv_msg+0x334/0x3d0
    [   41.523102]  ? netlink_unicast+0x30f/0x390
    [   41.523445]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
    [   41.523832]  netlink_rcv_skb+0x53/0x100
    [   41.524157]  netlink_unicast+0x23b/0x390
    [   41.524484]  netlink_sendmsg+0x1f2/0x440
    [   41.524826]  __sys_sendto+0x1d8/0x1f0
    [   41.525145]  __x64_sys_sendto+0x1f/0x30
    [   41.525467]  do_syscall_64+0xa5/0x1b0
    [   41.525794]  entry_SYSCALL_64_after_hwframe+0x72/0x7a
    [   41.526213] RIP: 0033:0x7fbc4cfcea9a
    [   41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
    [   41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
    [   41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
    [   41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
    [   41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
    [   41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    [   41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
    [   41.531573]  </TASK>
    
    Fixes: 5c578aedcb21d ("IPv6: convert addrconf hash list to RCU")
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: Jiri Benc <jbenc@redhat.com>
    Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 6179cdbfe05df6ef66bb0681883f6bc04cab515e
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Mon Apr 8 09:42:03 2024 +0200

    ipv4/route: avoid unused-but-set-variable warning
    
    [ Upstream commit cf1b7201df59fb936f40f4a807433fe3f2ce310a ]
    
    The log_martians variable is only used in an #ifdef, causing a 'make W=1'
    warning with gcc:
    
    net/ipv4/route.c: In function 'ip_rt_send_redirect':
    net/ipv4/route.c:880:13: error: variable 'log_martians' set but not used [-Werror=unused-but-set-variable]
    
    Change the #ifdef to an equivalent IS_ENABLED() to let the compiler
    see where the variable is used.
    
    Fixes: 30038fc61adf ("net: ip_rt_send_redirect() optimization")
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20240408074219.3030256-2-arnd@kernel.org
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit ed94af8d07d55caf9202e33c11423f5881727f72
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Mon Apr 8 09:42:02 2024 +0200

    ipv6: fib: hide unused 'pn' variable
    
    [ Upstream commit 74043489fcb5e5ca4074133582b5b8011b67f9e7 ]
    
    When CONFIG_IPV6_SUBTREES is disabled, the only user is hidden, causing
    a 'make W=1' warning:
    
    net/ipv6/ip6_fib.c: In function 'fib6_add':
    net/ipv6/ip6_fib.c:1388:32: error: variable 'pn' set but not used [-Werror=unused-but-set-variable]
    
    Add another #ifdef around the variable declaration, matching the other
    uses in this file.
    
    Fixes: 66729e18df08 ("[IPV6] ROUTE: Make sure we have fn->leaf when adding a node on subtree.")
    Link: https://lore.kernel.org/netdev/20240322131746.904943-1-arnd@kernel.org/
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20240408074219.3030256-1-arnd@kernel.org
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 98b3e282623f991afeaa1d3719a6202144dbe300
Author: Geetha sowjanya <gakula@marvell.com>
Date:   Mon Apr 8 12:06:43 2024 +0530

    octeontx2-af: Fix NIX SQ mode and BP config
    
    [ Upstream commit faf23006185e777db18912685922c5ddb2df383f ]
    
    NIX SQ mode and link backpressure configuration is required for
    all platforms. But in current driver this code is wrongly placed
    under specific platform check. This patch fixes the issue by
    moving the code out of platform check.
    
    Fixes: 5d9b976d4480 ("octeontx2-af: Support fixed transmit scheduler topology")
    Signed-off-by: Geetha sowjanya <gakula@marvell.com>
    Link: https://lore.kernel.org/r/20240408063643.26288-1-gakula@marvell.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit b4bc99d04c689b5652665394ae8d3e02fb754153
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri Apr 5 15:10:57 2024 -0700

    af_unix: Clear stale u->oob_skb.
    
    [ Upstream commit b46f4eaa4f0ec38909fb0072eea3aeddb32f954e ]
    
    syzkaller started to report deadlock of unix_gc_lock after commit
    4090fa373f0e ("af_unix: Replace garbage collection algorithm."), but
    it just uncovers the bug that has been there since commit 314001f0bf92
    ("af_unix: Add OOB support").
    
    The repro basically does the following.
    
      from socket import *
      from array import array
    
      c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
      c1.sendmsg([b'a'], [(SOL_SOCKET, SCM_RIGHTS, array("i", [c2.fileno()]))], MSG_OOB)
      c2.recv(1)  # blocked as no normal data in recv queue
    
      c2.close()  # done async and unblock recv()
      c1.close()  # done async and trigger GC
    
    A socket sends its file descriptor to itself as OOB data and tries to
    receive normal data, but finally recv() fails due to async close().
    
    The problem here is wrong handling of OOB skb in manage_oob().  When
    recvmsg() is called without MSG_OOB, manage_oob() is called to check
    if the peeked skb is OOB skb.  In such a case, manage_oob() pops it
    out of the receive queue but does not clear unix_sock(sk)->oob_skb.
    This is wrong in terms of uAPI.
    
    Let's say we send "hello" with MSG_OOB, and "world" without MSG_OOB.
    The 'o' is handled as OOB data.  When recv() is called twice without
    MSG_OOB, the OOB data should be lost.
    
      >>> from socket import *
      >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM, 0)
      >>> c1.send(b'hello', MSG_OOB)  # 'o' is OOB data
      5
      >>> c1.send(b'world')
      5
      >>> c2.recv(5)  # OOB data is not received
      b'hell'
      >>> c2.recv(5)  # OOB date is skipped
      b'world'
      >>> c2.recv(5, MSG_OOB)  # This should return an error
      b'o'
    
    In the same situation, TCP actually returns -EINVAL for the last
    recv().
    
    Also, if we do not clear unix_sk(sk)->oob_skb, unix_poll() always set
    EPOLLPRI even though the data has passed through by previous recv().
    
    To avoid these issues, we must clear unix_sk(sk)->oob_skb when dequeuing
    it from recv queue.
    
    The reason why the old GC did not trigger the deadlock is because the
    old GC relied on the receive queue to detect the loop.
    
    When it is triggered, the socket with OOB data is marked as GC candidate
    because file refcount == inflight count (1).  However, after traversing
    all inflight sockets, the socket still has a positive inflight count (1),
    thus the socket is excluded from candidates.  Then, the old GC lose the
    chance to garbage-collect the socket.
    
    With the old GC, the repro continues to create true garbage that will
    never be freed nor detected by kmemleak as it's linked to the global
    inflight list.  That's why we couldn't even notice the issue.
    
    Fixes: 314001f0bf92 ("af_unix: Add OOB support")
    Reported-by: syzbot+7f7f201cc2668a8fd169@syzkaller.appspotmail.com
    Closes: https://syzkaller.appspot.com/bug?extid=7f7f201cc2668a8fd169
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20240405221057.2406-1-kuniyu@amazon.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 3c1ae6de74e3d2d6333d29a2d3e13e6094596c79
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 5 10:30:34 2024 +0000

    geneve: fix header validation in geneve[6]_xmit_skb
    
    [ Upstream commit d8a6213d70accb403b82924a1c229e733433a5ef ]
    
    syzbot is able to trigger an uninit-value in geneve_xmit() [1]
    
    Problem : While most ip tunnel helpers (like ip_tunnel_get_dsfield())
    uses skb_protocol(skb, true), pskb_inet_may_pull() is only using
    skb->protocol.
    
    If anything else than ETH_P_IPV6 or ETH_P_IP is found in skb->protocol,
    pskb_inet_may_pull() does nothing at all.
    
    If a vlan tag was provided by the caller (af_packet in the syzbot case),
    the network header might not point to the correct location, and skb
    linear part could be smaller than expected.
    
    Add skb_vlan_inet_prepare() to perform a complete mac validation.
    
    Use this in geneve for the moment, I suspect we need to adopt this
    more broadly.
    
    v4 - Jakub reported v3 broke l2_tos_ttl_inherit.sh selftest
       - Only call __vlan_get_protocol() for vlan types.
    Link: https://lore.kernel.org/netdev/20240404100035.3270a7d5@kernel.org/
    
    v2,v3 - Addressed Sabrina comments on v1 and v2
    Link: https://lore.kernel.org/netdev/Zg1l9L2BNoZWZDZG@hog/
    
    [1]
    
    BUG: KMSAN: uninit-value in geneve_xmit_skb drivers/net/geneve.c:910 [inline]
     BUG: KMSAN: uninit-value in geneve_xmit+0x302d/0x5420 drivers/net/geneve.c:1030
      geneve_xmit_skb drivers/net/geneve.c:910 [inline]
      geneve_xmit+0x302d/0x5420 drivers/net/geneve.c:1030
      __netdev_start_xmit include/linux/netdevice.h:4903 [inline]
      netdev_start_xmit include/linux/netdevice.h:4917 [inline]
      xmit_one net/core/dev.c:3531 [inline]
      dev_hard_start_xmit+0x247/0xa20 net/core/dev.c:3547
      __dev_queue_xmit+0x348d/0x52c0 net/core/dev.c:4335
      dev_queue_xmit include/linux/netdevice.h:3091 [inline]
      packet_xmit+0x9c/0x6c0 net/packet/af_packet.c:276
      packet_snd net/packet/af_packet.c:3081 [inline]
      packet_sendmsg+0x8bb0/0x9ef0 net/packet/af_packet.c:3113
      sock_sendmsg_nosec net/socket.c:730 [inline]
      __sock_sendmsg+0x30f/0x380 net/socket.c:745
      __sys_sendto+0x685/0x830 net/socket.c:2191
      __do_sys_sendto net/socket.c:2203 [inline]
      __se_sys_sendto net/socket.c:2199 [inline]
      __x64_sys_sendto+0x125/0x1d0 net/socket.c:2199
     do_syscall_64+0xd5/0x1f0
     entry_SYSCALL_64_after_hwframe+0x6d/0x75
    
    Uninit was created at:
      slab_post_alloc_hook mm/slub.c:3804 [inline]
      slab_alloc_node mm/slub.c:3845 [inline]
      kmem_cache_alloc_node+0x613/0xc50 mm/slub.c:3888
      kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:577
      __alloc_skb+0x35b/0x7a0 net/core/skbuff.c:668
      alloc_skb include/linux/skbuff.h:1318 [inline]
      alloc_skb_with_frags+0xc8/0xbf0 net/core/skbuff.c:6504
      sock_alloc_send_pskb+0xa81/0xbf0 net/core/sock.c:2795
      packet_alloc_skb net/packet/af_packet.c:2930 [inline]
      packet_snd net/packet/af_packet.c:3024 [inline]
      packet_sendmsg+0x722d/0x9ef0 net/packet/af_packet.c:3113
      sock_sendmsg_nosec net/socket.c:730 [inline]
      __sock_sendmsg+0x30f/0x380 net/socket.c:745
      __sys_sendto+0x685/0x830 net/socket.c:2191
      __do_sys_sendto net/socket.c:2203 [inline]
      __se_sys_sendto net/socket.c:2199 [inline]
      __x64_sys_sendto+0x125/0x1d0 net/socket.c:2199
     do_syscall_64+0xd5/0x1f0
     entry_SYSCALL_64_after_hwframe+0x6d/0x75
    
    CPU: 0 PID: 5033 Comm: syz-executor346 Not tainted 6.9.0-rc1-syzkaller-00005-g928a87efa423 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024
    
    Fixes: d13f048dd40e ("net: geneve: modify IP header check in geneve6_xmit_skb and geneve_xmit_skb")
    Reported-by: syzbot+9ee20ec1de7b3168db09@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/netdev/000000000000d19c3a06152f9ee4@google.com/
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Phillip Potter <phil@philpotter.co.uk>
    Cc: Sabrina Dubroca <sd@queasysnail.net>
    Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
    Reviewed-by: Phillip Potter <phil@philpotter.co.uk>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit f0a068de65d5b7358e9aff792716afa9333f3922
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Apr 4 20:27:38 2024 +0000

    xsk: validate user input for XDP_{UMEM|COMPLETION}_FILL_RING
    
    [ Upstream commit 237f3cf13b20db183d3706d997eedc3c49eacd44 ]
    
    syzbot reported an illegal copy in xsk_setsockopt() [1]
    
    Make sure to validate setsockopt() @optlen parameter.
    
    [1]
    
     BUG: KASAN: slab-out-of-bounds in copy_from_sockptr_offset include/linux/sockptr.h:49 [inline]
     BUG: KASAN: slab-out-of-bounds in copy_from_sockptr include/linux/sockptr.h:55 [inline]
     BUG: KASAN: slab-out-of-bounds in xsk_setsockopt+0x909/0xa40 net/xdp/xsk.c:1420
    Read of size 4 at addr ffff888028c6cde3 by task syz-executor.0/7549
    
    CPU: 0 PID: 7549 Comm: syz-executor.0 Not tainted 6.8.0-syzkaller-08951-gfe46a7dd189e #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
    Call Trace:
     <TASK>
      __dump_stack lib/dump_stack.c:88 [inline]
      dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
      print_address_description mm/kasan/report.c:377 [inline]
      print_report+0x169/0x550 mm/kasan/report.c:488
      kasan_report+0x143/0x180 mm/kasan/report.c:601
      copy_from_sockptr_offset include/linux/sockptr.h:49 [inline]
      copy_from_sockptr include/linux/sockptr.h:55 [inline]
      xsk_setsockopt+0x909/0xa40 net/xdp/xsk.c:1420
      do_sock_setsockopt+0x3af/0x720 net/socket.c:2311
      __sys_setsockopt+0x1ae/0x250 net/socket.c:2334
      __do_sys_setsockopt net/socket.c:2343 [inline]
      __se_sys_setsockopt net/socket.c:2340 [inline]
      __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
     do_syscall_64+0xfb/0x240
     entry_SYSCALL_64_after_hwframe+0x6d/0x75
    RIP: 0033:0x7fb40587de69
    Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007fb40665a0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
    RAX: ffffffffffffffda RBX: 00007fb4059abf80 RCX: 00007fb40587de69
    RDX: 0000000000000005 RSI: 000000000000011b RDI: 0000000000000006
    RBP: 00007fb4058ca47a R08: 0000000000000002 R09: 0000000000000000
    R10: 0000000020001980 R11: 0000000000000246 R12: 0000000000000000
    R13: 000000000000000b R14: 00007fb4059abf80 R15: 00007fff57ee4d08
     </TASK>
    
    Allocated by task 7549:
      kasan_save_stack mm/kasan/common.c:47 [inline]
      kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
      poison_kmalloc_redzone mm/kasan/common.c:370 [inline]
      __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387
      kasan_kmalloc include/linux/kasan.h:211 [inline]
      __do_kmalloc_node mm/slub.c:3966 [inline]
      __kmalloc+0x233/0x4a0 mm/slub.c:3979
      kmalloc include/linux/slab.h:632 [inline]
      __cgroup_bpf_run_filter_setsockopt+0xd2f/0x1040 kernel/bpf/cgroup.c:1869
      do_sock_setsockopt+0x6b4/0x720 net/socket.c:2293
      __sys_setsockopt+0x1ae/0x250 net/socket.c:2334
      __do_sys_setsockopt net/socket.c:2343 [inline]
      __se_sys_setsockopt net/socket.c:2340 [inline]
      __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
     do_syscall_64+0xfb/0x240
     entry_SYSCALL_64_after_hwframe+0x6d/0x75
    
    The buggy address belongs to the object at ffff888028c6cde0
     which belongs to the cache kmalloc-8 of size 8
    The buggy address is located 1 bytes to the right of
     allocated 2-byte region [ffff888028c6cde0, ffff888028c6cde2)
    
    The buggy address belongs to the physical page:
    page:ffffea0000a31b00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888028c6c9c0 pfn:0x28c6c
    anon flags: 0xfff00000000800(slab|node=0|zone=1|lastcpupid=0x7ff)
    page_type: 0xffffffff()
    raw: 00fff00000000800 ffff888014c41280 0000000000000000 dead000000000001
    raw: ffff888028c6c9c0 0000000080800057 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected
    page_owner tracks the page as allocated
    page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112cc0(GFP_USER|__GFP_NOWARN|__GFP_NORETRY), pid 6648, tgid 6644 (syz-executor.0), ts 133906047828, free_ts 133859922223
      set_page_owner include/linux/page_owner.h:31 [inline]
      post_alloc_hook+0x1ea/0x210 mm/page_alloc.c:1533
      prep_new_page mm/page_alloc.c:1540 [inline]
      get_page_from_freelist+0x33ea/0x3580 mm/page_alloc.c:3311
      __alloc_pages+0x256/0x680 mm/page_alloc.c:4569
      __alloc_pages_node include/linux/gfp.h:238 [inline]
      alloc_pages_node include/linux/gfp.h:261 [inline]
      alloc_slab_page+0x5f/0x160 mm/slub.c:2175
      allocate_slab mm/slub.c:2338 [inline]
      new_slab+0x84/0x2f0 mm/slub.c:2391
      ___slab_alloc+0xc73/0x1260 mm/slub.c:3525
      __slab_alloc mm/slub.c:3610 [inline]
      __slab_alloc_node mm/slub.c:3663 [inline]
      slab_alloc_node mm/slub.c:3835 [inline]
      __do_kmalloc_node mm/slub.c:3965 [inline]
      __kmalloc_node+0x2db/0x4e0 mm/slub.c:3973
      kmalloc_node include/linux/slab.h:648 [inline]
      __vmalloc_area_node mm/vmalloc.c:3197 [inline]
      __vmalloc_node_range+0x5f9/0x14a0 mm/vmalloc.c:3392
      __vmalloc_node mm/vmalloc.c:3457 [inline]
      vzalloc+0x79/0x90 mm/vmalloc.c:3530
      bpf_check+0x260/0x19010 kernel/bpf/verifier.c:21162
      bpf_prog_load+0x1667/0x20f0 kernel/bpf/syscall.c:2895
      __sys_bpf+0x4ee/0x810 kernel/bpf/syscall.c:5631
      __do_sys_bpf kernel/bpf/syscall.c:5738 [inline]
      __se_sys_bpf kernel/bpf/syscall.c:5736 [inline]
      __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5736
     do_syscall_64+0xfb/0x240
     entry_SYSCALL_64_after_hwframe+0x6d/0x75
    page last free pid 6650 tgid 6647 stack trace:
      reset_page_owner include/linux/page_owner.h:24 [inline]
      free_pages_prepare mm/page_alloc.c:1140 [inline]
      free_unref_page_prepare+0x95d/0xa80 mm/page_alloc.c:2346
      free_unref_page_list+0x5a3/0x850 mm/page_alloc.c:2532
      release_pages+0x2117/0x2400 mm/swap.c:1042
      tlb_batch_pages_flush mm/mmu_gather.c:98 [inline]
      tlb_flush_mmu_free mm/mmu_gather.c:293 [inline]
      tlb_flush_mmu+0x34d/0x4e0 mm/mmu_gather.c:300
      tlb_finish_mmu+0xd4/0x200 mm/mmu_gather.c:392
      exit_mmap+0x4b6/0xd40 mm/mmap.c:3300
      __mmput+0x115/0x3c0 kernel/fork.c:1345
      exit_mm+0x220/0x310 kernel/exit.c:569
      do_exit+0x99e/0x27e0 kernel/exit.c:865
      do_group_exit+0x207/0x2c0 kernel/exit.c:1027
      get_signal+0x176e/0x1850 kernel/signal.c:2907
      arch_do_signal_or_restart+0x96/0x860 arch/x86/kernel/signal.c:310
      exit_to_user_mode_loop kernel/entry/common.c:105 [inline]
      exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
      __syscall_exit_to_user_mode_work kernel/entry/common.c:201 [inline]
      syscall_exit_to_user_mode+0xc9/0x360 kernel/entry/common.c:212
      do_syscall_64+0x10a/0x240 arch/x86/entry/common.c:89
     entry_SYSCALL_64_after_hwframe+0x6d/0x75
    
    Memory state around the buggy address:
     ffff888028c6cc80: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc
     ffff888028c6cd00: fa fc fc fc fa fc fc fc 00 fc fc fc 06 fc fc fc
    >ffff888028c6cd80: fa fc fc fc fa fc fc fc fa fc fc fc 02 fc fc fc
                                                           ^
     ffff888028c6ce00: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc
     ffff888028c6ce80: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc
    
    Fixes: 423f38329d26 ("xsk: add umem fill queue support and mmap")
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: "Björn Töpel" <bjorn@kernel.org>
    Cc: Magnus Karlsson <magnus.karlsson@intel.com>
    Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
    Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/r/20240404202738.3634547-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit a9dca26b745e6ae1c75b1842a0556ec8bad42d97
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Fri Dec 10 21:29:59 2021 +0100

    u64_stats: Disable preemption on 32bit UP+SMP PREEMPT_RT during updates.
    
    [ Upstream commit 3c118547f87e930d45a5787e386734015dd93b32 ]
    
    On PREEMPT_RT the seqcount_t for synchronisation is required on 32bit
    architectures even on UP because the softirq (and the threaded IRQ handler) can
    be preempted.
    
    With the seqcount_t for synchronisation, a reader with higher priority can
    preempt the writer and then spin endlessly in read_seqcount_begin() while the
    writer can't make progress.
    
    To avoid such a lock up on PREEMPT_RT the writer must disable preemption during
    the update. There is no need to disable interrupts because no writer is using
    this API in hard-IRQ context on PREEMPT_RT.
    
    Disable preemption on 32bit-RT within the u64_stats write section.
    
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Stable-dep-of: 38a15d0a50e0 ("u64_stats: fix u64_stats_init() for lockdep when used repeatedly in one file")
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 11e04135b0873919504059644d2dda0d8432ff89
Author: Ilya Maximets <i.maximets@ovn.org>
Date:   Wed Apr 3 22:38:01 2024 +0200

    net: openvswitch: fix unwanted error log on timeout policy probing
    
    [ Upstream commit 4539f91f2a801c0c028c252bffae56030cfb2cae ]
    
    On startup, ovs-vswitchd probes different datapath features including
    support for timeout policies.  While probing, it tries to execute
    certain operations with OVS_PACKET_ATTR_PROBE or OVS_FLOW_ATTR_PROBE
    attributes set.  These attributes tell the openvswitch module to not
    log any errors when they occur as it is expected that some of the
    probes will fail.
    
    For some reason, setting the timeout policy ignores the PROBE attribute
    and logs a failure anyway.  This is causing the following kernel log
    on each re-start of ovs-vswitchd:
    
      kernel: Failed to associated timeout policy `ovs_test_tp'
    
    Fix that by using the same logging macro that all other messages are
    using.  The message will still be printed at info level when needed
    and will be rate limited, but with a net rate limiter instead of
    generic printk one.
    
    The nf_ct_set_timeout() itself will still print some info messages,
    but at least this change makes logging in openvswitch module more
    consistent.
    
    Fixes: 06bd2bdf19d2 ("openvswitch: Add timeout support to ct action")
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
    Acked-by: Eelco Chaudron <echaudro@redhat.com>
    Link: https://lore.kernel.org/r/20240403203803.2137962-1-i.maximets@ovn.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 8c820f7c8e9b46238d277c575392fe9930207aab
Author: Dan Carpenter <dan.carpenter@linaro.org>
Date:   Tue Apr 2 12:56:54 2024 +0300

    scsi: qla2xxx: Fix off by one in qla_edif_app_getstats()
    
    [ Upstream commit 4406e4176f47177f5e51b4cc7e6a7a2ff3dbfbbd ]
    
    The app_reply->elem[] array is allocated earlier in this function and it
    has app_req.num_ports elements.  Thus this > comparison needs to be >= to
    prevent memory corruption.
    
    Fixes: 7878f22a2e03 ("scsi: qla2xxx: edif: Add getfcinfo and statistic bsgs")
    Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
    Link: https://lore.kernel.org/r/5c125b2f-92dd-412b-9b6f-fc3a3207bd60@moroto.mountain
    Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 5562dbfcf59b40bd1c7936d21d85cc2fab81284f
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Thu Apr 4 18:02:25 2024 +0200

    nouveau: fix function cast warning
    
    [ Upstream commit 185fdb4697cc9684a02f2fab0530ecdd0c2f15d4 ]
    
    Calling a function through an incompatible pointer type causes breaks
    kcfi, so clang warns about the assignment:
    
    drivers/gpu/drm/nouveau/nvkm/subdev/bios/shadowof.c:73:10: error: cast from 'void (*)(const void *)' to 'void (*)(void *)' converts to incompatible function type [-Werror,-Wcast-function-type-strict]
       73 |         .fini = (void(*)(void *))kfree,
    
    Avoid this with a trivial wrapper.
    
    Fixes: c39f472e9f14 ("drm/nouveau: remove symlinks, move core/ to nvkm/ (no code changes)")
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: Danilo Krummrich <dakr@redhat.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20240404160234.2923554-1-arnd@kernel.org
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 8d278fc34cdd8a44e995fa93dfd31d619a2e1fe6
Author: Alex Constantino <dreaming.about.electric.sheep@gmail.com>
Date:   Thu Apr 4 19:14:48 2024 +0100

    Revert "drm/qxl: simplify qxl_fence_wait"
    
    [ Upstream commit 07ed11afb68d94eadd4ffc082b97c2331307c5ea ]
    
    This reverts commit 5a838e5d5825c85556011478abde708251cc0776.
    
    Changes from commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") would
    result in a '[TTM] Buffer eviction failed' exception whenever it reached a
    timeout.
    Due to a dependency to DMA_FENCE_WARN this also restores some code deleted
    by commit d72277b6c37d ("dma-buf: nuke DMA_FENCE_TRACE macros v2").
    
    Fixes: 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
    Link: https://lore.kernel.org/regressions/ZTgydqRlK6WX_b29@eldamar.lan/
    Reported-by: Timo Lindfors <timo.lindfors@iki.fi>
    Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1054514
    Signed-off-by: Alex Constantino <dreaming.about.electric.sheep@gmail.com>
    Signed-off-by: Maxime Ripard <mripard@kernel.org>
    Link: https://patchwork.freedesktop.org/patch/msgid/20240404181448.1643-2-dreaming.about.electric.sheep@gmail.com
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 42beda7db44f82d66c7bcb9a9fe13c3840af9f05
Author: Frank Li <Frank.Li@nxp.com>
Date:   Fri Mar 22 12:47:05 2024 -0400

    arm64: dts: imx8-ss-conn: fix usdhc wrong lpcg clock order
    
    [ Upstream commit c6ddd6e7b166532a0816825442ff60f70aed9647 ]
    
    The actual clock show wrong frequency:
    
       echo on >/sys/devices/platform/bus\@5b000000/5b010000.mmc/power/control
       cat /sys/kernel/debug/mmc0/ios
    
       clock:          200000000 Hz
       actual clock:   166000000 Hz
                       ^^^^^^^^^
       .....
    
    According to
    
    sdhc0_lpcg: clock-controller@5b200000 {
                    compatible = "fsl,imx8qxp-lpcg";
                    reg = <0x5b200000 0x10000>;
                    #clock-cells = <1>;
                    clocks = <&clk IMX_SC_R_SDHC_0 IMX_SC_PM_CLK_PER>,
                             <&conn_ipg_clk>, <&conn_axi_clk>;
                    clock-indices = <IMX_LPCG_CLK_0>, <IMX_LPCG_CLK_4>,
                                    <IMX_LPCG_CLK_5>;
                    clock-output-names = "sdhc0_lpcg_per_clk",
                                         "sdhc0_lpcg_ipg_clk",
                                         "sdhc0_lpcg_ahb_clk";
                    power-domains = <&pd IMX_SC_R_SDHC_0>;
            }
    
    "per_clk" should be IMX_LPCG_CLK_0 instead of IMX_LPCG_CLK_5.
    
    After correct clocks order:
    
       echo on >/sys/devices/platform/bus\@5b000000/5b010000.mmc/power/control
       cat /sys/kernel/debug/mmc0/ios
    
       clock:          200000000 Hz
       actual clock:   198000000 Hz
                       ^^^^^^^^
       ...
    
    Fixes: 16c4ea7501b1 ("arm64: dts: imx8: switch to new lpcg clock binding")
    Signed-off-by: Frank Li <Frank.Li@nxp.com>
    Signed-off-by: Shawn Guo <shawnguo@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit cc7b83f04b4393a63df2f0c24eaaef60c39facdb
Author: Nini Song <nini.song@mediatek.com>
Date:   Thu Jan 25 21:28:45 2024 +0800

    media: cec: core: remove length check of Timer Status
    
    commit ce5d241c3ad4568c12842168288993234345c0eb upstream.
    
    The valid_la is used to check the length requirements,
    including special cases of Timer Status. If the length is
    shorter than 5, that means no Duration Available is returned,
    the message will be forced to be invalid.
    
    However, the description of Duration Available in the spec
    is that this parameter may be returned when these cases, or
    that it can be optionally return when these cases. The key
    words in the spec description are flexible choices.
    
    Remove the special length check of Timer Status to fit the
    spec which is not compulsory about that.
    
    Signed-off-by: Nini Song <nini.song@mediatek.com>
    Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 75193678cce993aa959e7764b6df2f599886dd06
Author: Dmitry Antipov <dmantipov@yandex.ru>
Date:   Tue Apr 2 14:32:05 2024 +0300

    Bluetooth: Fix memory leak in hci_req_sync_complete()
    
    commit 45d355a926ab40f3ae7bc0b0a00cb0e3e8a5a810 upstream.
    
    In 'hci_req_sync_complete()', always free the previous sync
    request state before assigning reference to a new one.
    
    Reported-by: syzbot+39ec16ff6cc18b1d066d@syzkaller.appspotmail.com
    Closes: https://syzkaller.appspot.com/bug?extid=39ec16ff6cc18b1d066d
    Cc: stable@vger.kernel.org
    Fixes: f60cb30579d3 ("Bluetooth: Convert hci_req_sync family of function to new request API")
    Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
    Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 53e494b7bc432e1efb295b1cfecda1164401f3d5
Author: Steven Rostedt (Google) <rostedt@goodmis.org>
Date:   Tue Apr 9 15:13:09 2024 -0400

    ring-buffer: Only update pages_touched when a new page is touched
    
    commit ffe3986fece696cf65e0ef99e74c75f848be8e30 upstream.
    
    The "buffer_percent" logic that is used by the ring buffer splice code to
    only wake up the tasks when there's no data after the buffer is filled to
    the percentage of the "buffer_percent" file is dependent on three
    variables that determine the amount of data that is in the ring buffer:
    
     1) pages_read - incremented whenever a new sub-buffer is consumed
     2) pages_lost - incremented every time a writer overwrites a sub-buffer
     3) pages_touched - incremented when a write goes to a new sub-buffer
    
    The percentage is the calculation of:
    
      (pages_touched - (pages_lost + pages_read)) / nr_pages
    
    Basically, the amount of data is the total number of sub-bufs that have been
    touched, minus the number of sub-bufs lost and sub-bufs consumed. This is
    divided by the total count to give the buffer percentage. When the
    percentage is greater than the value in the "buffer_percent" file, it
    wakes up splice readers waiting for that amount.
    
    It was observed that over time, the amount read from the splice was
    constantly decreasing the longer the trace was running. That is, if one
    asked for 60%, it would read over 60% when it first starts tracing, but
    then it would be woken up at under 60% and would slowly decrease the
    amount of data read after being woken up, where the amount becomes much
    less than the buffer percent.
    
    This was due to an accounting of the pages_touched incrementation. This
    value is incremented whenever a writer transfers to a new sub-buffer. But
    the place where it was incremented was incorrect. If a writer overflowed
    the current sub-buffer it would go to the next one. If it gets preempted
    by an interrupt at that time, and the interrupt performs a trace, it too
    will end up going to the next sub-buffer. But only one should increment
    the counter. Unfortunately, that was not the case.
    
    Change the cmpxchg() that does the real switch of the tail-page into a
    try_cmpxchg(), and on success, perform the increment of pages_touched. This
    will only increment the counter once for when the writer moves to a new
    sub-buffer, and not when there's a race and is incremented for when a
    writer and its preempting writer both move to the same new sub-buffer.
    
    Link: https://lore.kernel.org/linux-trace-kernel/20240409151309.0d0e5056@gandalf.local.home
    
    Cc: stable@vger.kernel.org
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Fixes: 2c2b0a78b3739 ("ring-buffer: Add percentage of ring buffer full to wake up reader")
    Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
    Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 87b6af1a7683e021710c08fc0551fc078346032f
Author: Sven Eckelmann <sven@narfation.org>
Date:   Mon Feb 12 13:58:33 2024 +0100

    batman-adv: Avoid infinite loop trying to resize local TT
    
    commit b1f532a3b1e6d2e5559c7ace49322922637a28aa upstream.
    
    If the MTU of one of an attached interface becomes too small to transmit
    the local translation table then it must be resized to fit inside all
    fragments (when enabled) or a single packet.
    
    But if the MTU becomes too low to transmit even the header + the VLAN
    specific part then the resizing of the local TT will never succeed. This
    can for example happen when the usable space is 110 bytes and 11 VLANs are
    on top of batman-adv. In this case, at least 116 byte would be needed.
    There will just be an endless spam of
    
       batman_adv: batadv0: Forced to purge local tt entries to fit new maximum fragment MTU (110)
    
    in the log but the function will never finish. Problem here is that the
    timeout will be halved all the time and will then stagnate at 0 and
    therefore never be able to reduce the table even more.
    
    There are other scenarios possible with a similar result. The number of
    BATADV_TT_CLIENT_NOPURGE entries in the local TT can for example be too
    high to fit inside a packet. Such a scenario can therefore happen also with
    only a single VLAN + 7 non-purgable addresses - requiring at least 120
    bytes.
    
    While this should be handled proactively when:
    
    * interface with too low MTU is added
    * VLAN is added
    * non-purgeable local mac is added
    * MTU of an attached interface is reduced
    * fragmentation setting gets disabled (which most likely requires dropping
      attached interfaces)
    
    not all of these scenarios can be prevented because batman-adv is only
    consuming events without the the possibility to prevent these actions
    (non-purgable MAC address added, MTU of an attached interface is reduced).
    It is therefore necessary to also make sure that the code is able to handle
    also the situations when there were already incompatible system
    configuration are present.
    
    Cc: stable@vger.kernel.org
    Fixes: a19d3d85e1b8 ("batman-adv: limit local translation table max size")
    Reported-by: syzbot+a6a4b5bb3da165594cff@syzkaller.appspotmail.com
    Signed-off-by: Sven Eckelmann <sven@narfation.org>
    Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>