
[linux-nvidia-6.17-next] Add CXL Type-2 device support, RAS error handling, reset, and state save/restore#342

Open
JiandiAnNVIDIA wants to merge 143 commits into NVIDIA:24.04_linux-nvidia-6.17-next from JiandiAnNVIDIA:cxl_2026-03-04

Conversation


@JiandiAnNVIDIA JiandiAnNVIDIA commented Mar 12, 2026

Description

This patch series adds comprehensive CXL (Compute Express Link) support to the
nvidia-6.17 kernel, including:

  1. CXL Type-2 device support - Enables accelerator devices (like GPUs and
    SmartNICs) to use CXL for coherent memory access via firmware-provisioned
    regions
  2. CXL RAS (Reliability, Availability, Serviceability) error handling -
    Implements PCIe Port Protocol error handling and logging for CXL Root Ports,
    Downstream Switch Ports, and Upstream Switch Ports
  3. CXL DVSEC and HDM state save/restore - Preserves CXL DVSEC control/range
    registers and HDM decoder programming across PCI resets and link transitions,
    enabling device re-initialization after reset for firmware-provisioned
    configurations
  4. CXL Reset support - Implements the CXL Reset method (CXL Spec v3.2,
    Sections 8.1.3, 9.6, 9.7) via a sysfs interface for Type-2 devices,
    including memory offlining, cache flushing, multi-function sibling
    coordination, and DVSEC reset sequencing
  5. Multi-level interleaving fix - Supports firmware-configured CXL
    interleaving where lower levels use smaller granularities than parent ports
    (reverse HPA bit ordering)
  6. Prerequisite CXL and PCI driver updates - Cherry-picked commits from
    upstream torvalds/master covering the range from v6.17.9 to the merge
    point of Terry Bowman's v14 series into v7.0
  7. CXL DAX support - Enables direct memory access to CXL RAM regions and
    mapping CXL DAX devices as System-RAM

Key Features Added:

  • CXL Type-2 accelerator device registration and memory management
  • CXL region creation by Type-2 drivers
  • DPA (Device Physical Address) allocation interface for accelerators
  • HPA (Host Physical Address) free space enumeration
  • Multi-level CXL address translation (SPA↔HPA↔DPA)
  • CXL protocol error detection, forwarding, and recovery
  • CXL RAS error handling for Endpoints, RCH, and Switch Ports
    (replacing the old PCIEAER_CXL symbol with the new CXL_RAS def_bool)
  • CXL extended linear cache region support
  • CXL DVSEC and HDM decoder state save/restore across PCI resets
  • CXL Reset sysfs interface (/sys/bus/pci/devices/.../cxl_reset) for
    Type-2 devices with Reset Capable bit set
  • Multi-function sibling coordination during CXL reset via Non-CXL
    Function Map DVSEC
  • CPU cache flush using cpu_cache_invalidate_memregion() during reset
  • Multi-level interleaving with smaller granularities for lower decoder
    levels (firmware-provisioned configurations)
  • CXL DAX device access (DEV_DAX_CXL) and System-RAM mapping
    (DEV_DAX_KMEM)
  • CXL protocol error injection via APEI EINJ (ACPI_APEI_EINJ_CXL)

Justification

CXL Type-2 device support is critical for next-generation NVIDIA accelerators
and data center workloads:

  • Enables coherent memory sharing between CPUs and accelerators
  • Supports firmware-provisioned CXL regions for accelerator memory
  • Provides proper error handling and reporting for CXL fabric errors
  • Enables device reset and state recovery for CXL Type-2 devices
  • Preserves firmware-programmed DVSEC and HDM decoder state across resets
  • Required for upcoming NVIDIA hardware with CXL capabilities

Source

Patch Breakdown (139 patches + 1 revert + 3 config updates = 143 commits):

| # | Category | Count | Source |
|---|----------|-------|--------|
| 1 | Revert old CXL reset (f198764) | 1 | OOT (cleanup) |
| 2 | Upstream CXL/PCI prerequisite cherry-picks | 103 | Upstream torvalds/master (v6.17.9 → merge of Terry Bowman v14 into v7.0) |
| 3 | Smita Koralahalli's CXL EINJ series, v6 patch 3/9 | 1 | LKML (v6, not yet merged) |
| 4 | Alejandro Lucero's CXL Type-2 series v23 | 22 | LKML (v23, not yet merged) |
| 5 | Robert Richter's multi-level interleaving fix | 1 | LKML (v1, not yet merged) |
| 6 | Srirangan Madhavan's CXL state save/restore series | 5 | LKML (v1, not yet merged) |
| 7 | Srirangan Madhavan's CXL reset series | 7 | LKML (v5, not yet merged) |
| 8 | Config annotations update | 3 | OOT (build config) |
| | **TOTAL** | **143** | |

Notes on the upstream cherry-picks (item 2):

The 103 upstream commits span 1bfd0faa78d0 (v6.17.9) to
0da3050bdded (Merge of for-7.0/cxl-aer-prep into cxl-for-next).
This range includes 17 out of 34 patches from Terry Bowman's v14 series
that were reworked by the CXL maintainer and merged into v7.0 via the
for-7.0/cxl-aer-prep branch. The remaining 17 patches from Terry's v14
were refactored into v15 (9 patches, not yet merged) and are not included
in this port.

Notes on the save/restore and reset series (items 6–7):

Srirangan's patches were authored against upstream v7.0-rc1 (which does not
include Alejandro's v23 Type-2 series). For this port, the header
reorganization in patch 2/5 of the save/restore series was adapted to align
with Alejandro's v23 approach: HDM decoder and register map definitions were
moved to include/cxl/cxl.h (not include/cxl/pci.h as in the original
patch) to follow the convention established by Alejandro's series. Upstream
reviewers have indicated that Srirangan's series should be rebased on top of
Alejandro's once it merges.


Upstream Status:

| Series | Status |
|--------|--------|
| 103 upstream cherry-picks | ✅ Merged in torvalds/master (v7.0 range) |
| Terry Bowman v14 (17 patches) | ✅ Merged into v7.0 via for-7.0/cxl-aer-prep |
| Terry Bowman v15 (9 patches) | ⏳ Under review, not needed for this port |
| Smita v6 patch 3/9 | ⏳ Under review, not yet merged |
| Alejandro v23 (22 patches) | ⏳ Under review, not yet merged |
| Robert Richter v1 (1 patch) | ⏳ Under review, not yet merged |
| Srirangan save/restore (5 patches) | ⏳ Under review, not yet merged |
| Srirangan cxl_reset v5 (7 patches) | ⏳ Under review, not yet merged |

Testing

Build Validation:

  • Built successfully for ARM64 4K page size kernel
  • Built successfully for ARM64 64K page size kernel

Config Verification:

CXL-related configs enabled as expected:

```
CONFIG_ACPI_APEI_EINJ_CXL=y
CONFIG_PCI_CXL=y
CONFIG_CXL_BUS=y
CONFIG_CXL_PCI=y
CONFIG_CXL_MEM_RAW_COMMANDS=y
CONFIG_CXL_ACPI=m
CONFIG_CXL_PMEM=m
CONFIG_CXL_MEM=y
CONFIG_CXL_FEATURES=y
# CONFIG_CXL_EDAC_MEM_FEATURES is not set
CONFIG_CXL_PORT=y
CONFIG_CXL_SUSPEND=y
CONFIG_CXL_REGION=y
# CONFIG_CXL_REGION_INVALIDATION_TEST is not set
CONFIG_CXL_RAS=y
# CONFIG_CACHEMAINT_FOR_HOTPLUG is not set
# CONFIG_SFC_CXL is not set
CONFIG_CXL_PMU=m
CONFIG_DEV_DAX=y
CONFIG_DEV_DAX_PMEM=m
CONFIG_DEV_DAX_HMEM=m
CONFIG_DEV_DAX_CXL=y
CONFIG_DEV_DAX_HMEM_DEVICES=y
CONFIG_DEV_DAX_KMEM=y
CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION=y
CONFIG_GENERIC_CPU_CACHE_MAINTENANCE=y
```

Runtime Testing:

  • Boot test on ARM64 system
  • CXL device enumeration test (ls /sys/bus/cxl/devices/)
  • CXL reset test (echo 1 > /sys/bus/pci/devices/<dev>/cxl_reset)
  • DVSEC save/restore verified (CXLCtl, Range registers preserved)

Notes

  • CONFIG_PCIEAER_CXL has been removed from Kconfig by upstream commit
    d18f1b7beadf (PCI/AER: Replace PCIEAER_CXL symbol with CXL_RAS).
    The debian.master annotation for PCIEAER_CXL=y is overridden to -
    in debian.nvidia-6.17/config/annotations.
  • CONFIG_CXL_BUS, CONFIG_CXL_PCI, CONFIG_CXL_MEM, CONFIG_CXL_PORT
    remain tristate (not bool) — the v14 series kept them as tristate,
    unlike earlier draft versions.
  • CONFIG_DEV_DAX, CONFIG_DEV_DAX_CXL, and CONFIG_DEV_DAX_KMEM are
    overridden from m (debian.master default) to y to support built-in
    CXL RAM region DAX access and System-RAM mapping.
  • CONFIG_PCI_CXL is a new hidden bool introduced by the save/restore
    series; auto-enabled when CXL_BUS=y. Gates compilation of
    drivers/pci/cxl.o for DVSEC and HDM state save/restore.
  • CONFIG_GENERIC_CPU_CACHE_MAINTENANCE and
    CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION are new configs
    introduced by the upstream cherry-picks; arm64 auto-selects both.
    cpu_cache_invalidate_memregion() is also used by the CXL reset
    series for cache flushing during reset.
  • Kernel config annotations updated in debian.nvidia-6.17/config/annotations
    to reflect all of the above changes.
  • Srirangan's save/restore series header reorganization was adapted to
    align with Alejandro's v23 approach (include/cxl/cxl.h instead of
    include/cxl/pci.h). See commit message on patch 2/5 for details.
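
Several notes above concern hidden or auto-selected Kconfig symbols (CONFIG_PCI_CXL, the CXL_RAS def_bool). As a rough illustration of the pattern — this is a sketch, not the actual upstream hunk — a prompt-less bool that tracks another symbol looks like:

```
# Illustrative Kconfig fragment only. A bool with no prompt string is
# hidden from menuconfig; it takes the value of its def_bool expression,
# so it is enabled automatically whenever CXL_BUS is built in.
config PCI_CXL
	def_bool CXL_BUS=y
```

Because the symbol cannot be toggled by users, the annotations file only needs to record its expected derived value.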
JiandiAnNVIDIA and others added 30 commits March 6, 2026 02:36
This reverts commit f198764. The CXL reset implementation is being reverted to allow "NVIDIA: VR: SAUCE: CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h" to apply cleanly. The reset functionality will be replaced by the version currently being pursued upstream. Signed-off-by: Jiandi An <jan@nvidia.com>
Use the string choice helper function str_plural() to simplify the code. Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Link: https://patch.msgid.link/20250811122519.543554-1-zhao.xichao@vivo.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit 22fb4ad) Signed-off-by: Jiandi An <jan@nvidia.com>
Replace ternary operator with str_enabled_disabled() helper to enhance code readability and consistency. [dj: Fix spelling in commit log and subject. ] Signed-off-by: Nai-Chen Cheng <bleach1827@gmail.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Link: https://patch.msgid.link/20250812-cxl-region-string-choices-v1-1-50200b0bc782@gmail.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit 733c4e9) Signed-off-by: Jiandi An <jan@nvidia.com>
The root decoder's HPA to SPA translation logic was implemented using a single function pointer. In preparation for additional per-decoder callbacks, convert this into a struct cxl_rd_ops and move the hpa_to_spa pointer into it. To avoid maintaining a static ops instance populated with mostly NULL pointers, allocate the ops structure dynamically only when a platform requires overrides (e.g. XOR interleave decoding). The setup can be extended as additional callbacks are added. Co-developed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/818530c82c351a9c0d3a204f593068dd2126a5a9.1754290144.git.alison.schofield@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit 524b2b7) Signed-off-by: Jiandi An <jan@nvidia.com>
When DPA->SPA translation was introduced, it included a helper that applied the XOR maps to do the CXL HPA -> SPA translation for XOR region interleaves. In preparation for adding SPA->DPA address translation, introduce the reverse callback. The root decoder callback is defined generically and not all usages may be self inverting like this XOR function. Add another root decoder callback that is the spa_to_hpa function. Update the existing cxl_xor_hpa_to_spa() with a name that reflects what it does without directionality: cxl_apply_xor_maps(), a generic parameter: addr replaces hpa, and code comments stating that the function supports the translation in either direction. Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/79d9d72230c599cae94d7221781ead6392ae6d3f.1754290144.git.alison.schofield@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit b83ee96) Signed-off-by: Jiandi An <jan@nvidia.com>
Add infrastructure to translate System Physical Addresses (SPA) to Device Physical Addresses (DPA) within CXL regions. This capability will be used by follow-on patches that add poison inject and clear operations at the region level. The SPA-to-DPA translation process follows these steps: 1. Apply root decoder transformations (SPA to HPA) if configured. 2. Extract the position in region interleave from the HPA offset. 3. Extract the DPA offset from the HPA offset. 4. Use position to find endpoint decoder. 5. Use endpoint decoder to find memdev and calculate DPA from offset. 6. Return the result - a memdev and a DPA. It is Step 1 above that makes this a driver level operation and not work we can push to user space. Rather than exporting the XOR maps for root decoders configured with XOR interleave, the driver performs this complex calculation for the user. Steps 2 and 3 follow the CXL Spec 3.2 Section 8.2.4.20.13 Implementation Note: Device Decode Logic. These calculations mirror much of the logic introduced earlier in DPA to SPA translation, see cxl_dpa_to_hpa(), where the driver needed to reverse the spec defined 'Device Decode Logic'. Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/422f0e27742c6ca9a11f7cd83e6ba9fa1a8d0c74.1754290144.git.alison.schofield@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit dc18117) Signed-off-by: Jiandi An <jan@nvidia.com>
The core functions that validate and send inject and clear commands to the memdev devices require holding both the dpa_rwsem and the region_rwsem. In preparation for another caller of these functions that must hold the locks upon entry, split the work into a locked and unlocked pair. Consideration was given to moving the locking to both callers, however, the existing caller is not in the core (mem.c) and cannot access the locks. Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/1d601f586975195733984ca63d1b5789bbe8690f.1754290144.git.alison.schofield@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit 25a0207) Signed-off-by: Jiandi An <jan@nvidia.com>
Add CXL region debugfs attributes to inject and clear poison based on an offset into the region. These new interfaces allow users to operate on poison at the region level without needing to resolve Device Physical Addresses (DPA) or target individual memdevs. The implementation uses a new helper, region_offset_to_dpa_result() that applies decoder interleave logic, including XOR-based address decoding when applicable. Note that XOR decodes rely on driver internal xormaps which are not exposed to userspace. So, this support is not only a simplification of poison operations that could be done using existing per memdev operations, but also it enables this functionality for XOR interleaved regions for the first time. New debugfs attributes are added in /sys/kernel/debug/cxl/regionX/: inject_poison and clear_poison. These are only exposed if all memdevs participating in the region support both inject and clear commands, ensuring consistent and reliable behavior across multi-device regions. If tracing is enabled, these operations are logged as cxl_poison events in /sys/kernel/tracing/trace. The ABI documentation warns users of the significant risks that come with using these capabilities. A CXL Maturity Map update shows this user flow is now supported. Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/f3fd8628ab57ea79704fb2d645902cd499c066af.1754290144.git.alison.schofield@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit c3dd676) Signed-off-by: Jiandi An <jan@nvidia.com>
…fset() 0day reported warnings of: drivers/cxl/core/region.c:3664:25: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 4 has type 'resource_size_t' {aka 'unsigned int'} [-Wformat=] drivers/cxl/core/region.c:3671:37: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 4 has type 'resource_size_t' {aka 'unsigned int'} [-Wformat=] Replace %#llx with %pr to emit resource_size_t arguments. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202508160513.NAZ9i9rQ-lkp@intel.com/ Cc: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Link: https://patch.msgid.link/20250818153953.3658952-1-dave.jiang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit e6a9530) Signed-off-by: Jiandi An <jan@nvidia.com>
Add clarification to comment for memory hotplug callback ordering as the current comment does not provide clear language on which callback happens first. Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/20250829222907.1290912-2-dave.jiang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit 6512886) Signed-off-by: Jiandi An <jan@nvidia.com>
Add helper function node_update_perf_attrs() to allow update of node access coordinates computed by an external agent such as CXL. The helper allows updating of coordinates after the attribute being created by HMAT. Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/20250829222907.1290912-3-dave.jiang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit b57fc65) Signed-off-by: Jiandi An <jan@nvidia.com>
…ough HMAT The current implementation of CXL memory hotplug notifier gets called before the HMAT memory hotplug notifier. The CXL driver calculates the access coordinates (bandwidth and latency values) for the CXL end to end path (i.e. CPU to endpoint). When the CXL region is onlined, the CXL memory hotplug notifier writes the access coordinates to the HMAT target structs. Then the HMAT memory hotplug notifier is called and it creates the access coordinates for the node sysfs attributes. During testing on an Intel platform, it was found that although the newly calculated coordinates were pushed to sysfs, the sysfs attributes for the access coordinates showed up with the wrong initiator. The system has 4 nodes (0, 1, 2, 3) where node 0 and 1 are CPU nodes and node 2 and 3 are CXL nodes. The expectation is that node 2 would show up as a target to node 0: /sys/devices/system/node/node2/access0/initiators/node0 However it was observed that node 2 showed up as a target under node 1: /sys/devices/system/node/node2/access0/initiators/node1 The original intent of the 'ext_updated' flag in HMAT handling code was to stop HMAT memory hotplug callback from clobbering the access coordinates after CXL has injected its calculated coordinates and replaced the generic target access coordinates provided by the HMAT table in the HMAT target structs. However the flag is hacky at best and blocks the updates from other CXL regions that are onlined in the same node later on. Remove the 'ext_updated' flag usage and just update the access coordinates for the nodes directly without touching HMAT target data. The hotplug memory callback ordering is changed. Instead of changing CXL, move HMAT back so there's room for the levels rather than have CXL share the same level as SLAB_CALLBACK_PRI. The change will resulting in the CXL callback to be executed after the HMAT callback. With the change, the CXL hotplug memory notifier runs after the HMAT callback. 
The HMAT callback will create the node sysfs attributes for access coordinates. The CXL callback will write the access coordinates to the now created node sysfs attributes directly and will not pollute the HMAT target values. A nodemask is introduced to keep track if a node has been updated and prevents further updates. Fixes: 067353a ("cxl/region: Add memory hotplug notifier for cxl region") Cc: stable@vger.kernel.org Tested-by: Marc Herbert <marc.herbert@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/20250829222907.1290912-4-dave.jiang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit 2e454fb) Signed-off-by: Jiandi An <jan@nvidia.com>
Remove deadcode since CXL no longer calls hmat_update_target_coordinates(). Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/20250829222907.1290912-5-dave.jiang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit e99ecbc) Signed-off-by: Jiandi An <jan@nvidia.com>
Fixed the following typo errors intersparsed ==> interspersed in Documentation/driver-api/cxl/platform/bios-and-efi.rst Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Gregory Price <gourry@gourry.net> Link: https://patch.msgid.link/20250818175335.5312-1-rakuram.e96@gmail.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit a414408) Signed-off-by: Jiandi An <jan@nvidia.com>
ACPICA commit 710745713ad3a2543dbfb70e84764f31f0e46bdc This has been renamed in more recent CXL specs, as type3 (memory expanders) can also use HDM-DB for device coherent memory. Link: acpica/acpica@7107457 Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Link: https://patch.msgid.link/20250908160034.86471-1-dave@stgolabs.net Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit c427290) Signed-off-by: Jiandi An <jan@nvidia.com>
…olution Add documentation on how to resolve conflicts between CXL Fixed Memory Windows, Platform Low Memory Holes, intermediate Switch and Endpoint Decoders. [dj]: Fixed inconsistent spacing after '.' [dj]: Fixed subject line from Alison. [dj]: Removed '::' before table from Bagas. Reviewed-by: Gregory Price <gourry@gourry.net> Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit c5dca38) Signed-off-by: Jiandi An <jan@nvidia.com>
Add a helper to replace the open code detection of CXL device hierarchy root, or the host bridge. The helper will be used for delayed downstream port (dport) creation. Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Li Ming <ming.li@zohomail.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Robert Richter <rrichter@amd.com> Tested-by: Robert Richter <rrichter@amd.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit 4fde895) Signed-off-by: Jiandi An <jan@nvidia.com>
Refactor the code in reap_dports() out to provide a helper function that reaps a single dport. This will be used later in the cleanup path for allocating a dport. Renaming to del_port() and del_dports() to mirror devm_cxl_add_dport(). [dj] Fixed up subject per Robert Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Li Ming <ming.li@zohomail.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Robert Richter <rrichter@amd.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit 8330671) Signed-off-by: Jiandi An <jan@nvidia.com>
Add a cached copy of the hardware port-id list that is available at init before all @DPORT objects have been instantiated. Change is in preparation of delayed dport instantiation. Reviewed-by: Robert Richter <rrichter@amd.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Tested-by: Robert Richter <rrichter@amd.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit 02edab6) Signed-off-by: Jiandi An <jan@nvidia.com>
Group the decoder setup code in switch and endpoint port probe into a single function for each to reduce the number of functions to be mocked in cxl_test. Introduce devm_cxl_switch_port_decoders_setup() and devm_cxl_endpoint_decoders_setup(). These two functions will be mocked instead with some functions optimized out since the mock version does not do anything. Remove devm_cxl_setup_hdm(), devm_cxl_add_passthrough_decoder(), and devm_cxl_enumerate_decoders() in cxl_test mock code. In turn, mock_cxl_add_passthrough_decoder() can be removed since cxl_test does not setup passthrough decoders. __wrap_cxl_hdm_decode_init() and __wrap_cxl_dvsec_rr_decode() can be removed as well since they only return 0 when called. [dj: drop 'struct cxl_port' forward declaration (Robert)] Suggested-by: Robert Richter <rrichter@amd.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Robert Richter <rrichter@amd.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit 68d5d97) Signed-off-by: Jiandi An <jan@nvidia.com>
The current implementation enumerates the dports during the cxl_port driver probe. Without an endpoint connected, the dport may not be active during port probe. This scheme may prevent a valid hardware dport id to be retrieved and MMIO registers to be read when an endpoint is hot-plugged. Move the dport allocation and setup to behind memdev probe so the endpoint is guaranteed to be connected. In the original enumeration behavior, there are 3 phases (or 2 if no CXL switches) for port creation. cxl_acpi() creates a Root Port (RP) from the ACPI0017.N device. Through that it enumerates downstream ports composed of ACPI0016.N devices through add_host_bridge_dport(). Once done, it uses add_host_bridge_uport() to create the ports that enumerate the PCI RPs as the dports of these ports. Every time a port is created, the port driver is attached, cxl_switch_porbe_probe() is called and devm_cxl_port_enumerate_dports() is invoked to enumerate and probe the dports. The second phase is if there are any CXL switches. When the pci endpoint device driver (cxl_pci) calls probe, it will add a mem device and triggers the cxl_mem_probe(). cxl_mem_probe() calls devm_cxl_enumerate_ports() and attempts to discovery and create all the ports represent CXL switches. During this phase, a port is created per switch and the attached dports are also enumerated and probed. The last phase is creating endpoint port which happens for all endpoint devices. The new sequence is instead of creating all possible dports at initial port creation, defer port instantiation until a memdev beneath that dport arrives. Introduce devm_cxl_create_or_extend_port() to centralize the creation and extension of ports with new dports as memory devices arrive. As part of this rework, switch decoder target list is amended at runtime as dports show up. While the decoders are allocated during the port driver probe, The decoders must also be updated since previously they were setup when all the dports are setup. 
Now every time a dport is setup per endpoint, the switch target listing need to be updated with new dport. A guard(rwsem_write) is used to update decoder targets. This is similar to when decoder_populate_target() is called and the decoder programming must be protected. Also the port registers are probed the first time when the first dport shows up. This ensures that the CXL link is established when the port registers are probed. [dj] Use ERR_CAST() (Jonathan) Link: https://lore.kernel.org/linux-cxl/20250305100123.3077031-1-rrichter@amd.com/ Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit 4f06d81) Signed-off-by: Jiandi An <jan@nvidia.com>
devm_cxl_add_dport_by_dev() outside of cxl_test is done through PCI hierarchy. However with cxl_test, it needs to be done through the platform device hierarchy. Add the mock function for devm_cxl_add_dport_by_dev(). When cxl_core calls a cxl_core exported function and that function is mocked by cxl_test, the call chain causes a circular dependency issue. Dan provided a workaround to avoid this issue. Apply the method to changes from the late dport allocation changes in order to enable cxl-test. In cxl_core they are defined with "__" added in front of the function. A macro is used to define the original function names for when non-test version of the kernel is built. A bit of macros and typedefs are used to allow mocking of those functions in cxl_test. Co-developed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Li Ming <ming.li@zohomail.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Tested-by: Robert Richter <rrichter@amd.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit d96eb90) Signed-off-by: Jiandi An <jan@nvidia.com>
…tup() With devm_cxl_switch_port_decoders_setup() being called within cxl_core instead of by the port driver probe, adjustments are needed to deal with circular symbol dependency when this function is being mock'd. Add the appropriate changes to get around the circular dependency. Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit 644685a) Signed-off-by: Jiandi An <jan@nvidia.com>
cxl_test uses mock functions for decoder enumeration. Add initialization of the cxld->target_map[] for cxl_test based decoders in the mock functions. Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Tested-by: Robert Richter <rrichter@amd.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit 87439b5) Signed-off-by: Jiandi An <jan@nvidia.com>
While cxl_switch_parse_cdat() is harmless to be run multiple times, it is not efficient in the current scheme where one dport is being updated at a time by the memdev probe path. Change the input parameter to the specific dport being updated to pick up the SSLBIS information for just that dport. Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Li Ming <ming.li@zohomail.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Robert Richter <rrichter@amd.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit d64035a) Signed-off-by: Jiandi An <jan@nvidia.com>
This patch moves the port register setup to when the first dport appears via the memdev probe path. At this point, the CXL link should be established and the register access is expected to succeed. This change addresses an error message observed when PCIe hotplug is enabled on an Intel platform. The error messages "cxl portN: Couldn't locate the CXL.cache and CXL.mem capability array header" is observed for the host bridge (CHBCR) during cxl_acpi driver probe. If the cxl_acpi module probe is running before the CXL link between the endpoint device and the RP is established, then the platform may not have exposed DVSEC ID 3 and/or DVSEC ID 7 blocks which will trigger the error message. This behavior is defined by the CXL spec r3.2 9.12.3 for RPs and DSPs, however the Intel platform also added this behavior to the host bridge. This change also needs the dport enumeration to be moved to the memdev probe path in order to address the issue. This change is not a wholly contained solution by itself. [dj: Add missing var init during port alloc] Suggested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Tested-by: Robert Richter <rrichter@amd.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit f6ee249) Signed-off-by: Jiandi An <jan@nvidia.com>
port->nr_dports represents how many dports have been added to a CXL port. It is incremented in add_dport() when a new dport is added to the port, but it is not decremented when a dport is removed. Currently, adding the first dport to a CXL port triggers component register setup on that port, and the implementation uses port->nr_dports to decide whether a dport is the first one. A corner case is that adding a dport can fail after port->nr_dports has been updated but before port->nr_dports is checked for component register setup. If the failure happens while the first dport is being attached, the CXL subsystem never gets a chance to execute component register setup for the port. The failure flow looks like this:

port->nr_dports = 0
dport 1 adding to the port:
	add_dport()			# port->nr_dports: 1
	failed on devm_add_action_or_reset() or sysfs_create_link()
	return error			# port->nr_dports: 1
dport 2 adding to the port:
	add_dport()			# port->nr_dports: 2
	no failure
	skip component registers setup because port->nr_dports is 2

The solution is to move component register setup closer to add_dport(), so that if add_dport() executes correctly for the first dport, component register setup on the port is executed immediately after that.

Fixes: f6ee249 ("cxl: Move port register setup to when first dport appear")
Signed-off-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
(cherry picked from commit 02e7567)
Signed-off-by: Jiandi An <jan@nvidia.com>
KASAN reports a stack-out-of-bounds access in validate_region_offset() while running the cxl-poison.sh unit test, because the printk %pr format specifier does not match the resource_size_t type of the variables. %pr expects a struct resource pointer and attempts to dereference the structure's fields, reading beyond the bounds of the stack variables. Since these messages emit an 'A exceeds B' type of message, keep the resource_size_t's and use the architecture-safe %pa specifier.

BUG: KASAN: stack-out-of-bounds in resource_string.isra.0+0xe9a/0x1690
[] Read of size 8 at addr ffff88800a7afb40 by task bash/1397
...
[] The buggy address belongs to stack of task bash/1397
[] and is located at offset 56 in frame:
[] validate_region_offset+0x0/0x1c0 [cxl_core]

Fixes: c3dd676 ("cxl/region: Add inject and clear poison by region offset") Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit 257c4b0) Signed-off-by: Jiandi An <jan@nvidia.com>
The HPA to DPA translation for poison injection assumes that the base address starts from where the CXL region begins. When the extended linear cache is active, the offset can be within the DRAM region. Adjust the offset so that it correctly reflects the offset within the CXL region. [ dj: Add fixes tag from Alison ] Fixes: c3dd676 ("cxl/region: Add inject and clear poison by region offset") Link: https://patch.msgid.link/20251031173224.3537030-5-dave.jiang@intel.com Reviewed-by: Alison Schofield <alison.schofield@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit b6cfddd) Signed-off-by: Jiandi An <jan@nvidia.com>
The node/zone quirk section of the cxl documentation is incorrect. The actual reason for fallback allocation misbehavior in the described configuration is due to a kswapd/reclaim thrashing scenario fixed by the linked patch. Remove this section. Link: https://lore.kernel.org/linux-mm/20250919162134.1098208-1-hannes@cmpxchg.org/ Signed-off-by: Gregory Price <gourry@gourry.net> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> (cherry picked from commit 82b5d7e) Signed-off-by: Jiandi An <jan@nvidia.com>
alucerop and others added 21 commits March 24, 2026 11:58
Region creation based on Type3 devices is triggered from user space, allowing memory combination through interleaving. In preparation for kernel-driven region creation, that is, Type2 drivers triggering region creation backed by their advertised CXL memory, factor out a common helper from the user-sysfs region setup for interleave granularity. Signed-off-by: Alejandro Lucero <alucerop@amd.com> Reviewed-by: Zhi Wang <zhiw@nvidia.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> (backported from https://lore.kernel.org/linux-cxl/20260201155438.2664640-1-alejandro.lucero-palau@amd.com/) Signed-off-by: Jiandi An <jan@nvidia.com>
Creating a CXL region requires userspace intervention through the cxl sysfs files. Type2 support should allow accelerator drivers to create such a CXL region from kernel code. Add that functionality and integrate it with the current support for memory expanders. Based on https://lore.kernel.org/linux-cxl/168592159835.1948938.1647215579839222774.stgit@dwillia2-xfh.jf.intel.com/ Signed-off-by: Alejandro Lucero <alucerop@amd.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> (backported from https://lore.kernel.org/linux-cxl/20260201155438.2664640-1-alejandro.lucero-palau@amd.com/) [jan: Resolve minor conflict due to code lines shift] Signed-off-by: Jiandi An <jan@nvidia.com>
By definition a type2 cxl device will use the host managed memory for specific functionality, therefore it should not be available to other uses. Signed-off-by: Alejandro Lucero <alucerop@amd.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Davidlohr Bueso <daves@stgolabs.net> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com> (backported from https://lore.kernel.org/linux-cxl/20260201155438.2664640-1-alejandro.lucero-palau@amd.com/) Signed-off-by: Jiandi An <jan@nvidia.com>
Use cxl api for creating a region using the endpoint decoder related to a DPA range. Signed-off-by: Alejandro Lucero <alucerop@amd.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> (backported from https://lore.kernel.org/linux-cxl/20260201155438.2664640-1-alejandro.lucero-palau@amd.com/) Signed-off-by: Jiandi An <jan@nvidia.com>
A PIO buffer is a region of device memory to which the driver can write a packet for TX, with the device handling the transmit doorbell without requiring a DMA to fetch the packet data, which helps reduce latency in certain exchanges. With the CXL.mem protocol this latency can be lowered further. With a device supporting CXL and successfully initialised, use the CXL region to map the memory range and use this mapping for PIO buffers. Also disable those CXL-based PIO buffers if the CXL code invokes the callback for potential CXL endpoint removal. Signed-off-by: Alejandro Lucero <alucerop@amd.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> (backported from https://lore.kernel.org/linux-cxl/20260201155438.2664640-1-alejandro.lucero-palau@amd.com/) Signed-off-by: Jiandi An <jan@nvidia.com>
…smaller granularities for lower levels

The CXL specification supports multi-level interleaving "as long as all the levels use different, but consecutive, HPA bits to select the target and no Interleave Set has more than 8 devices" (from 3.2). Currently the kernel expects that a decoder's "interleave granularity is a multiple of @parent_port granularity". That is, the granularity of a lower level is bigger than that of the parent and uses the outer HPA bits as selector. It works e.g. for the following 8-way config:

* cross-link (cross-hostbridge config in CFMWS):
  * 4-way
  * 256 granularity
  * Selector: HPA[8:9]
* sub-link (CXL Host bridge config of the HDM):
  * 2-way
  * 1024 granularity
  * Selector: HPA[10]

Now, if the outer HPA bits are used for the cross-hostbridge, an 8-way config could look like this:

* cross-link (cross-hostbridge config in CFMWS):
  * 4-way
  * 512 granularity
  * Selector: HPA[9:10]
* sub-link (CXL Host bridge config of the HDM):
  * 2-way
  * 256 granularity
  * Selector: HPA[8]

The enumeration of decoders for this configuration then fails with the following error:

cxl region0: pci0000:00:port1 cxl_port_setup_targets expected iw: 2 ig: 1024 [mem 0x10000000000-0x1ffffffffff flags 0x200]
cxl region0: pci0000:00:port1 cxl_port_setup_targets got iw: 2 ig: 256 state: enabled 0x10000000000:0x1ffffffffff
cxl_port endpoint12: failed to attach decoder12.0 to region0: -6

Note that this happens only if firmware is setting up the decoders (CXL_REGION_F_AUTO). For userspace region assembly the granularities are chosen to increase from root down to the lower levels. That is, outer HPA bits are always used for lower interleaving levels.

Rework the implementation to also support multi-level interleaving with smaller granularities for lower levels. Determine the interleave set of autodetected decoders. Check that it is a subset of the root interleave. The HPA selector bits are extracted for all decoders of the set and checked that there is no overlap and that the bits are consecutive.
All decoders can be programmed now to use any bit range within the region's target selector. Signed-off-by: Robert Richter <rrichter@amd.com> (backported from https://lore.kernel.org/all/20251028094754.72816-1-rrichter@amd.com/) [jan: Resolved minor conflicts] Signed-off-by: Jiandi An <jan@nvidia.com>
PCI: Add CXL DVSEC control, lock, and range register definitions

Add register offset and field definitions for CXL DVSEC registers needed by CXL state save/restore across resets:

- CTRL2 (offset 0x10) and LOCK (offset 0x14) registers
- CONFIG_LOCK bit in the LOCK register
- RWL (read-write-when-locked) field masks for CTRL and range base registers

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
… to include/cxl/cxl.h Move CXL HDM decoder register defines, register map structs (cxl_reg_map, cxl_component_reg_map, cxl_device_reg_map, cxl_pmu_reg_map, cxl_register_map), cxl_hdm_decoder_count(), enum cxl_regloc_type, and cxl_find_regblock()/cxl_setup_regs() declarations from internal CXL headers to include/cxl/pci.h. This makes them accessible to code outside the CXL subsystem, in particular the PCI core CXL state save/restore support added in a subsequent patch. No functional change. Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com> (backported from https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/) [jan: Resolve conflicts by moving certain definitions to include/cxl/cxl.h instead of to include/cxl/pci.h to align with its dependency of Alejandro's series] Signed-off-by: Jiandi An <jan@nvidia.com>
…state Add pci_add_virtual_ext_cap_save_buffer() to allocate save buffers using virtual cap IDs (above PCI_EXT_CAP_ID_MAX) that don't require a real capability in config space. The existing pci_add_ext_cap_save_buffer() cannot be used for CXL DVSEC state because it calls pci_find_saved_ext_cap() which searches for a matching capability in PCI config space. The CXL state saved here is a synthetic snapshot (DVSEC+HDM) and should not be tied to a real extended-cap instance. A virtual extended-cap save buffer API (cap IDs above PCI_EXT_CAP_ID_MAX) allows PCI to track this state without a backing config space capability. Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com> (backported from https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/) Signed-off-by: Jiandi An <jan@nvidia.com>
Save and restore CXL DVSEC control registers (CTRL, CTRL2), range base registers, and lock state across PCI resets. When the DVSEC CONFIG_LOCK bit is set, certain DVSEC fields become read-only and hardware may have updated them; blindly restoring saved values would either be silently ignored or conflict with hardware state. Instead, a read-merge-write approach is used: current hardware values are read for the RWL (read-write-when-locked) fields and merged with the saved state, so only writable bits are restored while locked bits retain their hardware values. This is hooked into pci_save_state()/pci_restore_state() so that all PCI reset paths automatically preserve the CXL DVSEC configuration. Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com> (backported from https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/) [jan: Resolve minor conflict in drivers/pci/Makefile due to code line shifts] Signed-off-by: Jiandi An <jan@nvidia.com>
Save and restore CXL HDM decoder registers (global control, per-decoder base/size/target-list, and commit state) across PCI resets. On restore, decoders that were committed are reprogrammed and recommitted with a 10ms timeout. Locked decoders that are already committed are skipped, since their state is protected by hardware and reprogramming them would fail. The Register Locator DVSEC is parsed directly via PCI config space reads rather than calling cxl_find_regblock()/cxl_setup_regs(), since this code lives in the PCI core and must not depend on CXL module symbols. MSE is temporarily enabled during save/restore to allow MMIO access to the HDM decoder register block. Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com> (backported from https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/) [jan: Include <cxl/cxl.h> in drivers/pci/cxl.c due to conflict resolution in "4acbc27592b8 NVIDIA: VR: SAUCE: cxl: Move HDM decoder and register map definitions to include/cxl/cxl.h"] Signed-off-by: Jiandi An <jan@nvidia.com>
…efinitions

Add CXL DVSEC register definitions needed for CXL device reset per CXL r3.2 section 8.1.3.1:

- Capability bits: RST_CAPABLE, CACHE_CAPABLE, CACHE_WBI_CAPABLE, RST_TIMEOUT, RST_MEM_CLR_CAPABLE
- Control2 register: DISABLE_CACHING, INIT_CACHE_WBI, INIT_CXL_RST, RST_MEM_CLR_EN
- Status2 register: CACHE_INV, RST_DONE, RST_ERR
- Non-CXL Function Map DVSEC register offset

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
[jan: Resolve conflicts where PCI_DVSEC_CXL_CACHE_CAPABLE is already added by "72bd823fb4f1 NVIDIA: VR: SAUCE: PCI: Allow ATS to be always on for CXL.cache capable devices"]
Signed-off-by: Jiandi An <jan@nvidia.com>
…_restore() Export pci_dev_save_and_disable() and pci_dev_restore() so that subsystems performing non-standard reset sequences (e.g. CXL) can reuse the PCI core standard pre/post reset lifecycle: driver reset_prepare/reset_done callbacks, PCI config space save/restore, and device disable/re-enable. Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com> (backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/) Signed-off-by: Jiandi An <jan@nvidia.com>
Add infrastructure for quiescing the CXL data path before reset:

- Memory offlining: check if CXL-backed memory is online and offline it via offline_and_remove_memory() before reset, per the CXL spec requirement to quiesce all CXL.mem transactions before issuing CXL Reset.
- CPU cache flush: invalidate cache lines before reset as a safety measure after memory offline.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…XL reset Add sibling PCI function save/disable/restore coordination for CXL reset. Before reset, all CXL.cachemem sibling functions are locked, saved, and disabled; after reset they are restored. The Non-CXL Function Map DVSEC and per-function DVSEC capability register are consulted to skip non-CXL and CXL.io-only functions. A global mutex serializes concurrent resets to prevent deadlocks between sibling functions. Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com> (backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/) Signed-off-by: Jiandi An <jan@nvidia.com>
…ration

cxl_dev_reset() implements the hardware reset sequence: optionally enable memory clear, initiate reset via CTRL2, wait for completion, and re-enable caching. cxl_do_reset() orchestrates the full reset flow:

1. CXL pre-reset: mem offlining and cache flush (when memdev present)
2. PCI save/disable: pci_dev_save_and_disable() automatically saves CXL DVSEC and HDM decoder state via PCI core hooks
3. Sibling coordination: save/disable CXL.cachemem sibling functions
4. Execute CXL DVSEC reset
5. Sibling restore: always runs to re-enable sibling functions
6. PCI restore: pci_dev_restore() automatically restores CXL state

The CXL-specific DVSEC and HDM save/restore is handled by the PCI core's CXL save/restore infrastructure (drivers/pci/cxl.c).

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
Add a "cxl_reset" sysfs attribute to PCI devices that support CXL Reset (CXL r3.2 section 8.1.3.1). The attribute is visible only on devices with both CXL.cache and CXL.mem capabilities and the CXL Reset Capable bit set in the DVSEC. Writing "1" to the attribute triggers the full CXL reset flow via cxl_do_reset(). The interface is decoupled from memdev creation: when a CXL memdev exists, memory offlining and cache flush are performed; otherwise reset proceeds without the memory management. The sysfs attribute is managed entirely by the CXL module using sysfs_create_group() / sysfs_remove_group() rather than the PCI core's static attribute groups. This avoids cross-module symbol dependencies between the PCI core (always built-in) and CXL_BUS (potentially modular). At module init, existing PCI devices are scanned and a PCI bus notifier handles hot-plug/unplug. kernfs_drain() makes sure that any in-flight store() completes before sysfs_remove_group() returns, preventing use-after-free during module unload. Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com> (backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/) Signed-off-by: Jiandi An <jan@nvidia.com>
…tribute Document the cxl_reset sysfs attribute added to PCI devices that support CXL Reset. Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com> (backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/) Signed-off-by: Jiandi An <jan@nvidia.com>
…and RAS support

Add Ubuntu kernel config annotations for CXL-related configs introduced or changed by the following cherry-picked patch series:

- drivers/cxl changes between v6.17.9 and upstream 7.0 (which includes a portion of Terry Bowman's v14 CXL RAS series merged via for-7.0/cxl-aer-prep)
- Alejandro Lucero's v23 CXL Type-2 device support series
- Smita Koralahalli's v6 patch 3/9 (cxl/region: Skip decoder reset on detach for autodiscovered regions)

CONFIG_CXL_BUS: Enable CXL bus support built-in; required for CXL Type-2 device and RAS support
CONFIG_CXL_PCI: Enable CXL PCI management built-in; auto-selects CXL_MEM; required for CXL Type-2 device support
CONFIG_CXL_MEM: Auto-selected by CXL_PCI; required for CXL memory expansion and Type-2 device support
CONFIG_CXL_PORT: Required for CXL port enumeration; defaults to CXL_BUS value
CONFIG_FWCTL: Selected by CXL_BUS when CXL_FEATURES is enabled; required for CXL feature mailbox access
CONFIG_CXL_RAS: New def_bool replacing PCIEAER_CXL (Terry Bowman v14); auto-enabled with ACPI_APEI_GHES+PCIEAER+CXL_BUS for CXL RAS error handling
CONFIG_SFC_CXL: Solarflare SFC9100-family CXL Type-2 device support; not needed for NVIDIA platforms (n)
CONFIG_ACPI_APEI_EINJ: Required prerequisite for CONFIG_ACPI_APEI_EINJ_CXL
CONFIG_ACPI_APEI_EINJ_CXL: CXL protocol error injection support via APEI EINJ
CONFIG_PCIEAER_CXL: Remove it from debian.master policy. This config was removed from Kconfig by upstream commit d18f1b7 (PCI/AER: Replace PCIEAER_CXL symbol with CXL_RAS), which is included in this port.
CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION: Override debian.master amd64-only policy to include arm64. Commit 4d873c5 added 'select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION' to arch/arm64/Kconfig, making this y on arm64 as well.
CONFIG_GENERIC_CPU_CACHE_MAINTENANCE: New bool config defined by c460697 in lib/Kconfig. Selected by arm64 via 4d873c5; not selected by x86. Set arm64: y, amd64: -.
CONFIG_CACHEMAINT_FOR_HOTPLUG: New optional menuconfig defined by 2ec3b54 in drivers/cache/Kconfig. Depends on GENERIC_CPU_CACHE_MAINTENANCE so becomes visible on arm64. Defaults to n; HiSilicon HHA driver not needed for NVIDIA platforms. Set arm64: n, amd64: -.

Signed-off-by: Jiandi An <jan@nvidia.com>
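For reference, a sketch of how a couple of these entries might look in the Ubuntu annotations file, assuming the policy<>/note<> syntax used by recent Ubuntu kernel trees (the values restate the intent above; the exact formatting in the actual annotations file may differ):

```
CONFIG_CXL_BUS      policy<{'amd64': 'y', 'arm64': 'y'}>
CONFIG_CXL_BUS      note<'Required for CXL Type-2 device and RAS support'>
CONFIG_SFC_CXL      policy<{'amd64': 'n', 'arm64': 'n'}>
CONFIG_SFC_CXL      note<'Not needed for NVIDIA platforms'>
```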
…memory access

Override debian.master policy (m->y) for DEV_DAX, DEV_DAX_CXL, and DEV_DAX_KMEM to ensure CXL memory regions are accessible as both raw DAX devices and hotplugged System-RAM nodes.

debian.master sets these to 'm' (modules). For NVIDIA platforms with CXL Type-2 devices, built-in (y) is required to ensure CXL memory regions provisioned early in boot are immediately accessible without relying on module loading order.

CONFIG_DEV_DAX: Override m->y; prerequisite for DEV_DAX_CXL and DEV_DAX_KMEM to be built-in; depends on TRANSPARENT_HUGEPAGE (already y in debian.master)
CONFIG_DEV_DAX_CXL: Override m->y; creates /dev/daxX.Y devices for CXL RAM regions not in the default system memory map (Soft Reserved or dynamically provisioned regions); depends on CXL_BUS+CXL_REGION+DEV_DAX (all y)
CONFIG_DEV_DAX_KMEM: Override m->y; onlines CXL DAX devices as System-RAM NUMA nodes via memory hotplug, making CXL memory available for normal kernel and userspace allocation

Signed-off-by: Jiandi An <jan@nvidia.com>
…/restore Add Ubuntu kernel config annotation for CONFIG_PCI_CXL introduced by the CXL DVSEC and HDM state save/restore series (Srirangan Madhavan). CONFIG_PCI_CXL: Hidden bool in drivers/pci/Kconfig; auto-enabled when CXL_BUS=y. Gates compilation of drivers/pci/cxl.o which saves and restores CXL DVSEC control/range registers and HDM decoder state across PCI resets and link transitions. Signed-off-by: Jiandi An <jan@nvidia.com>
@JiandiAnNVIDIA
Author

This patch "PCI: Update CXL DVSEC definitions" missed one rename

nvidia@localhost:/home/nvidia/NV-Kernels$ make
  CALL    scripts/checksyscalls.sh
  CC      drivers/pci/ats.o
drivers/pci/ats.c: In function ‘pci_cxl_ats_always_on’:
drivers/pci/ats.c:221:44: error: ‘CXL_DVSEC_PCIE_DEVICE’ undeclared (first use in this function); did you mean ‘PCI_DVSEC_CXL_DEVICE’?
  221 |                 CXL_DVSEC_PCIE_DEVICE);
      |                 ^~~~~~~~~~~~~~~~~~~~~
      |                 PCI_DVSEC_CXL_DEVICE
drivers/pci/ats.c:221:44: note: each undeclared identifier is reported only once for each function it appears in
drivers/pci/ats.c:225:45: error: ‘CXL_DVSEC_CAP_OFFSET’ undeclared (first use in this function)
  225 |         pci_read_config_word(pdev, offset + CXL_DVSEC_CAP_OFFSET, &cap);
      |                                             ^~~~~~~~~~~~~~~~~~~~
make[4]: *** [scripts/Makefile.build:287: drivers/pci/ats.o] Error 1
make[3]: *** [scripts/Makefile.build:556: drivers/pci] Error 2
make[2]: *** [scripts/Makefile.build:556: drivers] Error 2
make[1]: *** [/home/nvidia/NV-Kernels/Makefile:2016: .] Error 2
make: *** [Makefile:248: __sub-make] Error 2

Fixed.

@clsotog
Collaborator

clsotog commented Mar 24, 2026

I see the compiling issue is fixed. That was my concern.

@nvmochs nvmochs self-requested a review March 24, 2026 19:37
Collaborator

@nvmochs nvmochs left a comment


I reviewed the name change fix in "PCI: Update CXL DVSEC definitions" and confirmed it builds successfully for arm64.

No further issues or concerns from me.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

@clsotog clsotog self-requested a review March 24, 2026 19:58
Collaborator

@clsotog clsotog left a comment


Acked-by: Carol L Soto <csoto@nvidia.com>

@nirmoy
Collaborator

nirmoy commented Mar 25, 2026

Tried this on GB300 yesterday with the compilation issue manually fixed. Ran CUDA DVS tests like http://10.112.214.250:8002/. We still need to make sure that there are no regressions with the older RM driver. With that,
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
