fix SIGSEGV when release resources #52

Open
alpha-baby wants to merge 1 commit into NVIDIA:devel from alpha-baby:fujh/fix_repeat_release_device

Conversation

@alpha-baby

reproduce

Machine information

$ nvidia-smi topo -m
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  NODE  NODE  SYS   SYS   0-47,96-143    0              N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  PIX   NODE  SYS   SYS   0-47,96-143    0              N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  NODE  NODE  SYS   SYS   0-47,96-143    0              N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  NODE  PIX   SYS   SYS   0-47,96-143    0              N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   PIX   NODE  48-95,144-191  1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   NODE  NODE  48-95,144-191  1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   NODE  PIX   48-95,144-191  1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   NODE  NODE  48-95,144-191  1              N/A
NIC0  NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  SYS   SYS
NIC1  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   NODE  X     SYS   SYS
NIC2  SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  SYS   SYS   X     NODE
NIC3  SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  SYS   SYS   NODE  X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3

NIC info

hca_id: mlx5_bond_0
    transport:          InfiniBand (0)
    fw_ver:             32.39.3920
    node_guid:          e09d:7303:0024:7630
    sys_image_guid:     e09d:7303:0024:7630
    vendor_id:          0x02c9
    vendor_part_id:     41692
    hw_ver:             0x1
    board_id:           MT_0000001093
    phys_port_cnt:      1
        port: 1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     Ethernet

hca_id: mlx5_bond_1
    transport:          InfiniBand (0)
    fw_ver:             32.39.3920
    node_guid:          e09d:7303:0024:79a0
    sys_image_guid:     e09d:7303:0024:79a0
    vendor_id:          0x02c9
    vendor_part_id:     41692
    hw_ver:             0x1
    board_id:           MT_0000001093
    phys_port_cnt:      1
        port: 1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     Ethernet

hca_id: mlx5_bond_2
    transport:          InfiniBand (0)
    fw_ver:             32.39.3920
    node_guid:          e09d:7303:0027:1b08
    sys_image_guid:     e09d:7303:0027:1b08
    vendor_id:          0x02c9
    vendor_part_id:     41692
    hw_ver:             0x1
    board_id:           MT_0000001093
    phys_port_cnt:      1
        port: 1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     Ethernet

hca_id: mlx5_bond_3
    transport:          InfiniBand (0)
    fw_ver:             32.39.3920
    node_guid:          e09d:7303:0024:6fbe
    sys_image_guid:     e09d:7303:0024:6fbe
    vendor_id:          0x02c9
    vendor_part_id:     41692
    hw_ver:             0x1
    board_id:           MT_0000001093
    phys_port_cnt:      1
        port: 1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     Ethernet

env config:

NVSHMEM_ENABLE_NIC_PE_MAPPING=1
NVSHMEM_DEBUG_SUBSYS=INIT
NVSHMEM_IB_GID_INDEX=3
NVSHMEM_IB_SL=5
NVSHMEM_DEBUG=INFO
NVSHMEM_HCA_PE_MAPPING=mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2
NVSHMEM_IB_TRAFFIC_CLASS=16

NCCL_SOCKET_IFNAME=bond0
NCCL_NET_PLUGIN=
NCCL_IB_TIMEOUT=22
NCCL_IB_GID_INDEX=3
NCCL_SET_THREAD_NAME=1
NCCL_DEBUG_SUBSYS=INIT,TUNING,GRAPH
NCCL_IB_SL=5
NCCL_IB_TC=136
NCCL_IB_HCA=mlx5_bond
NCCL_IB_RETRY_CNT=7
NCCL_IB_QPS_PER_CONNECTION=8
NCCL_DEBUG=INFO
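
For readers unfamiliar with the `NVSHMEM_HCA_PE_MAPPING` value used here, the following is a rough sketch of how it can be read. It assumes each comma-separated entry has the form `<hca name>:<port>:<number of PEs>` and that PEs are handed out in list order, which is my understanding of the NVSHMEM documentation and not something this PR states. The helper names `parse_hca_pe_mapping` and `assign_pes` are hypothetical and are not part of NVSHMEM.

```python
from typing import Dict, List, NamedTuple, Tuple


class HcaMapping(NamedTuple):
    hca: str       # HCA name, e.g. "mlx5_bond_0"
    port: int      # port number on that HCA
    pe_count: int  # number of PEs assumed to use this HCA port


def parse_hca_pe_mapping(value: str) -> List[HcaMapping]:
    """Split an '<hca>:<port>:<count>' list into structured entries (assumed format)."""
    entries = []
    for item in value.split(","):
        hca, port, pe_count = item.strip().split(":")
        entries.append(HcaMapping(hca, int(port), int(pe_count)))
    return entries


def assign_pes(entries: List[HcaMapping]) -> Dict[int, Tuple[str, int]]:
    """Assign local PE indices to HCA ports in list order (assumed semantics)."""
    assignment = {}
    pe = 0
    for entry in entries:
        for _ in range(entry.pe_count):
            assignment[pe] = (entry.hca, entry.port)
            pe += 1
    return assignment


# Value taken from the env config in this PR.
mapping = parse_hca_pe_mapping(
    "mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
)
for pe, (hca, port) in assign_pes(mapping).items():
    print(f"PE {pe} -> {hca} port {port}")
```

Under these assumptions the four entries cover 8 PEs in total, two per bonded NIC, which matches the 8 GPUs shown in the topology output.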

DeepEP run log file:
deep_ep_test.log
