Hi,


We have encountered an issue with some Mellanox cards where rdma_bind_addr succeeds but the ib_verbs pointer is NULL, which caused SPDK to crash when it attempted to use this port. The reason seems to be an invalid GUID of 0 (below is a procedure to re-flash it).


I don't think this should cause SPDK to crash, so I submitted a patch for review that checks that the IB verbs pointer is not NULL after binding: https://review.gerrithub.io/c/spdk/spdk/+/417858


Hope this helps,

Shahar


P.S.


It seems that Mellanox has a "blank_guid" option in its mstflint flash interface, so some manufacturers may provide RNICs with a base GUID of 0. This issue can be fixed by using the same tool to flash a new GUID. We use the MAC to generate it.


#Here is an example of such an RNIC with 0 as a base GUID:

[root@kblock01-knode05 ~]# ibv_devices 
    device                 node GUID
    ------              ----------------
    mlx5_1              0000000000000001
    mlx5_0              0000000000000000
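
#To spot such devices quickly, something like the following could be used. This is only a rough sketch: the function name is made up for illustration, and it assumes the two-line header and column layout shown in the output above.

```shell
# Hypothetical helper: print every device whose node GUID is all zeros.
# Reads `ibv_devices` output on stdin and skips its two header lines.
find_zero_guid_devices() {
    awk 'NR > 2 && $2 ~ /^0+$/ { print $1 }'
}

# Usage on a host with RDMA devices:
#   ibv_devices | find_zero_guid_devices
```

#For the output above this would list only mlx5_0, since the GUID of mlx5_1 is non-zero.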

#First, generate the GUID from the MAC:

BASE_MAC=$(mstflint --device=mlx5_0 query | grep "^Base MAC" | tr -s " " | cut -d" " -f 3)

BASE_GUID=$(echo "$BASE_MAC" | cut -c 1-6)"0300"$(echo "$BASE_MAC" | cut -c 7-12)
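
#The same MAC-to-GUID derivation can be wrapped in a small function that rejects malformed input before anything is flashed. This is a sketch only: the function name is hypothetical, and it assumes the MAC is given as 12 hex digits without separators, which is what the query above returns on our cards.

```shell
# Hypothetical helper: derive a 64-bit GUID from a 48-bit MAC by inserting
# "0300" between the first and last three bytes, as in the commands above.
mac_to_guid() {
    mac=$1
    # Reject any character that is not a hex digit.
    case $mac in
        *[!0-9a-fA-F]*) echo "invalid MAC: $mac" >&2; return 1 ;;
    esac
    # Require exactly 12 hex digits (48 bits).
    if [ "${#mac}" -ne 12 ]; then
        echo "invalid MAC: $mac" >&2
        return 1
    fi
    printf '%s0300%s\n' "$(echo "$mac" | cut -c 1-6)" "$(echo "$mac" | cut -c 7-12)"
}

# Usage: BASE_GUID=$(mac_to_guid "$BASE_MAC") || exit 1
```

#Failing early matters here because an empty or truncated BASE_MAC (for example, if the grep matched nothing) would otherwise produce a bogus GUID.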


#Now I flash the new GUID:

mstflint --device=mlx5_0 --guid=$BASE_GUID --override_cache_replacement --nofs sg

#The new GUID does not take effect until a FW reset or power cycle:
mlxfwreset --device=/dev/mst/mt4117_pciconf0 --yes reset