Best practices on driver binding for SPDK in production environments
by Lance Hartmann ORACLE
This email to the SPDK list is a follow-on to a brief discussion held during a recent SPDK community meeting (Tue Jun 26 UTC 15:00).
Lifted and edited from the Trello agenda item (https://trello.com/c/U291IBYx/91-best-practices-on-driver-binding-for-spd...):
During development, many (most?) people rely on running SPDK's scripts/setup.sh to perform a number of initializations, among them unbinding the Linux kernel nvme driver from NVMe controllers targeted for use by SPDK and then binding them to either uio_pci_generic or vfio-pci. This script is intended for development environments, though, not for production systems employing SPDK.
I'd like to confer with my fellow SPDK community members on ideas, suggestions, and best practices for handling this driver unbinding/binding. I wrote some udev rules, along with updates to some other Linux system conf files, to automatically load either the uio_pci_generic or vfio-pci module. I also had to update my initramfs so that when the system comes all the way up, the desired NVMe controllers are already bound to the driver SPDK needs. And, as a bonus, it should "just work" when a hotplug occurs as well. However, there may be additional considerations I have overlooked, on which I'd appreciate input. Further, there's the question of how (and whether) to semi-automate this configuration via some kind of script, how that might vary across Linux distros, and how to decide between uio_pci_generic and vfio-pci.
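For reference, the per-device unbind/bind that scripts/setup.sh performs boils down to a few sysfs writes. The following is a rough, illustrative sketch, not the exact setup.sh logic: the BDF is an example, the writes require root and a real device, and the script only prints its plan when the sysfs files are not writable.

```shell
#!/bin/sh
# Rough sketch of the per-device unbind/bind that scripts/setup.sh performs.
# The BDF is an example; vfio-pci must already be loaded for the re-probe
# to succeed, and the sysfs writes require root on a machine with this device.
BDF="0000:40:00.0"
DEV="/sys/bus/pci/devices/$BDF"

echo "plan: unbind $BDF from nvme and bind it to vfio-pci"

if [ -w "$DEV/driver/unbind" ]; then
    echo "$BDF" > "$DEV/driver/unbind"        # detach the kernel nvme driver
    echo vfio-pci > "$DEV/driver_override"    # pin this device to vfio-pci
    echo "$BDF" > /sys/bus/pci/drivers_probe  # ask the kernel to re-probe it
fi
```

(driver_override needs a reasonably recent kernel, 3.16+; on older kernels the new_id/bind files under /sys/bus/pci/drivers/vfio-pci serve the same purpose.)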
And, now some details:
1. I performed this on an Oracle Linux (OL) distro. I'm currently unaware of how and which configuration files might differ depending on the distro. Oracle Linux is RedHat-compatible, so I'm confident my implementation should work similarly on RedHat-based systems, but I've yet to delve into other distros like Debian, SuSE, etc.
2. In preparation for writing my own udev rules, I unbound a specific NVMe controller from the Linux nvme driver by hand. Then, in another window, I launched "udevadm monitor -k -p" so that I could observe the usual udev events when an NVMe controller is bound to the nvme driver. On my system, I observed four (4) udev kernel events (output abbreviated/edited to avoid this becoming excessively long):
(Event 1)
KERNEL[382128.187273] add /devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0 (nvme)
ACTION=add
DEVNAME=/dev/nvme0
…
SUBSYSTEM=nvme
(Event 2)
KERNEL[382128.244658] bind /devices/pci0000:00/0000:00:02.2/0000:30:00.0 (pci)
ACTION=bind
DEVPATH=/devices/pci0000:00/0000:00:02.2/0000:30:00.0
DRIVER=nvme
…
SUBSYSTEM=pci
(Event 3)
KERNEL[382130.697832] add /devices/virtual/bdi/259:0 (bdi)
ACTION=add
DEVPATH=/devices/virtual/bdi/259:0
...
SUBSYSTEM=bdi
(Event 4)
KERNEL[382130.698192] add /devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0/nvme0n1 (block)
ACTION=add
DEVNAME=/dev/nvme0n1
DEVPATH=/devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0/nvme0n1
DEVTYPE=disk
...
SUBSYSTEM=block
3. My udev rule triggers on (Event 2) above, the bind action. Upon this action, my udev rule appends operations to the special udev RUN variable so that udev essentially mirrors what SPDK's scripts/setup.sh does: unbind from the nvme driver and bind to, in my case, the vfio-pci driver.
4. With my new udev rules in place, I succeeded in getting specific NVMe controllers (selected by bus-device-function) to unbind from the Linux nvme driver and bind to vfio-pci. However, I made a couple of observations in the kernel log (dmesg). In particular, I was drawn to the following for the NVMe controller at BDF 0000:40:00.0, for which I had a udev rule to unbind from nvme and bind to vfio-pci:
[ 35.534279] nvme nvme1: pci function 0000:40:00.0
[ 37.964945] nvme nvme1: failed to mark controller live
[ 37.964947] nvme nvme1: Removing after probe failure status: 0
One theory I have for the above is that my udev RUN rule was invoked while the nvme driver's probe() was still running on this controller, and the unbind request came in before probe() completed, hence this "nvme1: failed to mark controller live". This has left me wondering whether, instead of triggering on (Event 2) when the bind occurs, I should try to trigger on the "last" udev event, an "add", where the NVMe namespaces are instantiated. Of course, I'd need to know ahead of time how many namespaces exist on that controller so that I could trigger on the last one. I'm wondering if that would help avoid what looks like a complaint from the middle of probe() on that particular controller. Then again, maybe I can safely ignore it and not worry about it at all? Thoughts?
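For concreteness, a rule along the lines of item 3 might look like the sketch below. The helper script name is hypothetical, and the commented-out variant shows the item-4 idea of triggering on the final "add" of the namespace's block device instead of the bind:

```
# /etc/udev/rules.d/99-spdk-vfio.rules -- illustrative sketch only

# Item 3: trigger on (Event 2), the PCI "bind" of the nvme driver:
ACTION=="bind", SUBSYSTEM=="pci", DRIVER=="nvme", KERNEL=="0000:40:00.0", \
    RUN+="/usr/local/sbin/spdk-rebind.sh 0000:40:00.0"

# Item 4 alternative: trigger on (Event 4), the "add" of the namespace's
# block device, matching the controller's BDF via the parent chain:
# ACTION=="add", SUBSYSTEM=="block", KERNELS=="0000:40:00.0", \
#     RUN+="/usr/local/sbin/spdk-rebind.sh 0000:40:00.0"
```

Either way, vfio-pci still has to be loaded early (e.g. via /etc/modules-load.d/ and the initramfs) so the RUN helper has a driver to bind to.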
I discovered another issue during this experimentation that is somewhat tangential to this task, but I’ll write a separate email on that topic.
thanks for any feedback,
--
Lance Hartmann
lance.hartmann(a)oracle.com
nvme_driver_init() of secondary failed if primary doesn't connect any nvme ctrl
by wuzhouhui
Hi,
I encountered an issue when connecting to an SPDK NVMoF target from a secondary SPDK
instance. The reason is quite simple, but I'm not sure whether it should be
treated as an error. If the answer is yes, I will submit an issue on GitHub if
necessary.
Reproduce steps:
1. Start a NVMoF/TCP target in remote host
2. In local host, do
2.1 Start primary SPDK instance
2.2 Start secondary SPDK instance
2.3 In the secondary SPDK instance, construct an nvme bdev that connects to the
NVMoF/TCP target
Results:
Construction of the nvme bdev failed; the log says:
nvme.c: 360:nvme_driver_init: *ERROR*: primary process is not started yet
bdev_nvme.c:1314:spdk_bdev_nvme_create: *ERROR*: No controller was found with provided trid (traddr: [snip])
Thanks.
Re: [SPDK] [Qemu-devel] Qemu migration with vhost-user-blk on top of local storage
by wuzhouhui
> -----Original Messages-----
> From: "Stefan Hajnoczi" <stefanha(a)gmail.com>
> Sent Time: 2019-01-09 20:42:58 (Wednesday)
> To: wuzhouhui <wuzhouhui14(a)mails.ucas.ac.cn>
> Cc: qemu-devel(a)nongnu.org, xieyongji(a)baidu.com, lilin24(a)baidu.com, libvir-list(a)redhat.com, spdk(a)lists.01.org
> Subject: Re: [Qemu-devel] Qemu migration with vhost-user-blk on top of local storage
>
> On Wed, Jan 09, 2019 at 06:23:42PM +0800, wuzhouhui wrote:
> > Hi everyone,
> >
> > I'm working with qemu and a vhost target (e.g. spdk), and I am attempting to migrate a VM with
> > 2 local storages. One local storage is a regular file, e.g. /tmp/c74.qcow2, and
> > the other is a malloc bdev that spdk created. This malloc bdev is exported to
> > the VM via vhost-user-blk. When I execute the following command:
> >
> > virsh migrate --live --persistent --unsafe --undefinesource --copy-storage-all \
> > --p2p --auto-converge --verbose --desturi qemu+tcp://<uri>/system vm0
> >
> > The libvirt reports:
> >
> > qemu-2.12.1: error: internal error: unable to execute QEMU command \
> > 'nbd-server-add': Cannot find device=drive-virtio-disk1 nor \
> > node_name=drive-virtio-disk1
>
> Please post your libvirt domain XML.
My libvirt is based on libvirt-1.1.1-29.el7, with many patches added to satisfy our
own needs, e.g. support for vhost-user-blk, so posting the domain XML may not be useful.
Anyway, the following is the full content of the XML:
<domain type='kvm'>
<name>wzh</name>
<uuid>a84e96e6-2c53-408d-986b-c709bc6a0e51</uuid>
<memory unit='MiB'>4096</memory>
<memoryBacking>
<hugepages/>
</memoryBacking>
<currentMemory unit='MiB'>4096</currentMemory>
<vcpu placement='static' cpuset='16-31'>2</vcpu>
<os>
<type arch='x86_64' machine='pc'>hvm</type>
<boot dev='hd'/>
</os>
<features>
<acpi/>
</features>
<clock offset='utc'>
<timer name='rtc' tickpolicy='catchup'/>
</clock>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>destroy</on_crash>
<devices>
<emulator>/data/wzh/x86_64-softmmu/qemu-system-x86_64</emulator>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='none'/>
<source file='/data/wzh/c74.qcow2'/>
<target dev='vda' bus='virtio'/>
<alias name='virtio-disk0'/>
</disk>
<disk type='vhost-user-blk' device='disk'>
<source type='unix' path='/var/tmp/lv0' mode='client'>
</source>
<target dev='vdb' bus='virtio'/>
<driver queues='4'/>
</disk>
<controller type='usb' index='0'>
<alias name='usb0'/>
</controller>
<controller type='pci' index='0' model='pci-root'>
<alias name='pci.0'/>
</controller>
<serial type='pty'>
<target port='0'/>
<alias name='serial0'/>
</serial>
<serial type='pty'>
<target port='1'/>
<alias name='serial1'/>
</serial>
<input type='tablet' bus='usb'>
<alias name='input0'/>
</input>
<input type='mouse' bus='ps2'/>
<graphics type='vnc' autoport='yes' listen='0.0.0.0' keymap='en-us'>
<listen type='address' address='0.0.0.0'/>
</graphics>
<video>
<model type='cirrus' vram='9216' heads='1'/>
<alias name='video0'/>
</video>
</devices>
<seclabel type='none'/>
</domain>
>
> > Does it mean that qemu with spdk on top of local storage doesn't support migration?
> >
> > QEMU: 2.12.1
> > SPDK: 18.10
>
> vhost-user-blk bypasses the QEMU block layer, so NBD storage migration
> at the QEMU level will not work for the vhost-user-blk disk.
>
> Stefan
Chandler Build Pool Test Failures
by Howell, Seth
Hi all,
There has been a rash of failures on the test pool starting last night. I was able to root-cause the failures to a point in the NVMe-oF shutdown tests. The substance of the failure is that QAT and the DPDK framework don't always play well with secondary DPDK processes. In the interest of avoiding these failures on future builds, please rebase your changes on the following patch series, which includes the fix of not running bdevperf as a secondary process in the NVMe-oF shutdown tests.
https://review.gerrithub.io/c/spdk/spdk/+/435937/6
Thanks,
Seth Howell
lvol_store restored even if base raid not created for single-base raid
by wuzhouhui
Hi,
Assume that:
1. construct raid_bdev on a single nvme_bdev
2. construct lvol_store on the raid_bdev
3. restart spdk app
After the spdk app starts and the nvme_bdev is constructed, the lvol_store is
restored even if the raid_bdev is not constructed. In other words, the
single-base raid_bdev behaves just like a passthru_bdev. What I want is for the
lvol_store not to be restored until the base bdev (the raid_bdev, in this
example) is constructed. Is that possible?
Thanks.
VM boot failed sometimes if using vhost-user-blk with spdk
by wuzhouhui
I'm using the following command line to start a VM (/var/tmp/vhost.0 is connected to a
16 MB malloc bdev created by SPDK):
/home/wuzhouhui/qemu-2.12.1/x86_64-softmmu/qemu-system-x86_64 \
-name guest=wzh,debug-threads=on \
-machine pc-i440fx-2.12,accel=kvm,usb=off \
-cpu host \
-m 4096 \
-object memory-backend-file,id=mem0,size=4096M,mem-path=/dev/hugepages,share=on \
-numa node,memdev=mem0 \
-realtime mlock=off \
-smp 2,sockets=2,cores=1,threads=1 \
-uuid a84e96e6-2c53-408d-986b-c709bc6a0e51 \
-no-user-config \
-nodefaults \
-rtc base=utc,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-shutdown \
-boot strict=on \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
-device ahci,id=sata0,bus=pci.0,addr=0x4 \
-drive file=/home/wuzhouhui/wzh.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=0 \
-device usb-tablet,id=input0,bus=usb.0,port=1 \
-k en-us \
-device cirrus-vga,id=video0,bus=pci.0,addr=0x2 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on \
-vnc :9 \
-chardev socket,id=char0,path=/var/tmp/vhost.0 \
-device vhost-user-blk-pci,id=blk0,chardev=char0,num-queues=4 \
But most of the time, the VM fails to boot, with the following message on the vnc screen:
Warning: /dev/disk/by-uuid/e0dcaf0c-bc23-4df6-b2cd-d40aa1bbb0b5 does not exist
Generating "/run/initramfs/rdsosreport.txt"
Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.
Here are some messages from /run/initramfs/rdsosreport.txt:
...
system-udevd[196]: symlink '../../vdb' '/dev/disk/by-id/virtio-Malloc.tmp-b253:16' failed: Not a directory
...
I checked /dev/disk; it should be a directory, but it is a symlink now:
:/# ls -l /dev/disk
lrwxrwxrwx 1 root 0 3 Oct 30 03:08 /dev/disk -> vdb
If I just remove:
-chardev socket,id=char0,path=/var/tmp/vhost.0 \
-device vhost-user-blk-pci,id=blk0,chardev=char0,num-queues=4 \
The VM will boot normally.
Has anyone encountered a similar issue?
Host OS: CentOS 7.3, with kernel 3.10.0-862.11.6.el7.x86_64
Guest OS: CentOS 7.5
Qemu: 2.12.1
SPDK: f0cb7b871e in master
assert() do not work in CentOS 7.4
by wuzhouhui
Hi,
When I use assert() to check the return value of a function, like:
assert(foo() == 0);
I found that foo() is not called at all. After preprocessing with cpp, I
found that the previous statement has been replaced by:
((void) (0));
You can see this result by changing SPDK a bit:
diff --git a/app/vhost/Makefile b/app/vhost/Makefile
index ef75e5e..99cd401 100644
--- a/app/vhost/Makefile
+++ b/app/vhost/Makefile
@@ -53,6 +53,7 @@ LIBS += $(ENV_LINKER_ARGS)
all : $(APP)
@:
+ @cpp $(CFLAGS) vhost.c -o vhost.cpp
$(APP) : $(OBJS) $(SPDK_LIB_FILES) $(ENV_LIBS) $(BLOCKDEV_MODULES_FILES) $(COPY_MODULES_FILES) $(SOCK_MODULES_FILES)
$(LINK_C)
diff --git a/app/vhost/vhost.c b/app/vhost/vhost.c
index af0ece1..05fc72f 100644
--- a/app/vhost/vhost.c
+++ b/app/vhost/vhost.c
@@ -112,7 +112,7 @@ main(int argc, char *argv[])
}
/* Blocks until the application is exiting */
- rc = spdk_app_start(&opts, vhost_started, NULL, NULL);
+ assert((rc = spdk_app_start(&opts, vhost_started, NULL, NULL)) == 0);
spdk_app_fini();
Then, just type "make". gcc will warn:
vhost.c:92:1: warning: ‘vhost_started’ defined but not used [-Wunused-function]
And if you open vhost.cpp, you will see that the previous assert() statement has become:
if (g_pid_path) {
save_pid(g_pid_path);
}
((void) (0));
spdk_app_fini();
return rc;
Does this mean that all the assert() calls in the source code do nothing at all? I don't
know whether this should be treated as an issue.
OS: CentOS 7.4
SPDK: 8166215677 in master
gcc: gcc-4.8.5-28.el7_5.1.x86_64
Qemu failed if specify reconnect=1 for vhost-user-blk socket
by wuzhouhui
Hi, all.
I'm using Qemu with SPDK. When I specify the option "reconnect=1" for the
vhost-user-blk socket, Qemu exits with an error:
2018-08-17T03:39:21.768809Z qemu-system-x86_64: -device vhost-user-blk-pci,id=blk0,chardev=char0,num-queues=4: Failed to set msg fds.
2018-08-17T03:39:21.768893Z qemu-system-x86_64: -device vhost-user-blk-pci,id=blk0,chardev=char0,num-queues=4: vhost-user-blk: vhost initialization failed: Operation not permitted
The whole command line is:
/root/wzh/qemu-2.12.0/build/x86_64-softmmu/qemu-system-x86_64 \
-name guest=wzh,debug-threads=on \
-machine pc-i440fx-2.12,accel=kvm,usb=off,dump-guest-core=off \
-cpu host \
-m 2048 \
-object memory-backend-file,id=mem0,size=2G,mem-path=/dev/hugepages,share=on \
-numa node,memdev=mem0 \
-realtime mlock=off \
-smp 32,sockets=16,cores=2,threads=1 \
-uuid a84e96e6-2c53-408d-986b-c709bc6a0e51 \
-no-user-config \
-nodefaults \
-rtc base=utc,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-shutdown \
-boot strict=on \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
-device ahci,id=sata0,bus=pci.0,addr=0x4 \
-drive file=/root/wzh/wzh.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=0 \
-device usb-tablet,id=input0,bus=usb.0,port=1 \
-vnc :0 \
-k en-us \
-device cirrus-vga,id=video0,bus=pci.0,addr=0x2 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 \
-msg timestamp=on \
-chardev socket,id=char0,path=/var/tmp/vhost.0,reconnect=1 \
-device vhost-user-blk-pci,id=blk0,chardev=char0,num-queues=4
Qemu works fine if I just remove the "reconnect=1" option. Does Qemu not support this
option for SPDK's vhost-user-blk, or did I make a stupid mistake? Thanks for any advice.
OS: CentOS 7.4
Qemu: 2.12.0
SPDK: 9ee494213 in master