nvme drive not showing in vm in spdk
by Nitin Gupta
Hi All,
I am new to SPDK development and am currently working on an SPDK vhost setup;
I was able to set up the back-end storage with NVMe. After starting the VM with
the following command, no NVMe drive is present in the guest:
/usr/local/bin/qemu-system-x86_64 -m 1024 \
    -object memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages,share=on \
    -nographic -no-user-config -nodefaults \
    -serial mon:telnet:localhost:7704,server,nowait \
    -monitor mon:telnet:localhost:8804,server,nowait \
    -numa node,memdev=mem \
    -drive file=/home/qemu/qcows,format=qcow2,if=none,id=disk \
    -device ide-hd,drive=disk,bootindex=0 \
    -chardev socket,id=char0,path=./spdk/vhost.0 \
    -device vhost-user-scsi-pci,id=scsi0,chardev=char0 \
    --enable-kvm
How do I identify which disk in the guest is the NVMe drive?
Is there any way to expose an NVMe drive from the QEMU command line?
PS: I have already specified the NVMe drive in vhost.conf.in.
Regards
Nitin
3 years, 4 months
Two Questions/Ideas about SPDK Framework (NVMf/e and RPC)
by 松本周平 / MATSUMOTO,SHUUHEI
Hi,
These questions/ideas are still not fully formed, because I still have quite a few things to do for iSCSI/SCSI right now, but I would like to hear your thoughts at this stage.
1)
If the NVMe_backend_driver and the NVMf_target_driver run on the same CPU,
we may be able to use an end-to-end run-to-completion model and get some benefit from locality.
(https://github.com/stanford-mast/reflex is very interesting to me.)
As far as I understand, we currently have to locate the NVMe_backend_driver and the NVMf_target_driver on different CPUs.
The NVMf_target_driver runs in the SPDK poller framework but the NVMe_backend_driver does not.
Is it reasonable to put some functions of the NVMe_backend_driver into the SPDK poller framework?
Or, when we base the iSCSI target on DPDK in the future, should we put some functions of the NVMe_backend_driver into the DPDK event-driven framework?
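To make 1) a bit more concrete, below is the kind of change I have in mind. It is only a rough sketch: the poller registration header and signature differ between SPDK versions (older releases take an lcore argument), so treat backend_ctx and the function names as my own placeholders rather than existing code.

#include "spdk/nvme.h"
#include "spdk/thread.h"	/* location/signature of spdk_poller_register() varies by SPDK version */

/* Hypothetical per-core backend context: one NVMe queue pair owned by this core. */
struct backend_ctx {
	struct spdk_nvme_qpair *qpair;
	struct spdk_poller *poller;
};

/* Poller body: drain completions for the queue pair owned by this core. */
static int
backend_completion_poll(void *arg)
{
	struct backend_ctx *ctx = arg;

	return spdk_nvme_qpair_process_completions(ctx->qpair, 0);
}

/* Register the backend completion handling on the current core, i.e. the same
 * reactor that runs the NVMf_target_driver pollers, for run-to-completion. */
static void
backend_start_on_this_core(struct backend_ctx *ctx)
{
	ctx->poller = spdk_poller_register(backend_completion_poll, ctx, 0);
}

The point is only that submission and completion for a given queue pair stay on one core, so no cross-core hand-off is needed.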
2)
Currently it is difficult for an RPC handler to use a semaphore in the middle of its processing, hence I have proposed one idea (https://review.gerrithub.io/#/c/379941/) as a workaround.
Related to this, as Jim taught me, the VHOST-SCSI thread is outside of SPDK and can use a semaphore to do synchronous operations.
To support complex operations in RPC, I think there are at least three approaches:
a) https://review.gerrithub.io/#/c/379941/
b) support asynchronous RPC replies by using a callback or event.
c) run the RPC handler outside of the SPDK threads and communicate with the SPDK threads through IPC.
I would like to propose b) if this looks reasonable.
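As a rough illustration of b), the RPC handler could hand the request off to the core that owns the resource and send the JSON-RPC reply from a callback there, instead of blocking on a semaphore. This is only a sketch under my assumptions: complex_op_done, g_target_lcore, and the method itself are made-up names; only the spdk_jsonrpc_*/spdk_event_* calls are existing APIs.

#include "spdk/event.h"
#include "spdk/jsonrpc.h"
#include "spdk/json.h"

static uint32_t g_target_lcore;	/* hypothetical: the core that owns the resource */

/* Runs on the owning core: do the multi-step work there, then send the deferred reply. */
static void
complex_op_done(void *arg1, void *arg2)
{
	struct spdk_jsonrpc_request *request = arg1;
	struct spdk_json_write_ctx *w;

	/* ... perform the complex operation here ... */

	w = spdk_jsonrpc_begin_result(request);
	spdk_json_write_bool(w, true);
	spdk_jsonrpc_end_result(request, w);
}

/* RPC handler: no reply is sent here; the request is just passed to the owning core. */
static void
spdk_rpc_complex_op(struct spdk_jsonrpc_request *request,
		    const struct spdk_json_val *params)
{
	struct spdk_event *event;

	(void)params;	/* unused in this sketch */
	event = spdk_event_allocate(g_target_lcore, complex_op_done, request, NULL);
	spdk_event_call(event);
}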
I'm afraid my explanation may not be sufficient; thank you in advance for your patience.
Thank you,
Shuhei Matsumoto
3 years, 4 months
messy code for log
by liupan1234
Hi,
We developed an app for our use case and enabled logging by referring to the code in app/iscsi_tgt, but we are getting garbled text in /var/log/messages:
Oct 10 19:29:25 e07d07265.eu6sqa spdk[9371]: device_manager.c: 172:dm_set_log_level: *NOTICE*: Device manager log level:2
Oct 10 19:29:25 e07d07265.eu6sqa spdk[9371]: device_manager.c: 117:check_only_one_instance: *NOTICE*: notice
Oct 10 19:29:25 e07d07265.eu6sqa È<87>QXù^?[9371]: EAL: Probing VFIO support...
Oct 10 19:29:26 e07d07265.eu6sqa su[9368]: pam_unix(su:session): session closed for user root
Oct 10 19:29:26 e07d07265.eu6sqa Pë0^A[9371]: device_manager.c: 503:main: *ERROR*: Initializing NVMe Controllers
Oct 10 19:29:26 e07d07265.eu6sqa Pë0^A[9371]: EAL: PCI device 0000:03:00.0 on NUMA socket 0
Oct 10 19:29:26 e07d07265.eu6sqa Pë0^A[9371]: EAL: probe driver: 8086:953 spdk_nvme
Oct 10 19:29:26 e07d07265.eu6sqa Pë0^A[9371]: device_manager.c: 260:dm_probe_cb: *NOTICE*: Attaching to 0000:03:00.0
If we don't enable logging in our app and only dump the DPDK log, there is no issue.
Could you give me some help?
Th
3 years, 4 months
"context" of a bdev module
by Fenggang Wu
Hi,
I am working on implementing the vbdev_agg block device module, which
aggregates multiple base devices into one virtual bdev by means of striping.
I got confused about the "context" of a module when implementing the
vbdev_agg_get_ctx_size() function.
I've checked other bdev modules for the get_ctx_size_fn implementation.
Some return the size of an I/O struct (e.g. bdev_nvme, bdev_aio, etc.), and
some return the size of an io_channel pointer (e.g. vbdev_split, vbdev_gpt,
etc.).
I've read through the code and know that get_ctx_size is called in
spdk_bdev_get_max_ctx_size(). Still, I haven't figured out when this space is
allocated or what it is used for.
So my questions are:
1) When is this space of "context" allocated? What is it used for? Is it
channel-specific or I/O-specific? I would guess one of the two... Does it have
anything to do with the instance of the aggregated virtual device (i.e. the
struct representing it)?
2) If I were to define a context for the vbdev_agg module, what should it
look like?
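For concreteness, below is the kind of context I was imagining for 2). It is just a strawman with made-up names, based on my reading that whatever get_ctx_size() returns is reserved per spdk_bdev_io for the owning module to use.

/* Strawman per-I/O context for vbdev_agg (all names made up). */
struct vbdev_agg_io {
	int remaining;	/* child I/Os still outstanding on the base bdevs */
	int status;	/* sticky error status from any failed child I/O */
};

static int
vbdev_agg_get_ctx_size(void)
{
	return sizeof(struct vbdev_agg_io);
}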
Thanks!
Regards,
Fenggang
---------- Forwarded message ---------
From: Fenggang Wu <fenggang(a)cs.umn.edu>
Date: Tue, Oct 10, 2017 at 3:53 PM
Subject: Re: [SPDK] Understanding io_channel
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Cc: wuxx0835(a)umn.edu <wuxx0835(a)umn.edu>
Hi Jim,
Thank you very much for the great answer! It makes perfect sense to me.
This saves so much time.
On Mon, Oct 9, 2017 at 7:04 PM Harris, James R <james.r.harris(a)intel.com>
wrote:
> Hi Fenggang,
>
> > On Oct 9, 2017, at 12:04 PM, Fenggang Wu <fenggang(a)cs.umn.edu> wrote:
> >
> > Hi,
> >
> > I am new to SPDK and trying to develop an aggregated virtual block
> device module (vbdev_agg.c) that stripes across multiple base devices. I am
> having difficulty understanding
>
> Welcome to SPDK! An aggregated virtual block device module is interesting
> - will this do striping and/or concatenation?
>
Currently I am only considering striping, but I would expect an easy
extension from striping to concatenation.
> > the general physical meaning of the io_channel. And particularly in the
> vbdev_agg case, how can I define the io_channel for this aggregated device?
> Or more specifically, what is the right way to implement the
> get_io_channel function in the vbdev_agg module?
> >
> > My current understanding is each io_device can have many separate
> io_channels, each allocated for one thread. However, I/O requests issued to
> the bdev_agg will be forwarded to the base device’s io_channel anyway
> (after some offset translation), where the io_channel of the vbdev_agg is
> not used.
>
> The vbdev_agg I/O channel is basically a place for you to store the I/O
> channels for the base device *for that thread*.
>
> For example, an I/O channel for an nvme block device corresponds to an
> NVMe queue pair. If this nvme block device is accessed from two different
> threads, those two threads will have two separate I/O channels, each
> channel associated with its own NVMe queue pair. This ensures that all
> NVMe hardware accesses are done completely lock-free, since only one thread
> operates on any given queue pair.
>
> If you roll this up to your vbdev_agg block device, you may (and likely
> will) have it accessed from two or more different threads. So you will
> need vbdev_agg I/O channels which will effectively just be a placeholder
> for the underlying base block device I/O channels. You may have an I/O
> channel on thread 0, which contains pointers to the base bdev channels for
> thread 0. Another I/O channel on thread 1 will contain pointers to the
> base bdev channels for thread 1.
>
> >
> > I’ve tried to return NULL in the get_io_channel function of the
> vbdev_agg module. It works fine for the read, write, unmap, and flush
> functions I implemented in the vbdev_agg module. The vbdev_agg I/O
> functions (read, write, unmap, flush) forward the I/O request to the
> underlying base device by calling spdk_bdev_{read, write, unmap, flush}
> again to the corresponding base device after the offset translation
> (defined by striping). In the call back of the completion of the base
> devices, I call the completion of the io request for the agg device.
>
> I suspect that NULL is working in some of these cases, because you have
> already allocated I/O channels for the base device and have stored them in
> an internal global data structure. But this is just a hypothesis. This
> works OK for single-threaded use cases, but when a vbdev_agg block device
> gets accessed from multiple threads, you will need separate I/O channels
> for the base device too - one for each thread.
>
> >
> > However, other parts of the code sometimes generate a segmentation fault because
> of this NULL channel (e.g. in spdk_put_io_channel(), when accessing
> ch->channel, as ch is null pointer). So I am just wondering what is the
> right way of defining the vbdev_agg’s io_channel.
> >
> > Also I found in the comments in struct spdk_io_channel:
> > "Modules will allocate extra memory off the end of this structure to
> store references to hardware-specific references (i.e. NVMe queue pairs, or
> references to child device spdk_io_channels (i.e. virtual bdevs)."
> > However, I haven't figured out how to use it. Is there any code that
> exemplifies this?
>
> nvme is a good example - lib/bdev/nvme/bdev_nvme.c. Look for struct
> nvme_io_channel. This is the context buffer for an I/O channel for an nvme
> block device.
>
> Next look for the call to spdk_io_device_register(). The first parameter
> is a unique pointer (it just uses the address of the nvme_ctrlr structure)
> - it just needs to be a pointer value that we know is unique within the
> application. The next two parameters are function pointers for creating
> and destroying I/O channels for this device. The last parameter is the
> size of the per-I/O channel context buffer.
>
> The SPDK io_channel module keeps a reference count on all of the I/O
> channels. If there are multiple requests for an I/O channel for the NVMe
> controller on one thread, we do not need or want to allocate a separate
> NVMe queue pair for each request - we want them to share the same I/O
> channel. So the create function pointer gets called when the I/O channel
> is first allocated, and the destroy function pointer is not called until
> the last reference to the I/O channel is released.
>
> You will probably also want to look at bdev_nvme_create_cb and
> bdev_nvme_destroy_cb. The former shows how an I/O channel for this device
> is created - it allocates an I/O qpair for the NVMe controller and then
> starts a poller to poll for completions on that queue pair. The latter
> frees the I/O queue pair and stops the poller once the I/O channel is
> destroyed.
>
Now I have learned from the nvme module and register my agg_disk struct as
the void *io_device, or better named, the "unique pointer". Currently, space
for an array of io_channel pointers is allocated after the io_channel struct.
The io_channel pointers of the base devices are kept in that array; they are
acquired (get_io_channel(base_dev)) in the create_cb and released
(put_io_channel(base_dev)) in the destroy_cb.
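In code, what I have now looks roughly like the following. It is a simplified sketch with error handling omitted; MAX_AGG_BASE_BDEVS and the agg_disk fields are my own structures, and the exact spdk_bdev_get_io_channel()/spdk_io_device_register() signatures differ a little between SPDK versions (newer ones, for example, add a name argument to the registration).

#include "spdk/bdev.h"
#include "spdk/io_channel.h"

#define MAX_AGG_BASE_BDEVS 16	/* arbitrary limit for this sketch */

/* The aggregated device; its address doubles as the "unique pointer". */
struct agg_disk {
	int num_base;
	struct spdk_bdev_desc *base_desc[MAX_AGG_BASE_BDEVS];
};

/* Per-thread channel context: one base-bdev channel per base device. */
struct agg_io_channel {
	struct spdk_io_channel *base_ch[MAX_AGG_BASE_BDEVS];
};

static int
vbdev_agg_create_cb(void *io_device, void *ctx_buf)
{
	struct agg_disk *agg = io_device;
	struct agg_io_channel *ch = ctx_buf;
	int i;

	for (i = 0; i < agg->num_base; i++) {
		ch->base_ch[i] = spdk_bdev_get_io_channel(agg->base_desc[i]);
	}
	return 0;
}

static void
vbdev_agg_destroy_cb(void *io_device, void *ctx_buf)
{
	struct agg_disk *agg = io_device;
	struct agg_io_channel *ch = ctx_buf;
	int i;

	for (i = 0; i < agg->num_base; i++) {
		spdk_put_io_channel(ch->base_ch[i]);
	}
}

static void
vbdev_agg_register_io_device(struct agg_disk *agg)
{
	/* unique pointer, create/destroy callbacks, per-channel context size */
	spdk_io_device_register(agg, vbdev_agg_create_cb, vbdev_agg_destroy_cb,
				sizeof(struct agg_io_channel));
}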
>
> >
> > Any suggestions/hints will be appreciated. Thank you very much!
>
> If you would like to post your module to GerritHub, I’m sure you’d get
> some good review feedback from myself and others. Please note that this is
> a very active area of development right now. Your questions are really
> appreciated and will help us clarify where we need to improve on example
> code and documentation.
>
>
Personally I would like to share it or even make some contribution to the
community if possible. Yet I would have to double check with the industry
partner supporting my project to see their opinions.
Besides, I've also got a separate question about the context of the module.
I will ask it in a separate thread soon.
Thanks again!
-Fenggang
> Thanks,
>
> -Jim
>
>
> >
> > Regards,
> > Fenggang
> >
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk
>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
>
3 years, 4 months
Understanding io_channel
by Fenggang Wu
Hi,
I am new to SPDK and trying to develop an aggregated virtual block device
module (vbdev_agg.c) that stripes across multiple base devices. I am having
difficulty understanding the general physical meaning of an io_channel,
and particularly, in the vbdev_agg case, how I should define the io_channel
for this aggregated device. Or, more specifically, what is the right way to
implement the get_io_channel function in the vbdev_agg module?
My current understanding is that each io_device can have many separate
io_channels, each allocated for one thread. However, I/O requests issued to
the vbdev_agg will be forwarded to the base devices' io_channels anyway
(after some offset translation), so the io_channel of the vbdev_agg itself is
not used.
I've tried returning NULL from the get_io_channel function of the vbdev_agg
module. It works fine for the read, write, unmap, and flush functions I
implemented in the vbdev_agg module. These I/O functions forward the request
to the corresponding base device by calling spdk_bdev_{read, write, unmap,
flush} after the offset translation (defined by the striping). In the
completion callback of the base-device I/O, I complete the I/O request for
the agg device.
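To make the forwarding concrete, the read path I have in mind is roughly as follows. This is only a simplified sketch: agg_disk and agg_io_channel are my own bookkeeping structs, the I/O is assumed not to cross a stripe boundary, and the exact spdk_bdev_read()/completion-callback signatures vary a bit across SPDK versions.

#include "spdk/bdev.h"

/* Hypothetical bookkeeping kept by the vbdev_agg module. */
struct agg_disk {
	uint64_t stripe_size;			/* stripe size in bytes */
	int num_base;
	struct spdk_bdev_desc *base_desc[16];
};

struct agg_io_channel {
	struct spdk_io_channel *base_ch[16];	/* per-thread base-bdev channels */
};

/* Completion of the child I/O on the base device completes the parent I/O. */
static void
agg_child_read_done(struct spdk_bdev_io *child_io, bool success, void *cb_arg)
{
	struct spdk_bdev_io *parent_io = cb_arg;

	spdk_bdev_free_io(child_io);
	spdk_bdev_io_complete(parent_io, success ? SPDK_BDEV_IO_STATUS_SUCCESS :
						   SPDK_BDEV_IO_STATUS_FAILED);
}

static int
agg_read(struct agg_disk *agg, struct agg_io_channel *ch,
	 struct spdk_bdev_io *parent_io, void *buf, uint64_t offset, uint64_t len)
{
	/* Offset translation defined by the striping. */
	uint64_t stripe = offset / agg->stripe_size;
	int child = (int)(stripe % agg->num_base);
	uint64_t child_offset = (stripe / agg->num_base) * agg->stripe_size +
				(offset % agg->stripe_size);

	return spdk_bdev_read(agg->base_desc[child], ch->base_ch[child], buf,
			      child_offset, len, agg_child_read_done, parent_io);
}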
However, other parts of the code sometimes generate a segmentation fault
because of this NULL channel (e.g. in spdk_put_io_channel(), when accessing
ch->channel while ch is a NULL pointer). So I am just wondering what the
right way of defining the vbdev_agg's io_channel is.
Also I found in the comments in struct spdk_io_channel:
"Modules will allocate extra memory off the end of this structure to store
references to hardware-specific references (i.e. NVMe queue pairs, or
*references
to child device spdk_io_channels* (i.e. virtual bdevs)."
However, I haven't figured out how to use it. Is there any code that
exemplifies this?
Any suggestions/hints will be appreciated. Thank you very much!
Regards,
Fenggang
3 years, 4 months
Re: [SPDK] Looking for help with SPDK
by Victor Banh
Hi Jim
I have an SPDK NVMe-oF setup and keep getting errors with bigger block sizes when running fio randwrite tests on the client server.
I am using Ubuntu 16.04 with kernel version 4.12.0-041200-generic on both the target and the client.
DPDK is 17.08 and SPDK is 17.07.1.
Thanks
Victor
[ 1806.042843] Buffer I/O error on dev nvme2n1, logical block 0, async page read
[ 1806.042859] Buffer I/O error on dev nvme2n1, logical block 0, async page read
[ 1806.042868] Buffer I/O error on dev nvme2n1, logical block 0, async page read
[ 1806.042873] ldm_validate_partition_table(): Disk read failed.
[ 1806.042879] Buffer I/O error on dev nvme2n1, logical block 0, async page read
[ 1806.042886] Buffer I/O error on dev nvme2n1, logical block 0, async page read
[ 1806.042894] Buffer I/O error on dev nvme2n1, logical block 0, async page read
[ 1806.042902] Buffer I/O error on dev nvme2n1, logical block 0, async page read
[ 1806.042906] Dev nvme2n1: unable to read RDB block 0
[ 1806.042913] Buffer I/O error on dev nvme2n1, logical block 0, async page read
[ 1806.042920] Buffer I/O error on dev nvme2n1, logical block 0, async page read
[ 1806.042932] Buffer I/O error on dev nvme2n1, logical block 3, async page read
[ 1806.042947] nvme2n1: unable to read partition table
[ 1806.090850] ldm_validate_partition_table(): Disk read failed.
[ 1806.090872] Dev nvme2n1: unable to read RDB block 0
[ 1806.090905] nvme2n1: unable to read partition table
3 years, 4 months
Buffer I/O error on bigger block size running fio
by Victor Banh
Hi
I have an SPDK NVMe-oF setup and keep getting errors with bigger block sizes when running fio randwrite tests.
I am using Ubuntu 16.04 with kernel version 4.12.0-041200-generic on both the target and the client.
DPDK is 17.08 and SPDK is 17.07.1.
Thanks
Victor
[46905.233553] perf: interrupt took too long (2503 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
[48285.159186] blk_update_request: I/O error, dev nvme1n1, sector 2507351968
[48285.159207] blk_update_request: I/O error, dev nvme1n1, sector 1301294496
[48285.159226] blk_update_request: I/O error, dev nvme1n1, sector 1947371168
[48285.159239] blk_update_request: I/O error, dev nvme1n1, sector 1891797568
[48285.159252] blk_update_request: I/O error, dev nvme1n1, sector 10833824
[48285.159265] blk_update_request: I/O error, dev nvme1n1, sector 614937152
[48285.159277] blk_update_request: I/O error, dev nvme1n1, sector 1872305088
[48285.159290] blk_update_request: I/O error, dev nvme1n1, sector 1504491040
[48285.159299] blk_update_request: I/O error, dev nvme1n1, sector 1182136128
[48285.159308] blk_update_request: I/O error, dev nvme1n1, sector 1662985792
[48285.191185] nvme nvme1: Reconnecting in 10 seconds...
[48285.191254] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[48285.191291] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[48285.191305] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[48285.191314] ldm_validate_partition_table(): Disk read failed.
[48285.191320] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[48285.191327] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[48285.191335] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[48285.191342] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[48285.191347] Dev nvme1n1: unable to read RDB block 0
[48285.191353] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[48285.191360] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[48285.191375] Buffer I/O error on dev nvme1n1, logical block 3, async page read
[48285.191389] nvme1n1: unable to read partition table
[48285.223197] nvme1n1: detected capacity change from 1600321314816 to 0
[48289.623192] nvme1n1: detected capacity change from 0 to -65647705833078784
[48289.623411] ldm_validate_partition_table(): Disk read failed.
[48289.623447] Dev nvme1n1: unable to read RDB block 0
[48289.623486] nvme1n1: unable to read partition table
[48289.643305] ldm_validate_partition_table(): Disk read failed.
[48289.643328] Dev nvme1n1: unable to read RDB block 0
[48289.643373] nvme1n1: unable to read partition table
3 years, 4 months
Re: [SPDK] blobstore metadata questions, comments and potential issues
by Harris, James R
Hey Paul,
The masks on disk are solely an optimization during load. We only write the masks out to disk when a Blobstore is unloaded. If Blobstore is not unloaded cleanly, we can rebuild both masks with a full walk of all of the metadata pages. This avoids an extra page write on every metadata operation (create/delete/resize/xattr).
I agree that we should defer clearing the clean bit until the first metadata operation. That is not in the I/O path so an extra step on create/delete/resize/xattr would not be cumbersome.
-Jim
From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Paul E Luse <paul.e.luse(a)intel.com>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
Date: Monday, October 2, 2017 at 2:00 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: [SPDK] blobstore metadata questions, comments and potential issues
Guys-
Going through the finer details on MD handling (for the first time, so forgive misunderstandings) before looking more closely at Cunyin's patch (https://review.gerrithub.io/#/c/376470/ ), and I'm seeing some (IMHO) unexpected things that I wanted to cover real quick. Mostly they relate to my very early comments on the patch about how we never update the clean bit in MD.
Forget blobstore or the patch for a minute, in general I would argue that the most efficient way to manage clean/dirty metadata in a system like this is:
- When you init the system you are clean by definition so your on-disk and in-memory structs should match and your ‘clean’ bit should be 1 in both cases
- When you encounter an event that requires a metadata change:
o If your in-memory clean==0 do nothing wrt on disk sync’ing because you’re already dirty
o If your in-memory clean==1 then set it to 0 and write the on-disk clean to 0 so that (a) you know you don't need to update the on-disk clean bit with the next MD update and (b) you are protected from power fail because the on-disk clean bit is now 0. Do not sync metadata; this is just an update to that bit (or in our case a superblock-only write)
- In each of these following scenarios do a full sync of the metadata and set clean to 1 both in memory and on disk
o Just loaded or initialized the system
o Unloading the system
o Just recovered the system after loading and recognizing that clean was 0 so you rebuilt everything
So when I look through blobstore and see the following, I can certainly continue to review the patch as it builds on this, but I would like to follow it up with a patch to make it work like the above so that we don't have windows where we are dirty for no reason.
But before that, I also have a few questions about things I just don't understand.
- When we init a BS we claim the MD used cluster in memory but never sync. I assume this is not an oversight because clean is 0 (but if we do the stuff I mention above this would end with a superblock only update to set clean to 1)
- When we load a BS (with the patch), we are going to immediately change clean to 0 (assuming we didn't come up dirty) and then turn right around and not only update the superblock to set clean to 0 (I just explained how I feel about that), but also recalculate and rewrite to disk the masks that we just read, when nothing could have changed. If I read this correctly, besides my comment about us not actually being dirty yet, why are we updating the on-disk masks again?
- When an app calls spdk_bs_md_sync_blob after a resize or something, I don't see where the masks on disk immediately following the superblock are ever updated. It appears, at least, that we are only updating the metadata for the blob itself and the in-memory masks (in the bs struct) but not the on-disk masks. So any scenario where we do a resize & sync (or xattr and sync) and have a dirty shutdown, even with the recovery code, is going to end up corrupted, right? It seems like the on-disk masks that are stored just after the superblock need to be written as part of spdk_bs_md_sync_blob(), and from what I can tell they're only updated with unload today. This is, of course, independent of Cunyin's patch, but I wanted to ask about it since that's the context I'm looking at it from.
Sorry for the lengthy email and all the claims/questions. I'm certain I have misread at least some of this stuff, but w/o understanding these points I can't really responsibly review Cunyin's patch, which is critical I think… well, his patch is critical I mean, not my review of it ☺
-Paul
3 years, 4 months
blobstore metadata questions, comments and potential issues
by Luse, Paul E
Guys-
Going through the finer details on MD handling (for the first time, so forgive misunderstandings) before looking more closely at Cunyin's patch (https://review.gerrithub.io/#/c/376470/ ), and I'm seeing some (IMHO) unexpected things that I wanted to cover real quick. Mostly they relate to my very early comments on the patch about how we never update the clean bit in MD.
Forget blobstore or the patch for a minute, in general I would argue that the most efficient way to manage clean/dirty metadata in a system like this is:
- When you init the system you are clean by definition so your on-disk and in-memory structs should match and your 'clean' bit should be 1 in both cases
- When you encounter an event that requires a metadata change:
o If your in-memory clean==0 do nothing wrt on disk sync'ing because you're already dirty
o If your in-memory clean==1 then set it to 0 and write the on-disk clean to 0 so that (a) you know you don't need to update the on-disk clean bit with the next MD update and (b) you are protected from power fail because the on-disk clean bit is now 0. Do not sync metadata; this is just an update to that bit (or in our case a superblock-only write)
- In each of these following scenarios do a full sync of the metadata and set clean to 1 both in memory and on disk
o Just loaded or initialized the system
o Unloading the system
o Just recovered the system after loading and recognizing that clean was 0 so you rebuilt everything
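In code, the policy above boils down to something like this. It is purely an illustrative sketch, not actual blobstore code; superblock_write_clean_bit() and sync_all_metadata_and_masks() are made-up helpers standing in for the real superblock and metadata writes.

#include <stdbool.h>

/* Hypothetical helpers standing in for the real on-disk writes. */
void superblock_write_clean_bit(bool clean);	/* superblock-only write */
void sync_all_metadata_and_masks(void);		/* full MD + mask write-out */

struct bs_clean_state {
	bool clean_in_mem;	/* mirrors the on-disk superblock clean bit */
};

/* Call at the start of every create/delete/resize/xattr operation. */
void
bs_mark_dirty(struct bs_clean_state *bs)
{
	if (!bs->clean_in_mem) {
		return;				/* already dirty on disk; nothing to write */
	}
	bs->clean_in_mem = false;
	superblock_write_clean_bit(false);	/* bit-only update, not a full MD sync */
}

/* Call on init/load, on unload, and after recovering from a dirty shutdown. */
void
bs_full_sync(struct bs_clean_state *bs)
{
	sync_all_metadata_and_masks();
	superblock_write_clean_bit(true);
	bs->clean_in_mem = true;
}

The only per-operation cost is a one-time superblock write on the clean-to-dirty transition, and the masks only ever get rewritten as part of the full syncs.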
So when I look through blobstore and see the following, I can certainly continue to review the patch as it builds on this, but I would like to follow it up with a patch to make it work like the above so that we don't have windows where we are dirty for no reason.
But before that, I also have a few questions about things I just don't understand.
- When we init a BS we claim the MD used cluster in memory but never sync. I assume this is not an oversight because clean is 0 (but if we do the stuff I mention above this would end with a superblock only update to set clean to 1)
- When we load a BS (with the patch), we are going to immediately change clean to 0 (assuming we didn't come up dirty) and then turn right around and not only update the superblock to set clean to 0 (I just explained how I feel about that), but also recalculate and rewrite to disk the masks that we just read, when nothing could have changed. If I read this correctly, besides my comment about us not actually being dirty yet, why are we updating the on-disk masks again?
- When an app calls spdk_bs_md_sync_blob after a resize or something, I don't see where the masks on disk immediately following the superblock are ever updated. It appears, at least, that we are only updating the metadata for the blob itself and the in-memory masks (in the bs struct) but not the on-disk masks. So any scenario where we do a resize & sync (or xattr and sync) and have a dirty shutdown, even with the recovery code, is going to end up corrupted, right? It seems like the on-disk masks that are stored just after the superblock need to be written as part of spdk_bs_md_sync_blob(), and from what I can tell they're only updated with unload today. This is, of course, independent of Cunyin's patch, but I wanted to ask about it since that's the context I'm looking at it from.
Sorry for the lengthy email and all the claims/questions. I'm certain I have misread at least some of this stuff, but w/o understanding these points I can't really responsibly review Cunyin's patch, which is critical I think... well, his patch is critical I mean, not my review of it :)
-Paul
3 years, 4 months