On 23/11/18 10:27 AM, Wodkowski, PawelX wrote:
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces@lists.01.org] On Behalf Of Nikos Dragazis
> Sent: Thursday, November 22, 2018 7:52 PM
> To: spdk(a)lists.01.org; Stojaczyk, Dariusz <dariusz.stojaczyk(a)intel.com>
> Subject: Re: [SPDK] Questions about vhost memory registration
>
>
> On 12/11/18 1:48 PM, Wodkowski, PawelX wrote:
>>> -----Original Message-----
>>> From: SPDK [mailto:spdk-bounces@lists.01.org] On Behalf Of Nikos Dragazis
>>> Sent: Saturday, November 10, 2018 3:37 AM
>>> To: spdk(a)lists.01.org
>>> Subject: Re: [SPDK] Questions about vhost memory registration
>>>
>>> On 8/11/18 10:45 AM, Wodkowski, PawelX wrote:
>>>>> -----Original Message-----
>>>>> From: SPDK [mailto:spdk-bounces@lists.01.org] On Behalf Of Nikos Dragazis
>>>>> Sent: Thursday, November 8, 2018 1:49 AM
>>>>> To: spdk(a)lists.01.org
>>>>> Subject: [SPDK] Questions about vhost memory registration
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I would like to raise a couple of questions about vhost target.
>>>>>
>>>>> My first question is:
>>>>>
>>>>> During vhost-user negotiation, the master sends its memory regions to
>>>>> the slave. Slave maps each region in its own address space. The mmap
>>>>> addresses are page aligned (that is 4KB aligned) but not necessarily 2MB
>>>>> aligned. When vhost registers the memory regions in
>>>>> spdk_vhost_dev_mem_register(), it aligns the mmap addresses to 2MB
>>>>> here:
>>>> Yes, page aligned, but not PAGE_SIZE (4k) aligned. SPDK vhost requires that
>>>> the initiator pass memory backed by huge pages >= 2MB in size. On the x86
>>>> MMU this implies that page alignment is the same as page size, which is
>>>> >= 2MB (99% sure - can someone confirm this to get the remaining 1% ;) ).
>>> Yes, you are probably right. I didn’t know how the kernel achieves
>>> having a single page table entry for a contiguous 2MB virtual address
>>> range. If I get this right, in case of x86_64, the answer is using a
>>> page middle directory (PMD) entry pointing directly to a 2MB physical
>>> page rather than to a lower-level page table. And since the PMDs are 2MB
>>> aligned by definition, the resulting virtual address will be 2MB
>>> aligned.
>>>>>
>>>>> https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L534
>>>>>
>>>>> The aligned addresses may not have a valid page table entry. So, in case
>>>>> of uio, it is possible that during vtophys translation, the aligned
>>>>> addresses are touched here:
>>>>>
>>>>> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
>>>>>
>>>>> and this could lead to a segfault. Is this a possible scenario?
>>>>>
>>>>> My second question is:
>>>>>
>>>>> The commit message here:
>>>>>
>>>>>
>>>>> https://review.gerrithub.io/c/spdk/spdk/+/410071
>>>>>
>>>>> says:
>>>>>
>>>>>     “We've had cases (especially with vhost) in the past where we have
>>>>>     a valid vaddr but the backing page was not assigned yet.”.
>>>>>
>>>>> This refers to the vhost target, where shared memory is allocated by the
>>>>> QEMU process and the SPDK process maps this memory.
>>>>>
>>>>> Let’s consider this case. After mapping vhost-user memory regions, they
>>>>> are registered to the vtophys map. In case vfio is disabled,
>>>>> vtophys_get_paddr_pagemap() finds the corresponding physical addresses.
>>>>> These addresses must refer to pinned memory because vfio is not there to
>>>>> do the pinning. Therefore, VM’s memory has to be backed by hugepages.
>>>>> Hugepages are allocated by the QEMU process, way before vhost memory
>>>>> registration. After their allocation, hugepages will always have a
>>>>> backing page because they never get swapped out. So, I do not see any
>>>>> such case where backing page is not assigned yet and thus I do not see
>>>>> any need to touch the mapped page.
>>>>>
>>>>> This is my current understanding in brief and I'd welcome any feedback
>>>>> you may have:
>>>>>
>>>>> 1. address alignment in spdk_vhost_dev_mem_register() is buggy because
>>>>>    the aligned address may not have a valid page table entry thus
>>>>>    triggering a segfault when being touched in
>>>>>    vtophys_get_paddr_pagemap() -> rte_atomic64_read().
>>>>> 2. touching the page in vtophys_get_paddr_pagemap() is unnecessary
>>>>>    because VM’s memory has to be backed by hugepages and hugepages are
>>>>>    not handled by demand paging strategy and they are never swapped out.
>>>>>
>>>>> I am looking forward to your feedback.
>>>>>
>>>> The current start/end calculation in spdk_vhost_dev_mem_register() might
>>>> actually be a NOP for memory backed by hugepages.
>>> It seems so. However, there are other platforms that support hugepage
>>> sizes less than 2MB. I do not know if SPDK supports such platforms.
>> I think that currently only >=2MB HP are supported.
>>
>>>> I think that we can try to validate the alignment of the memory in
>>>> spdk_vhost_dev_mem_register() and fail if it is not 2MB aligned.
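The start/end rounding and the proposed alignment check can be sketched in a few lines of C. This is only an illustration: the `sketch_*` names and the `SKETCH_2MB` macro are hypothetical and are not the identifiers used in lib/vhost/vhost.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names, used only for this sketch. */
#define SKETCH_2MB      (UINT64_C(1) << 21)
#define SKETCH_2MB_MASK (SKETCH_2MB - 1)

/* Round an address down to the previous 2MB boundary. */
static inline uint64_t sketch_floor_2mb(uint64_t addr)
{
	return addr & ~SKETCH_2MB_MASK;
}

/* Round an address up to the next 2MB boundary (no-op if already aligned). */
static inline uint64_t sketch_ceil_2mb(uint64_t addr)
{
	return (addr + SKETCH_2MB_MASK) & ~SKETCH_2MB_MASK;
}

/* True when both the start and the length of a region are 2MB multiples. */
static inline bool sketch_region_is_2mb_aligned(uint64_t start, uint64_t len)
{
	return ((start | len) & SKETCH_2MB_MASK) == 0;
}
```

For a region that is already 2MB aligned, both rounding helpers return their input unchanged, which is the "NOP" case mentioned above. For a 4KB-aligned start such as 0x7f0000201000, the floor moves the start down to 0x7f0000200000, an address that the mapping may not cover.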
>>> This sounds reasonable to me. However, I believe it would be better if
>>> we could support registering non-2MB aligned virtual addresses. Is this
>>> a WIP? I have found this commit:
>>>
>>>
>>> https://review.gerrithub.io/c/spdk/spdk/+/427816/1
>>>
>>> It is not clear to me why the community has chosen 2MB granularity for
>>> the SPDK map tables.
>> SPDK vhost was created some time after the iSCSI and NVMf targets and it
>> needs to obey existing limitations. To be honest, vhost doesn't really need
>> to use huge pages, as this is a limitation of:
>>
>> 1. DMA - memory passed to DMA needs to be:
>>   - pinned memory - can't be swapped, physical address can't change
>>   - contiguous (VFIO complicates this case)
>>   - virtual address must have an assigned huge page so SPDK can discover
>>     its physical address
>>
>> 2. env_dpdk/memory
>>   this was implemented for NVMe drivers, which have the limitation that a
>>   single transaction can't span a 2MB address boundary - PRP has this
>>   limitation, I don't know if SGLs overcome this. This also required us to
>>   implement this in vhost:
>>
>> https://github.com/spdk/spdk/blob/master/lib/vhost/vhost.c#L462
>>
>> This is why 2MB granularity was chosen.
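To illustrate the 2MB boundary rule described above, here is a minimal C sketch that splits one buffer into iovecs so that no element crosses a 2MB boundary. It is not the SPDK function at the link above; the name `sketch_split_at_2mb` and its signature are assumptions made for this example, and it skips the guest-to-host address translation that the real code also has to do.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical constant for this sketch. */
#define SKETCH_2MB ((uintptr_t)1 << 21)

/*
 * Split [addr, addr + len) into iovec elements, none of which crosses a
 * 2MB boundary. Returns the number of elements written, or -1 if iov_cap
 * elements are not enough.
 */
static int sketch_split_at_2mb(uintptr_t addr, size_t len,
			       struct iovec *iov, int iov_cap)
{
	int n = 0;

	while (len > 0) {
		/* Bytes left until the next 2MB boundary. */
		size_t to_boundary = (size_t)(SKETCH_2MB - (addr & (SKETCH_2MB - 1)));
		size_t chunk = len < to_boundary ? len : to_boundary;

		if (n == iov_cap) {
			return -1;
		}
		iov[n].iov_base = (void *)addr;
		iov[n].iov_len = chunk;
		n++;
		addr += chunk;
		len -= chunk;
	}
	return n;
}
```

For example, a 0x3000-byte buffer starting at 0x3ff000 yields two elements: 0x1000 bytes up to the boundary at 0x400000, then the remaining 0x2000 bytes.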
> So, you are saying that vhost doesn’t really need to use huge pages. Are
It is possible for the SPDK vhost backend to be modified in a way that it won't
require hugepages. But again, when passing payload descriptors down to
physical devices the memory must be "good" for them. So if you use bdev_malloc
(without IOAT acceleration!) or bdev_aio as the backing device, the
hugepage-backed memory requirement disappears, as the host kernel will handle
all page faults for you. This is not true for other bdevs that use DMA, like
nvme.
Agreed.
> you referring to SPDK’s memory? This would make sense. And I think, this
> is also true for the nvme and virtio-scsi bdev modules, which I am
> currently using. In these cases, the storage backend performs zero-copy
> DMA directly from VM’s huge page backed memory. Is this correct?
For virtio-scsi bdev it is (might be) correct but not for nvme (bdev_nvme?).
Basically, I was referring to a local NVMe drive and I had the SPDK NVMe
PCIe driver in mind. I guess you say “no” for NVMe because the NVMe bdev
module handles both locally attached and remote NVMe drives. So, in case
of a locally attached NVMe drive, is the DMA operation zero-copy?
> As far as VM’s memory is concerned, is it true that huge page backed
> memory is just a limitation of uio? Is it necessary to use huge page
> backed memory for the VM in case of vfio?
>
This is a question that the VFIO kernel module developers could have an answer
for. But I bet $5 that it is NOT true. Let me write this again: memory
for DMA needs to be:
1. Pinned
2. vtophys(addr) translation needs to be possible during memory registration
3. vtophys(addr) must always return the same result for the same 'addr'
The kernel can do all of the above for any pages at any time, but in userspace
only hugepages guarantee all of these, so we are using them.
I think this is not true. I think that the vfio kernel module can do the
job. In case of x86 architecture with an IOMMU, the vfio kernel module
exposes an ioctl type called “VFIO_IOMMU_MAP_DMA”. This is used by SPDK
to register the user space memory that will be used for DMA. The vfio
serves this ioctl by basically doing two things:
- pin the registered user space memory. This means that this memory will
never get swapped out or moved to another physical address. This is
done here:
https://elixir.bootlin.com/linux/latest/source/drivers/vfio/vfio_iommu_ty...
- program the IOMMU. The kernel IOMMU driver will insert the appropriate
entries in the device IOVA domain in a way that the device will be
seeing this memory as contiguous. This means that the registered
memory, although it might be physically scattered, it will be mapped
to a contiguous IOVA segment. This is done here:
https://elixir.bootlin.com/linux/latest/source/drivers/vfio/vfio_iommu_ty...
So, I believe that the vfio kernel module satisfies the DMA memory
requirements you've already mentioned, but I will post a relevant
question in the vfio-users mailing list to get more feedback on this.
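For reference, the registration step described above looks roughly like this from user space. This is a minimal sketch that uses the structure and flags from <linux/vfio.h>; the wrapper name `sketch_vfio_map_dma` is hypothetical, and it assumes that `container_fd` is a VFIO container that already has a type1 IOMMU set up, which is not shown here.

```c
#include <linux/vfio.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Ask the vfio type1 IOMMU driver to pin [vaddr, vaddr + size) and map it
 * at the given IOVA for device reads and writes. Returns 0 on success and
 * -1 with errno set on failure, like ioctl() itself.
 */
static int sketch_vfio_map_dma(int container_fd, void *vaddr,
			       uint64_t iova, uint64_t size)
{
	struct vfio_iommu_type1_dma_map map;

	memset(&map, 0, sizeof(map));
	map.argsz = sizeof(map);
	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map.vaddr = (uint64_t)(uintptr_t)vaddr;
	map.iova = iova;
	map.size = size;

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
```

The caller picks the IOVA, so a physically scattered buffer can still be presented to the device as one contiguous IOVA range.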
There is an interesting article about DMA and memory here:
https://lwn.net/Articles/600502/
Maybe it can describe it better than me :)
Let me add one more question:
Why do the virtio-scsi and virtio-blk bdev modules not support the
VIRTIO_F_IOMMU_PLATFORM feature? Have you tested these two bdevs in
the presence of a vIOMMU in QEMU?
Here is the problematic scenario I have in mind:
Let’s say we have a VM with a vIOMMU and a virtio-scsi HBA with a couple
of SCSI disks which we want to use as storage backends for an SPDK
target app. The SPDK virtio-scsi bdev driver does not support the
VIRTIO_F_IOMMU_PLATFORM feature. This means that the device will always
bypass the vIOMMU for the DMA operations. So, in this case, physical
addresses must still be provided to the device by the SPDK virtio
driver, even though an IOMMU appears to be present. The problem is that
the virtio driver passes IOVAs instead of physical addresses. This is
done here:
https://github.com/spdk/spdk/blob/master/lib/virtio/virtio.c#L538
(Actually, it passes the address kept in vtophys map table. The vtophys
map keeps physical addresses in case vfio is disabled and IOVAs in case
vfio is enabled.)
>>>> Have you hit any segfault there?
>>> Yes. I will give you a brief description.
>>>
>>> As I have already announced here:
>>>
>>>
>>> https://lists.01.org/pipermail/spdk/2018-October/002528.html
>>>
>>> I am currently working on an alternative vhost-user transport. I am
>>> shipping the SPDK vhost target into a dedicated storage appliance VM.
>>> Inspired by this post:
>>>
>>>
>>> https://wiki.qemu.org/Features/VirtioVhostUser
>>>
>>> I am using a dedicated virtio device called “virtio-vhost-user” to
>>> extend the vhost-user control plane. This device intercepts the
>>> vhost-user protocol messages from the unix domain socket on the host and
>>> inserts them into a virtqueue. In case a SET_MEM_TABLE message arrives
>>> from the unix socket, it maps the memory regions set by the master and
>>> exposes them to the slave guest as an MMIO PCI memory region.
>>>
>>> So, instead of mapping hugepage backed memory regions, the vhost target,
>>> running in slave guest user space, maps segments of an MMIO BAR of the
>>> virtio-vhost-user device.
>>>
>>> Thus, in my case, the mapped addresses are not necessarily 2MB aligned.
>>> The segfault is happening in a specific test case. That is when I do
>>> “construct_vhost_scsi_controller” -> “remove_vhost_controller” ->
>>> “construct_vhost_scsi_controller”.
>>> In my code, this implies calling “spdk_pci_device_attach” ->
>>> “spdk_pci_device_detach” -> “spdk_pci_device_attach”
>>> which in turn implies “rte_pci_map_device” -> “rte_pci_unmap_device” ->
>>> “rte_pci_map_device”.
>>>
>>> During the first map, the MMIO BAR is always mapped to a 2MB aligned
>>> address (btw I can’t explain this, it can’t be a coincidence).
>>> However, this is not the case after the second map. The result is that I
>>> get a segfault when I register this non-2MB aligned address.
>>>
>>> So, I am seeking a solution. I think the best would be to support
>>> registering non-2MB aligned addresses. This would be useful in general,
>>> when you want to register an MMIO BAR, which is necessary in cases of
>>> peer-to-peer DMA. I know that there is a use case for peer-to-peer DMA
>>> between NVMe SSDs in SPDK. I wonder how you manage the 2MB alignment
>>> restriction in that case.
>> Anything that you don't pass to DMA doesn't need to be 2MB aligned. If you
>> read/write this using the CPU it doesn't need to be HP backed either.
>>
>> For DMA I think you will have to obey the memory limitations I wrote above.
>>
>> Adding Darek, he may have some more (up to date) knowledge.
> OK, let me get this a little bit more clear. The dataplane is unchanged.
> The vhost target passes all the received descriptor addresses to the
> underlying storage backend for DMA (after address translation and iovec
> splitting). What I did was just to change the way the vhost target
> accesses the VM’s memory.
>
> The previous case was that the vhost target was running on the host and
> it mapped the master vhost memory regions sent over the unix socket.
> These memory regions relied on huge pages on the host physical memory.
>
> The current case is that the vhost target is running inside a VM and
> needs to have access to the other VM’s memory lying on host hugetlbfs.
> Therefore, I use a special device called virtio-vhost-user, which maps
> the master vhost memory regions and exposes them to guest user space as
> an MMIO BAR. That’s how the vhost target has access to host hugetlbfs
> from guest user space.
>
> So, the current case is that the storage backend (say an emulated NVMe
> controller) performs peer-to-peer DMA from this MMIO BAR. This requires
> that the vhost target has registered this BAR to the vtophys map. And
> here is the problem because spdk_mem_register() requires the address to
> be 2MB aligned but the MMIO BAR is not necessarily mapped to a 2MB
> aligned virtual address.
>
> Currently, I am using a temporary solution. I am mapping all PCI BARs
> from all PCI devices to 2MB aligned virtual addresses. I think this is
> not going to trigger any implications, is it? The other solution, is to
Should be fine for 2MB huge pages. The mmap() might fail for hugepages >2MB.
> modify the env_dpdk library in order to allow registering non-2MB
> aligned addresses. Darek, in case you are reading this, I would
> appreciate any feedback at this point. I think you are working on this.
>
>>> Last but not least, in case you may know, I would appreciate if you
>>> could give me a situation where page touching in
>>> vtophys_get_paddr_pagemap() here:
>>>
>>>
>>> https://github.com/spdk/spdk/blob/master/lib/env_dpdk/vtophys.c#L287
>>> is necessary. Is this related to vhost exclusively? In case of vhost,
>>> the memory regions are backed by hugepages and these are not allocated
>>> on demand by the kernel. What am I missing?
>> When you mmap() a huge page you get a virtual address, but the actual
>> physical hugepage might not be assigned yet. We are touching each page
>> to force the kernel to assign the huge page to the virtual address so we
>> can discover the vtophys mapping.
>>
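The touch-then-lookup sequence described above can be sketched as follows. This is a simplified illustration and not the code in vtophys.c: the `sketch_*` names are hypothetical, and only the /proc/self/pagemap layout (bit 63 is "present", bits 0-54 hold the page frame number) comes from the kernel documentation. On recent kernels an unprivileged process reads the PFN field as zero.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical names for this sketch. */
#define SKETCH_PM_PRESENT  (UINT64_C(1) << 63)
#define SKETCH_PM_PFN_MASK ((UINT64_C(1) << 55) - 1)
#define SKETCH_PADDR_ERROR UINT64_MAX

/* Decode a raw pagemap entry into a physical address for vaddr. */
static uint64_t sketch_entry_to_paddr(uint64_t entry, uintptr_t vaddr,
				      uint64_t page_size)
{
	if (!(entry & SKETCH_PM_PRESENT)) {
		/* No backing page assigned (yet). */
		return SKETCH_PADDR_ERROR;
	}
	return (entry & SKETCH_PM_PFN_MASK) * page_size + (vaddr & (page_size - 1));
}

/*
 * Touch the page at vaddr so the kernel assigns a backing page, then read
 * its raw pagemap entry. Returns 0 on success, -1 on error.
 * The read faults (SIGSEGV) if vaddr is not mapped at all.
 */
static int sketch_read_pagemap_entry(const void *vaddr, uint64_t *entry)
{
	long page_size = sysconf(_SC_PAGESIZE);
	off_t offset;
	ssize_t n;
	int fd;

	/* The volatile read forces a page fault if the page is not present. */
	(void)*(const volatile unsigned char *)vaddr;

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0) {
		return -1;
	}
	/* One 64-bit entry per virtual page. */
	offset = (off_t)((uintptr_t)vaddr / (uintptr_t)page_size) * (off_t)sizeof(*entry);
	n = pread(fd, entry, sizeof(*entry), offset);
	close(fd);

	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

The comment on the volatile read is also the failure mode from the first question in this thread: if the address was rounded down to a 2MB boundary that lies outside the mapping, the touch itself is what raises the segfault.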
>>>>> Thanks,
>>>>> Nikos
>>>>>
>>>>> _______________________________________________
>>>>> SPDK mailing list
>>>>> SPDK(a)lists.01.org
>>>>>
>>>>> https://lists.01.org/mailman/listinfo/spdk