From: SPDK [mailto:email@example.com] On Behalf Of Nikos Dragazis
Sent: Saturday, November 10, 2018 3:37 AM
Subject: Re: [SPDK] Questions about vhost memory registration
On 8/11/18 10:45 a.m., Wodkowski, PawelX wrote:
>> -----Original Message-----
>> From: SPDK [mailto:firstname.lastname@example.org] On Behalf Of Nikos
>> Sent: Thursday, November 8, 2018 1:49 AM
>> To: spdk(a)lists.01.org
>> Subject: [SPDK] Questions about vhost memory registration
>> Hi all,
>> I would like to raise a couple of questions about vhost target.
>> My first question is:
>> During vhost-user negotiation, the master sends its memory regions to
>> the slave. The slave maps each region in its own address space. The mmap
>> addresses are page aligned (that is, 4KB aligned) but not necessarily 2MB
>> aligned. When vhost registers the memory regions in
>> spdk_vhost_dev_mem_register(), it aligns the mmap addresses to 2MB
>> boundaries.
> Yes, page aligned, but not just PAGE_SIZE (4k) aligned. SPDK vhost requires
> that you pass memory backed by huge pages >= 2MB in size. On the x86 MMU this
> implies that the page alignment is the same as the page size, which is >= 2MB
> (99% sure - can someone confirm this to get that +1% ;) ).
Yes, you are probably right. I didn’t know how the kernel achieves
having a single page table entry for a contiguous 2MB virtual address
range. If I get this right, in the case of x86_64, the answer is using a
page middle directory (PMD) entry pointing directly to a 2MB physical
page rather than to a lower-level page table. And since PMDs are 2MB
aligned by definition, the resulting virtual address will be 2MB aligned.
>> The aligned addresses may not have a valid page table entry. So, in the
>> case of uio, it is possible that during vtophys translation, the aligned
>> addresses are touched here:
>> and this could lead to a segfault. Is this a possible scenario?
>> My second question is:
>> The commit message here:
>> “We've had cases (especially with vhost) in the past where we have
>> a valid vaddr but the backing page was not assigned yet.”.
>> This refers to the vhost target, where shared memory is allocated by the
>> QEMU process and the SPDK process maps this memory.
>> Let’s consider this case. After mapping the vhost-user memory regions, they
>> are registered in the vtophys map. In case vfio is disabled,
>> vtophys_get_paddr_pagemap() finds the corresponding physical addresses.
>> These addresses must refer to pinned memory because vfio is not there to
>> do the pinning. Therefore, the VM’s memory has to be backed by hugepages.
>> Hugepages are allocated by the QEMU process, way before vhost
>> registration. After their allocation, hugepages will always have a
>> backing page because they never get swapped out. So, I do not see any
>> case where the backing page is not assigned yet, and thus I do not see
>> any need to touch the mapped page.
>> This is my current understanding in brief and I'd welcome any feedback
>> you may have:
>> 1. address alignment in spdk_vhost_dev_mem_register() is buggy:
>> the aligned address may not have a valid page table entry, thus
>> triggering a segfault when being touched in
>> vtophys_get_paddr_pagemap() -> rte_atomic64_read().
>> 2. touching the page in vtophys_get_paddr_pagemap() is unnecessary,
>> because the VM’s memory has to be backed by hugepages, and hugepages are
>> not handled by the demand paging strategy and are never swapped out.
>> I am looking forward to your feedback.
> The current start/end calculation in spdk_vhost_dev_mem_register() might be
> a NOP for memory backed by hugepages.
It seems so. However, there are other platforms that support hugepage
sizes less than 2MB. I do not know if SPDK supports such platforms.
I think that currently only >=2MB HP are supported.
> I think that we can try to validate the alignment of the memory in
> and fail if it is not 2MB aligned.
This sounds reasonable to me. However, I believe it would be better if
we could support registering non-2MB aligned virtual addresses. Is this
a WIP? I have found this commit:
It is not clear to me why the community has chosen 2MB granularity for
the SPDK map tables.
SPDK vhost was created some time after the iSCSI and NVMf targets and it
needs to obey the existing limitations. To be honest, vhost doesn't really
need to use huge pages, as this is a limitation of:
1. DMA - memory passed to DMA needs to be:
- pinned - it can't be swapped and its physical address can't change
- contiguous (VFIO complicates this case)
- backed by a virtual address with an assigned huge page so SPDK can discover
its physical address
This was implemented for the NVMe drivers, which have the limitation that a
single transaction can't span a 2MB address boundary - PRPs have this
limitation and I don't know if SGLs overcome it. This also required us to
implement this in vhost:
This is why 2MB granularity was chosen.
> Have you hit any segfault there?
Yes. I will give you a brief description.
As I have already announced here:
I am currently working on an alternative vhost-user transport. I am
shipping the SPDK vhost target into a dedicated storage appliance VM.
Inspired by this post:
I am using a dedicated virtio device called “virtio-vhost-user” to
extend the vhost-user control plane. This device intercepts the
vhost-user protocol messages from the unix domain socket on the host and
inserts them into a virtqueue. In case a SET_MEM_TABLE message arrives
from the unix socket, it maps the memory regions set by the master and
exposes them to the slave guest as an MMIO PCI memory region.
So, instead of mapping hugepage-backed memory regions, the vhost target,
running in slave guest user space, maps segments of an MMIO BAR of the
virtio-vhost-user device.
Thus, in my case, the mapped addresses are not necessarily 2MB aligned.
The segfault is happening in a specific test case. That is when I do
“construct_vhost_scsi_controller” -> “remove_vhost_controller” ->
“construct_vhost_scsi_controller”.
In my code, this implies calling “spdk_pci_device_attach” ->
“spdk_pci_device_detach” -> “spdk_pci_device_attach”,
which in turn implies “rte_pci_map_device” -> “rte_pci_unmap_device” ->
“rte_pci_map_device”.
During the first map, the MMIO BAR is always mapped to a 2MB aligned
address (btw I can’t explain this, it can’t be a coincidence).
However, this is not the case after the second map. The result is that I
get a segfault when I register this non-2MB aligned address.
So, I am seeking a solution. I think the best would be to support
registering non-2MB aligned addresses. This would be useful in general,
when you want to register an MMIO BAR, which is necessary in cases of
peer-to-peer DMA. I know that there is a use case for peer-to-peer DMA
between NVMe SSDs in SPDK. I wonder how you manage the 2MB alignment
restriction in that case.
Anything that you don't pass to DMA doesn't need to be 2MB aligned. If you
read/write it using the CPU, it doesn't need to be HP backed either.
For DMA, I think you will have to obey the memory limitations I wrote above.
Adding Darek, he may have some more (up to date) knowledge.
Last but not least, in case you know, I would appreciate it if you
could give me a situation where the page touching in
is necessary. Is this related to vhost exclusively? In the case of vhost,
the memory regions are backed by hugepages and these are not allocated
on demand by the kernel. What am I missing?
When you mmap() a huge page you get a virtual address, but the actual
physical hugepage might not be assigned yet. We touch each page
to force the kernel to assign the huge page to the virtual address so we
can discover its physical address.
>> SPDK mailing list