How to accelerate Ceph via SPDK
by We We
Hi
In the Ceph source code (https://github.com/ceph/ceph), we can see:
1. ceph/src/os/bluestore/BlockDevice.cc (https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlockDevice.cc):
#if defined(HAVE_SPDK)
  if (type == "ust-nvme") {
    return new NVMEDevice(cct, cb, cbpriv);
  }
#endif
There is a comment in the code which suggests this has no effect. I couldn't find anything else about accelerating Ceph via SPDK in the code,
so I guess no work has been done in BlueStore for this. Accelerating the Ceph OSD backend relies on BlueStore, which uses a new store to
implement a lockless, asynchronous, high-performance storage service.
2. spdk/lib/bdev/rbd/bdev_rbd.c (https://github.com/ceph/spdk/tree/7b7f2aa6854745caf6e2803133043132ca400285):
#include "spdk/conf.h”
#include "spdk/env.h”
#include "spdk/log.h”
#include "spdk/bdev.h”
#include "spdk/io_channel.h”
The source code includes a series of workflow steps for Ceph operations, such as rados_create(cluster, NULL),
rados_conf_read_file(*cluster, NULL), rados_connect(*cluster), spdk_io_channel_get_ctx(ch), etc. So there is
some acceleration of client I/O performance on a Ceph cluster in the bdev_rbd.c file,
achieved by using SPDK's poller mode instead of interrupt mode.
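For reference, here is a minimal sketch of that librados bring-up sequence using only the calls named above (error handling trimmed; this is an illustration, not the actual bdev_rbd.c code):

#include <rados/librados.h>

/* Sketch of the librados bring-up that bdev_rbd.c wraps; SPDK then drains
 * the rbd completions from a poller instead of waiting on interrupts. */
static int
connect_cluster(rados_t *cluster)
{
	int rc;

	rc = rados_create(cluster, NULL);          /* NULL = default client id */
	if (rc < 0) {
		return rc;
	}

	rc = rados_conf_read_file(*cluster, NULL); /* NULL = default ceph.conf search */
	if (rc < 0) {
		rados_shutdown(*cluster);
		return rc;
	}

	return rados_connect(*cluster);            /* connect to the cluster */
}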
Therefore, we can guess that SPDK does not work in the OSD backend, but rather in front of the OSD. And in the latest
version of SPDK, blobstore has arrived. It also implements a lockless, asynchronous, high-performance storage service.
Can we replace BlueStore with blobstore in the OSD backend via SPDK?
Am I wrong? Could someone help me out and tell me more details?
Thanks.
Helloway
3 years, 6 months
SPDK perf starting I/O failed
by Oza Oza
Hi All,
The SPDK perf test fails with a queue size of more than 8187.
Test procedure:
1. Increase the number of huge pages to 4096, which means the total huge page
memory reserved is 4096 * 2048 KB = 8 GB:
$ echo 4096 > /proc/sys/vm/nr_hugepages
2. Run the perf test with a queue size of 8188 (or above) and DPDK memory
allocation of around 6 GB:
/usr/share/spdk/examples/nvme/perf -r 'trtype:PCIe traddr:0001:01:00.0' -q
8188 -s 2048 -w read -d 6144 -t 30 -c 0x1
3. Observe that the test fails at "spdk_nvme_ns_cmd_read" for one request (it
seems 8187 requests succeed; any number above that fails).
4. Observe that the application also hangs.
5. Run the same test with a queue size of 8187 and observe that the test passes:
/usr/share/spdk/examples/nvme/perf -r 'trtype:PCIe traddr:0001:01:00.0' -q
8187 -s 2048 -w read -d 6144 -t 30 -c 0x1
THE COMMAND OUTPUT IS:
root@bcm958742t:~# /usr/share/spdk/examples/nvme/perf -r 'trtype:PCIe
traddr:0001:01:00.0' -q 8188 -s 2048 -w read -d 6144 -t 30 -c 0x1
Starting DPDK 16.11.1 initialization...
[ DPDK EAL parameters: perf -c 1 -m 6144 --file-prefix=spdk_pid10557 ]
EAL: Detected 8 lcore(s)
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: cannot open /proc/self/numa_maps, consider that all memory is in
socket_id 0
Initializing NVMe Controllers
EAL: PCI device 0001:01:00.0 on NUMA socket 0
EAL: probe driver: 8086:953 spdk_nvme
EAL: using IOMMU type 1 (Type 1)
[151516.159960] vfio-pci 0001:01:00.0: enabling device (0400 -> 0402)
[151516.272405] vfio_ecap_init: 0001:01:00.0 hiding ecap 0x19@0x2a0
Attaching to NVMe Controller at 0001:01:00.0 [8086:0953]
Attached to NVMe Controller at 0001:01:00.0 [8086:0953]
Associating INTEL SSDPEDMW400G4 (CVCQ6433008P400AGN ) with lcore 0
Initialization complete. Launching workers.
Starting thread on core 0
starting I/O failed
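I also wondered whether the ~8K boundary is a per-queue-pair resource limit rather than a huge page shortage. If so, would the right place to experiment be the probe callback, something like the sketch below? (io_queue_size and io_queue_requests are fields of spdk_nvme_ctrlr_opts; the exact fields available may differ between SPDK versions, and this is a guess, not a verified fix.)

#include <stdbool.h>
#include "spdk/nvme.h"

/* Sketch only: raise per-qpair limits before the controller is attached. */
static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	 struct spdk_nvme_ctrlr_opts *opts)
{
	opts->io_queue_size = 8192;          /* requested submission queue depth */
	opts->io_queue_requests = 2 * 8192;  /* request objects backing the qpair */
	return true;                         /* attach this controller */
}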
Regards,
Oza.
3 years, 6 months
NVMF initiator error: Failed to open /dev/nvme-fabrics: No such file or directory
by Sunil Vettukalppurathu
Hi,
I am trying to run initiator on machine and the target on another machine
using rdma.
Once the controller is initialized (qpairs are created), it fails in
add_ctrl() saying:
*Failed to open /dev/nvme-fabrics: No such file or directory*
Who should be creating this file: /dev/nvme-fabrics ?
Looking at the code, it is not created anywhere - at least I cannot see it.
If I load the Linux nvme_fabrics module, /dev/nvme-fabrics is created. But
then it would be taking the Linux route.... right?
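For context, my understanding (which may be wrong) is that the SPDK user-space host addresses the target through a transport ID rather than /dev/nvme-fabrics, roughly like the sketch below. The address, port, and subsystem NQN here are placeholders, not my real setup:

#include <string.h>
#include <stdio.h>
#include "spdk/nvme.h"

/* Sketch: user-space NVMe-oF host connecting over RDMA via a transport ID.
 * Placeholder traddr/trsvcid/subnqn; probe_cb/attach_cb omitted. */
static void
build_fabrics_trid(struct spdk_nvme_transport_id *trid)
{
	memset(trid, 0, sizeof(*trid));
	trid->trtype = SPDK_NVME_TRANSPORT_RDMA;
	trid->adrfam = SPDK_NVMF_ADRFAM_IPV4;
	snprintf(trid->traddr, sizeof(trid->traddr), "%s", "192.168.1.10");
	snprintf(trid->trsvcid, sizeof(trid->trsvcid), "%s", "4420");
	snprintf(trid->subnqn, sizeof(trid->subnqn), "%s",
		 "nqn.2016-06.io.spdk:cnode1");
}

/* spdk_nvme_probe(&trid, NULL, probe_cb, attach_cb, NULL) would then attach
 * the remote controller without touching /dev/nvme-fabrics. */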
Thanks for the help. Appreciate it.
--sunil
3 years, 6 months
Announcing the very first SPDK Developer Meetup!
by Luse, Paul E
Come and join the very first SPDK Developer Meetup! It's an excellent opportunity for networking, learning and making forward progress on the code and generally making the community more productive.
There won't be any presentations at this meeting, instead be prepared for the following types of activities:
* open discussion / brainstorming new features or process improvements
* white board design work
* code walkthroughs
* focused work on getting specific patches approved
We'll kick off the meeting with some agenda-building work and then break off into as many groups as make sense based on the topics that have the most interest.
Seating is limited so please RSVP as soon as possible!! We're scheduling things to try and make it possible for most to fly in Monday morning and out Wednesday afternoon. Intel will be providing lunch and snacks Tue/Wed and we'll all go out for a group dinner Tue evening.
DATES: Mon 11/6 start at 1:00 through Wed 11/8 end at 1:00
Evite Link: http://evite.me/DTTUNyNGw4
See you there!
Paul
3 years, 6 months
SPDK golang bindings?
by Hao Luo
Hi, all,
Does the current SPDK provide Go (golang) bindings? Or are they on the roadmap?
Thanks.
Hao
3 years, 6 months
NVMe driver use of PAGE_SIZE?
by Lance Hartmann ORACLE
I’m writing with hopes of gaining an understanding of the SPDK NVMe driver’s use of PAGE_SIZE. I understand from the NVMe specification that during initialization the driver may read the NVMe Controller Capabilities (CAP) register and learn of the controller’s minimum (MPSMIN) and maximum (MPSMAX) supported memory page sizes. Then, during initialization, the driver writes the memory page size (MPS) field in the NVMe Controller Configuration (CC) register, with that value being used for the PRP entry size. I’ve walked through the SPDK NVMe driver code where it breaks up a transfer into multiple PRP entries, etc. Here and there, I see the common PAGE_SIZE macro employed. This is defined on my system as 4096. All fine and good, except… how does this figure in with hugepages? examples/nvme/perf/perf.c allocates its buffer using spdk_dma_zmalloc(), and the SPDK NVMe driver itself also makes use of this call to allocate memory. This is built atop DPDK’s EAL layer such that it invokes rte_malloc_socket(). Assuming that the memory is allocated from the hugepages (2MB) heap, then obviously its corresponding page size is considerably larger than PAGE_SIZE.
Is it fair to assume that the SPDK NVMe driver’s use of PAGE_SIZE in various places is “safe” as long as it’s less than the actual page size (assuming, again, that the actually allocated page at that time is from a 2MB page)? I’m curious to learn if perhaps this was a driver implementation decision, as it would be difficult for the driver to know (or try to learn, especially within a “reasonable amount of time”) the page size of a memory buffer supplied to it. Continuing with that line of logic, the driver code itself would then have additional challenges (e.g. how the trackers are statically built with a 4KB page in mind). I hope this doesn’t come across as picking apart the driver. Far from it! It’s beautiful work! My questions are purely an attempt to understand how these pieces work together.
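My own working assumption (please correct me) is that describing a transfer at PAGE_SIZE granularity stays legal inside a 2MB hugepage, because the hugepage is itself 4KB-aligned and every 4KB chunk within it is physically contiguous. Something along these lines, purely illustrative and not the driver's actual code:

#include <stdint.h>

/* Illustrative only: count the 4 KiB PRP entries needed for a transfer of
 * len bytes starting at vaddr. A buffer carved out of a 2 MiB hugepage is
 * still described in 4 KiB PRP entries; the first entry may be unaligned,
 * the remainder fall on 4 KiB boundaries. */
static uint32_t
prp_entries_needed(uintptr_t vaddr, uint32_t len, uint32_t page_size)
{
	uint32_t first_page_offset = (uint32_t)(vaddr & (page_size - 1));

	return (first_page_offset + len + page_size - 1) / page_size;
}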
thanks,
--
Lance Hartmann
lance.hartmann(a)oracle.com
3 years, 6 months
SPC 3/4 support?
by Mark Kim
Hello,
Is there a plan to support SCSI Reservation/PR and ALUA?
Thanks,
Mark Kim
3 years, 6 months
An issue with maximum write latency when we access the same block consecutively.
by 储
Hi, all
Recently, we used a demo to observe latency.
We find that when we access the same block consecutively, maximum latencies occur more frequently.
Additionally, they can reach 2-3 ms and show a periodic pattern.
Why?
(1) For the same block, the latency of the first access is about 10-12 μs, while the second, third
and fourth accesses can reach 700-900 μs, even 2-3 ms.
I want to know why this difference between the first access and the later ones exists.
(2) Why do the maximums of 2-3 ms show a periodic pattern?
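For reference, the per-access timing can be taken roughly like the sketch below (ns, qpair and buf are assumed to be set up elsewhere; this is a simplified illustration, not our exact demo code):

#include <stdbool.h>
#include <stdio.h>
#include "spdk/nvme.h"
#include "spdk/env.h"

/* Sketch: time N back-to-back reads of the same LBA; error handling omitted. */
static void
read_done(void *arg, const struct spdk_nvme_cpl *cpl)
{
	*(bool *)arg = true;
}

static void
time_same_block(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair,
		void *buf, uint64_t lba, int repeats)
{
	uint64_t hz = spdk_get_ticks_hz();

	for (int i = 0; i < repeats; i++) {
		bool done = false;
		uint64_t start = spdk_get_ticks();

		spdk_nvme_ns_cmd_read(ns, qpair, buf, lba, 1, read_done, &done, 0);
		while (!done) {
			spdk_nvme_qpair_process_completions(qpair, 0);
		}
		printf("access %d: %.1f us\n", i + 1,
		       (spdk_get_ticks() - start) * 1e6 / hz);
	}
}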
Best wishes,
Jiajia Chu
3 years, 6 months
Determination of NVMe max_io_xfer_size (NVME_MAX_XFER_SIZE) ?
by Lance Hartmann ORACLE
Hello,
I’m trying to reconcile the #define NVME_MAX_XFER_SIZE and leading comment:
/*
* For commands requiring more than 2 PRP entries, one PRP will be
* embedded in the command (prp1), and the rest of the PRP entries
* will be in a list pointed to by the command (prp2). This means
* that real max number of PRP entries we support is 506+1, which
* results in a max xfer size of 506*PAGE_SIZE.
*/
in lib/nvme/nvme_pcie.c with my interpretation from reading the NVMe spec. I’d greatly appreciate if someone could “show me the math” or otherwise help me to understand this. How was NVME_MAX_PRP_LIST_ENTRIES (506) derived? I don’t know if I’m lost in the semantics of the naming, the comment, or perhaps there’s a nuance in the “…we support…” part. I would’ve guessed, otherwise, that the max # of PRP entries would be a function of the PAGE_SIZE.
I did see that the driver in nvme_ctrlr_identify() compares this derived maximum transfer size with that which the controller can actually support as reported in the Identify Controller structure, choosing the minimum of the two values, but that’s understood and separate from the above.
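The only arithmetic I can reconstruct on my own is the following (my guess at the rationale, not taken from the SPDK sources):

/* Guesswork: with PAGE_SIZE = 4096,
 *   NVME_MAX_XFER_SIZE = 506 * 4096 = 2,072,576 bytes (just under 2 MiB),
 * and a PRP list of 506 8-byte entries occupies 4048 bytes, which together
 * with the rest of the per-request tracker bookkeeping looks as if it is
 * sized to fit a single 4 KiB allocation. Whether that tracker-sizing
 * constraint is the real reason for 506 is exactly my question above. */
#define NVME_MAX_PRP_LIST_ENTRIES	(506)
#define NVME_MAX_XFER_SIZE		(NVME_MAX_PRP_LIST_ENTRIES * PAGE_SIZE)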
regards,
--
Lance Hartmann
lance.hartmann(a)oracle.com
3 years, 6 months
nvme_ctrlr.c:1224:nvme_ctrlr_process_init: ***ERROR*** Initialization timed out in state 3
by Oza Oza
I have ported SPDK to ARMv8.
DPDK is compiled with version 16.11.1.
Controller initialization is failing.
root@ubuntu:/home/oza/SPDK/spdk#
odepth=128 --size=4G --readwrite=read --filename=0000.01.00.00/1 --bs=4096
--i
/home/oza/fio /home/oza/SPDK/spdk
EAL: pci driver is being registered 0x1nreadtest: (g=0): rw=read,
bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=spdk_fio, iodepth=128
fio-2.17-29-gf0ac1
Starting 1 process
Starting Intel(R) DPDK initialization ...
[ DPDK EAL parameters: fio -c 1 --file-prefix=spdk_pid6448
--base-virtaddr=0x1000000000 --proc-type=auto ]
EAL: Detected 8 lcore(s)
EAL: Auto-detected process type: PRIMARY
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: cannot open /proc/self/numa_maps, consider that all memory is in
socket_id 0
EAL: PCI device 0000:01:00.0 on NUMA socket 0
EAL: probe driver: 8086:953 spdk_nvme
EAL: using IOMMU type 1 (Type 1)
EAL: vfio_group_fd=11 iommu_group_no=3 *vfio_dev_fd=13
EAL: reg=0x2000 fd=13 cap_offset=0x50
EAL: the msi-x bar number is 0 0x2000 0x200
EAL: Hotplug doesn't support vfio yet
spdk_fio_setup() is being called
EAL: PCI device 0000:01:00.0 on NUMA socket 0
EAL: probe driver: 8086:953 spdk_nvme
EAL: vfio_group_fd=11 iommu_group_no=3 *vfio_dev_fd=16
EAL: reg=0x2000 fd=16 cap_offset=0x50
EAL: the msi-x bar number is 0 0x2000 0x200
EAL: inside pci_vfio_write_config offset=4
nvme_ctrlr.c:1224:nvme_ctrlr_process_init: ***ERROR*** Initialization
timed out in state 3
nvme_ctrlr.c: 403:nvme_ctrlr_shutdown: ***ERROR*** did not shutdown within
5 seconds
EAL: Hotplug doesn't support vfio yet
EAL: PCI device 0000:01:00.0 on NUMA socket 0
EAL: probe driver: 8086:953 spdk_nvme
EAL: vfio_group_fd=11 iommu_group_no=3 *vfio_dev_fd=18
EAL: reg=0x2000 fd=18 cap_offset=0x50
EAL: the msi-x bar number is 0 0x2000 0x200
EAL: Hotplug doesn't support vfio yet
EAL: Requested device 0000:01:00.0 cannot be used
spdk_nvme_probe() failed
Regards,
Oza.
3 years, 7 months